Information Retrieval through Various Approximate Matrix Decompositions

Kathryn Linehan
Advisor: Dr. Dianne O'Leary

Information Retrieval
• Extracting information from databases
• We need an efficient way of searching large amounts of data
• Example: web search engine

Querying a Document Database
• We want to return documents that are relevant to entered search terms
• Given data:
  • Term-document matrix A; entry (i, j): importance of term i in document j
  • Query vector q; entry (i): importance of term i in the query

Term-Document Matrix
Entry (i, j): weight of term i in document j
Example (taken from [5]):

                Document
  Term       1    2    3    4
  Mark      15    0    0    0
  Twain     15    0   20    0
  Samuel     0   10    5    0
  Clemens    0   20   10    0
  Purple     0    0    0   20
  Fairy      0    0    0   15

Query Vector
Entry (i): weight of term i in the query
Example: search for "Mark Twain"

  Term       q
  Mark       1
  Twain      1
  Samuel     0
  Clemens    0
  Purple     0
  Fairy      0

Document Scoring
Scores are computed as A^T q; for the example above,

  A^T q = [30, 0, 20, 0]^T   (Docs 1, 2, 3, 4)

Doc 1 and Doc 3 will be returned as relevant, but Doc 2 will not.

Can we do better if we replace the matrix by an approximation?
• Singular Value Decomposition (SVD): A ≈ U Σ V^T
• Nonnegative Matrix Factorization (NMF): A ≈ W H
• CUR Decomposition: A ≈ C U R

Nonnegative Matrix Factorization (NMF)
A ≈ W H, where A is m x n, W is m x k, and H is k x n
• W and H are nonnegative
• rank(WH) <= k
• Storage: k(m + n) entries

NMF
• Multiplicative update algorithm of Lee and Seung found in [1]
• Find W, H to minimize (1/2) ||A - WH||_F^2
• Random initialization for W, H
• Gradient descent method
• Slow due to the matrix multiplications in each iteration

NMF Validation
[Figure: relative error (Frobenius norm) vs. k = rank(WH), rank(SVD) <= k, for NMF and the SVD. Left: A, a 5 x 3 random dense matrix, average over 5 runs. Right: B, a 500 x 200 random sparse matrix, average over 5 runs.]

NMF Validation
[Figure: relative error (Frobenius norm) vs. iteration number for NMF, with the SVD error for comparison. B: 500 x 200 random sparse matrix, rank(NMF) = 80.]

CUR Decomposition
A ≈ C U R, where A is m x n, C is m x c, U is c x r, and R is r x n
• C (R) holds c (r) sampled and rescaled columns (rows) of A
• U is computed using C and R
• rank(CUR) <= k, where k is a rank parameter
• Storage: nz(C) + cr + nz(R) entries

CUR Implementations
• CUR algorithm in [3] by Drineas, Kannan, and Mahoney
  • Linear time algorithm
  • Improvement: Compact Matrix Decomposition (CMD) in [6] by Sun, Xie, Zhang, and Faloutsos
  • Modification: use ideas in [4] by Drineas, Mahoney, and Muthukrishnan (no longer linear time)
  • Other modifications: our ideas
• Deterministic CUR code by G. W. Stewart [2]

Sampling
• Column (row) norm sampling [3]
  • Prob(col j) = ||A(:, j)||^2 / ||A||_F^2 (similar for row i)
• Subspace sampling [4]
  • Uses the rank-k SVD of A for column probabilities: Prob(col j) = ||V_k(j, :)||^2 / k
  • Uses the "economy size" SVD of C for row probabilities: Prob(row i) = ||U_C(i, :)||^2 / c
• Sampling without replacement

Computation of U
• Linear U [3]: approximately solves min_U ||A - CUR||_F, using the rank-k pseudoinverse ((C^T C)_k)^+
• Optimal U: solves min_U ||A - CUR||_F^2, giving U = (C^T C)^+ C^T A R^T (R R^T)^+

Deterministic CUR
• Code by G. W. Stewart [2]
• Uses an RRQR algorithm that does not store Q
  • We only need the permutation vector
  • Gives us the columns (rows) for C (R)
• Uses an optimal U

Compact Matrix Decomposition (CMD) Improvement
• Remove repeated columns (rows) in C (R)
• Decreases storage while still achieving the same relative error [6]

                   [3]        [3] with CMD
  Runtime          0.008060   0.007153
  Storage          880.5      550.5
  Relative error   0.820035   0.820035

A: 50 x 30 random sparse matrix, k = 15. Average over 10 runs.

CUR: Sampling with Replacement Validation
A: 5 x 3 random dense matrix. Average over 5 runs.
[Figure: three panels of relative error (Frobenius norm) vs. c/r, the number of columns/rows sampled. Legend "Sampling, U": CN,L; CN,O; S,L; S,O; SVD, where CN = column norm sampling, S = subspace sampling, L = linear U, O = optimal U.]

Sampling without Replacement: Scaling vs. No Scaling
• Idea: invert the scaling factor applied to C in the computation of U via ((C^T C)_k)^+

CUR: Sampling without Replacement Validation
A: 5 x 3 random dense matrix. Average over 5 runs.
[Figure: relative error (Frobenius norm) vs. k, with rank(CUR) <= k and rank(SVD) <= k, for sampling without replacement and linear U, with and without scaling (w/o R,L,w/o Sc; w/o R,L,Sc), compared to the SVD. Left panel: r = 2k, c = k; right panel: c = 3k, r = k.]
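The querying example from the earlier slides can be checked numerically. Below is a short NumPy sketch (the variable names are mine) using the term-document matrix and the "Mark Twain" query from the example taken from [5]; scoring is just the matrix-vector product A^T q.

```python
import numpy as np

# Term-document matrix A: rows = terms (Mark, Twain, Samuel, Clemens,
# Purple, Fairy), columns = documents 1-4.
# Entry (i, j) is the weight of term i in document j.
A = np.array([
    [15,  0,  0,  0],   # Mark
    [15,  0, 20,  0],   # Twain
    [ 0, 10,  5,  0],   # Samuel
    [ 0, 20, 10,  0],   # Clemens
    [ 0,  0,  0, 20],   # Purple
    [ 0,  0,  0, 15],   # Fairy
])

# Query vector for "Mark Twain": weight 1 on the terms Mark and Twain.
q = np.array([1, 1, 0, 0, 0, 0])

# Score each document: s = A^T q, giving 30, 0, 20, 0 for Docs 1-4.
scores = A.T @ q

# Documents with nonzero scores are returned as relevant (1-based numbers).
relevant = np.nonzero(scores)[0] + 1
print(scores, relevant)
```

As the slides note, Docs 1 and 3 are returned, while Doc 2 (which mentions only "Samuel Clemens") is missed; this synonymy problem is what motivates replacing A by a low-rank approximation.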
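As a companion to the CUR slides, here is a minimal sketch of CUR with column-norm sampling with replacement (as in [3]) combined with the optimal U = (C^T C)^+ C^T A R^T (R R^T)^+ = C^+ A R^+. The function name and the use of NumPy's pseudoinverse are my own choices; this is not the linear-time U of [3].

```python
import numpy as np

def cur_column_norm(A, c, r, rng=None):
    """CUR sketch: sample c columns and r rows of A with probability
    proportional to their squared norms, rescale them, and compute the
    optimal U = C^+ A R^+ (minimizes ||A - CUR||_F for fixed C, R)."""
    rng = np.random.default_rng(rng)
    m, n = A.shape

    # Column probabilities: Prob(col j) = ||A(:, j)||^2 / ||A||_F^2
    col_p = np.sum(A**2, axis=0) / np.sum(A**2)
    cols = rng.choice(n, size=c, replace=True, p=col_p)
    # Rescale each sampled column by 1/sqrt(c * p_j)
    C = A[:, cols] / np.sqrt(c * col_p[cols])

    # Row probabilities: Prob(row i) = ||A(i, :)||^2 / ||A||_F^2
    row_p = np.sum(A**2, axis=1) / np.sum(A**2)
    rows = rng.choice(m, size=r, replace=True, p=row_p)
    R = A[rows, :] / np.sqrt(r * row_p[rows])[:, None]

    # Optimal U (c x r): U = C^+ A R^+
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R
```

For a rank-1 matrix every sampled column (row) spans the column (row) space, so C U R reproduces A essentially exactly; for general A one would compare the relative error ||A - CUR||_F / ||A||_F against the SVD, as in the validation slides.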