Information Retrieval through Various Approximate Matrix Decompositions
Kathryn Linehan
Advisor: Dr. Dianne O'Leary

Outline
• Querying a Document Database
• Solutions
• Term-Document Matrix Approximation
• Matrix Approximation Validation
• Nonnegative Matrix Factorization (NMF)
• NMF Validation
• CUR Decomposition
• CUR Implementations
• Sampling
• Sampling Comparison
• Computation of U
• U Comparison
• Compact Matrix Decomposition (CMD) Improvement
• Deterministic CUR
• CUR Comparison
• Future Project Goals
• Precision and Recall
• Further Topics
• References

Querying a Document Database
We want to return documents that are relevant to entered search terms.
Given data:
• Term-document matrix A, where entry (i, j) is the importance of term i in document j
• Query vector q, where entry (i) is the importance of term i in the query

Solutions
Literal term matching:
• Compute the score vector s = q^T A
• Return the highest-scoring documents
• May not return relevant documents that do not contain the exact query terms
Latent Semantic Indexing (LSI):
• Same process as above, but use an approximation to A

Term-Document Matrix Approximation
The standard approximation used in LSI is the rank-k SVD.
Project goal: evaluate the use of term-document matrix approximations other than the rank-k SVD in LSI:
• Nonnegative Matrix Factorization (NMF)
• CUR Decomposition

Matrix Approximation Validation
Let Ã be an approximation to A. As the rank of Ã increases, we expect the relative error ‖A − Ã‖_F / ‖A‖_F to go to zero.
Matrix approximation can be applied to any matrix A:
• Preliminary test matrix A: 50 x 30 random sparse matrix
• Future test matrices: three large sparse term-document matrices

Nonnegative Matrix Factorization (NMF)
The term-document matrix is nonnegative. NMF computes A ≈ WH, where A is m x n, W is m x k, and H is k x n:
• W and H are nonnegative
• rank(WH) ≤ k

NMF
Multiplicative update algorithm of Lee and Seung found in [1]:
• Find W, H to minimize (1/2)‖A − WH‖_F^2
• Random initialization for W and H
• Convergence is not guaranteed, but in practice it is very common
• Slow due to the matrix multiplications in each iteration
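As a concrete illustration, a minimal sketch of the Lee–Seung multiplicative updates for the Frobenius-norm objective above; the function and variable names are ours, and [1] should be consulted for the algorithm's details and variants:

```python
import numpy as np

def nmf_multiplicative(A, k, iters=200, eps=1e-9, seed=0):
    """Approximate nonnegative A (m x n) by W @ H, with W (m x k), H (k x n),
    using Lee-Seung multiplicative updates for (1/2)||A - WH||_F^2."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))  # random nonnegative initialization
    H = rng.random((k, n))
    for _ in range(iters):
        # Elementwise multiplicative updates keep W and H nonnegative;
        # eps guards against division by zero.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Small demo on a random nonnegative matrix of the size used in the validation runs.
A = np.abs(np.random.default_rng(1).standard_normal((50, 30)))
W, H = nmf_multiplicative(A, k=10)
rel_err = np.linalg.norm(A - W @ H) / np.linalg.norm(A)
```

Note the cost per iteration is dominated by the matrix products involving A, which is why the run time grows with k in the validation plots.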
NMF Validation
A: 50 x 30 random sparse matrix; results averaged over 5 runs.
[Figure: NMF validation — relative error vs. k and run time vs. k]

CUR Decomposition
The term-document matrix is sparse. CUR computes A ≈ CUR, where A is m x n, C is m x c, U is c x r, and R is r x n:
• C (R) holds c (r) sampled and rescaled columns (rows) of A
• U is computed using C and R
• rank(CUR) ≤ k, where k is a rank parameter

CUR Implementations
CUR algorithm in [2] by Drineas, Kannan, and Mahoney:
• Linear-time algorithm
• Modification: use ideas in [3] by Drineas, Mahoney, and Muthukrishnan (no longer linear time)
• Improvement: Compact Matrix Decomposition (CMD) in [5] by Sun, Xie, Zhang, and Faloutsos
• Other modifications: our ideas
Deterministic CUR: code by G. W. Stewart

Sampling
Column (row) norm sampling [2]:
• Prob(col j) = ‖A(:, j)‖^2 / ‖A‖_F^2 (similar for row i)
Subspace sampling [3]:
• Uses the rank-k SVD of A for column probabilities: Prob(col j) = ‖V_k(j, :)‖^2 / k
• Uses the "economy size" SVD of C for row probabilities: Prob(row i) = ‖U_C(i, :)‖^2 / c
Sampling without replacement

Sampling Comparison
A: 50 x 30 random sparse matrix; results averaged over 5 runs.
Legend: sampling method, U computation, scaling (scaling noted only for sampling without replacement).
[Figure: CUR validation — relative error vs. k and run time vs. k for variants CN,L; S,L; w/o R,L,w/o Sc; w/o R,L,Sc]

Computation of U
• Linear-time U [2]: approximately solves min over Û of ‖A − CÛ‖_F, where Û = UR
• Optimal U: solves min over U of ‖A − CUR‖_F^2

U Comparison
A: 50 x 30 random sparse matrix; results averaged over 5 runs.
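The sampling and U-computation steps compared above can be sketched as follows. This is an illustrative reading, not the code used in the experiments: it uses column/row norm sampling with replacement as in [2], but computes the optimal U via pseudoinverses rather than the linear-time U (the function name is ours):

```python
import numpy as np

def cur_column_norm(A, c, r, seed=0):
    """CUR sketch: sample and rescale c columns and r rows of A with
    probabilities proportional to squared norms, then form the optimal
    U = C^+ A R^+ minimizing ||A - CUR||_F."""
    rng = np.random.default_rng(seed)
    fro2 = np.linalg.norm(A, "fro") ** 2
    pc = np.linalg.norm(A, axis=0) ** 2 / fro2  # Prob(col j) = ||A(:,j)||^2 / ||A||_F^2
    pr = np.linalg.norm(A, axis=1) ** 2 / fro2  # Prob(row i) = ||A(i,:)||^2 / ||A||_F^2
    jc = rng.choice(A.shape[1], size=c, p=pc)   # sample columns with replacement
    ir = rng.choice(A.shape[0], size=r, p=pr)   # sample rows with replacement
    C = A[:, jc] / np.sqrt(c * pc[jc])              # rescaled sampled columns
    R = A[ir, :] / np.sqrt(r * pr[ir])[:, None]     # rescaled sampled rows
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)   # optimal U
    return C, U, R

# Demo on a random nonnegative matrix of the size used in the validation runs.
A = np.abs(np.random.default_rng(1).standard_normal((50, 30)))
C, U, R = cur_column_norm(A, c=20, r=20)
rel_err = np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A)
```

With the optimal U, C U R is the projection of A onto the column space of C and row space of R, so the relative error never exceeds 1 and shrinks as c and r grow.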
Legend: sampling method, U computation.
[Figure: CUR validation — relative error vs. k and run time vs. k for variants CN,L; CN,O; S,L; S,O]

Compact Matrix Decomposition (CMD) Improvement
Remove repeated columns (rows) in C (R). This decreases storage while still achieving the same relative error [5].

A: 50 x 30 random sparse matrix, k = 15; results averaged over 10 runs.

Algorithm        [2]        [2] with CMD
Runtime          0.008060   0.007153
Storage          880.5      550.5
Relative error   0.820035   0.820035

Deterministic CUR
Code by G. W. Stewart:
• Uses an RRQR algorithm that does not store Q; we only need the permutation vector, which gives us the columns (rows) for C (R)
• Uses the optimal U

CUR Comparison
A: 50 x 30 random sparse matrix; results averaged over 5 runs.
Legend: sampling method, U computation, scaling (scaling noted only for sampling without replacement); D = deterministic CUR.
[Figure: CUR validation — relative error vs. k and run time vs. k for variants CN,L; CN,O; S,L; S,O; w/o R,L,w/o Sc; w/o R,L,Sc; D]

Future Project Goals
• Finish the investigation of CUR improvements
• Validate NMF and CUR using term-document matrices
• Investigate storage, computation time, and relative error of NMF and CUR
• Test the performance of NMF and CUR in LSI, using average precision and recall, where the average is taken over all queries in the data set

Precision and Recall
Measurements of performance for document retrieval. Let Retrieved = number of documents retrieved, Relevant = total number of documents relevant to the query, and RetRel = number of retrieved documents that are relevant.
Precision: P(Retrieved) = RetRel / Retrieved
Recall: R(Retrieved) = RetRel / Relevant

Further Topics
Time-permitting investigations:
• Parallel implementations of matrix approximations
• Testing the performance of matrix approximations in forming a multidocument summary

References
[1] Michael W. Berry, Murray Browne, Amy N. Langville, V. Paul Pauca, and Robert J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics and Data Analysis, 52(1):155-173, September 2007.
[2] Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition.
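To make the precision and recall definitions concrete, a tiny sketch; the document IDs and counts here are invented purely for illustration:

```python
def precision_recall(retrieved_ids, relevant_ids):
    """Precision = RetRel / Retrieved, Recall = RetRel / Relevant."""
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    retrel = len(retrieved & relevant)  # retrieved documents that are relevant
    return retrel / len(retrieved), retrel / len(relevant)

# Hypothetical query: 4 documents retrieved, 5 relevant overall, 3 in common.
p, r = precision_recall([1, 2, 3, 4], [2, 3, 4, 8, 9])
# p = 3/4 = 0.75, r = 3/5 = 0.6
```

Averaging these quantities over all queries in a data set gives the average precision and recall used to compare the approximations in LSI.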