Information Retrieval through Various Approximate Matrix Decompositions
Kathryn Linehan
Advisor: Dr. Dianne O'Leary

Outline
• Querying a Document Database
• Solutions
• Term-Document Matrix Approximation
• Matrix Approximation Validation
• Nonnegative Matrix Factorization (NMF)
• NMF Validation
• CUR Decomposition
• CUR Implementations
• Sampling
• Sampling Comparison
• Computation of U
• U Comparison
• Compact Matrix Decomposition (CMD) Improvement
• Deterministic CUR
• CUR Comparison
• Future Project Goals
• Precision and Recall
• Further Topics
• References

Querying a Document Database
We want to return documents that are relevant to entered search terms.
Given data:
• Term-document matrix A, where entry (i, j) is the importance of term i in document j
• Query vector q, where entry (i) is the importance of term i in the query

Solutions
Literal term matching:
• Compute the score vector s = q^T A
• Return the highest-scoring documents
• May not return relevant documents that do not contain the exact query terms
Latent Semantic Indexing (LSI):
• Same process as above, but use an approximation to A

Term-Document Matrix Approximation
The standard approximation used in LSI is the rank-k SVD.
Project goal: evaluate the use of term-document matrix approximations other than the rank-k SVD in LSI:
• Nonnegative Matrix Factorization (NMF)
• CUR Decomposition

Matrix Approximation Validation
Let Ã be an approximation to A. As the rank of Ã increases, we expect the relative error ‖A − Ã‖_F / ‖A‖_F to go to zero.
Matrix approximation can be applied to any matrix A:
• Preliminary test matrix A: 50 x 30 random sparse matrix
• Future test matrices: three large sparse term-document matrices

Nonnegative Matrix Factorization (NMF)
The term-document matrix is nonnegative. NMF computes A ≈ WH, where A is m x n, W is m x k, and H is k x n:
• W and H are nonnegative
• rank(WH) ≤ k

NMF
Multiplicative update algorithm of Lee and Seung found in [1]:
• Find W, H to minimize (1/2)‖A − WH‖_F^2
• Random initialization for W and H
• Convergence is not guaranteed, but in practice it is very common
• Slow due to the matrix multiplications in each iteration
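As a concrete illustration, a minimal sketch of the Lee–Seung multiplicative updates for the Frobenius-norm objective above; the function and variable names are ours, and [1] should be consulted for the algorithm's details and variants:

```python
import numpy as np

def nmf_multiplicative(A, k, iters=200, eps=1e-9, seed=0):
    """Approximate nonnegative A (m x n) by W @ H, with W (m x k), H (k x n),
    using Lee-Seung multiplicative updates for (1/2)||A - WH||_F^2."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))  # random nonnegative initialization
    H = rng.random((k, n))
    for _ in range(iters):
        # Elementwise multiplicative updates keep W and H nonnegative;
        # eps guards against division by zero.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Small demo on a random nonnegative matrix of the size used in the validation runs.
A = np.abs(np.random.default_rng(1).standard_normal((50, 30)))
W, H = nmf_multiplicative(A, k=10)
rel_err = np.linalg.norm(A - W @ H) / np.linalg.norm(A)
```

Note the cost per iteration is dominated by the matrix products involving A, which is why the run time grows with k in the validation plots.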
NMF Validation
A: 50 x 30 random sparse matrix; results averaged over 5 runs.
[Figure: NMF validation — relative error vs. k and run time vs. k]

CUR Decomposition
The term-document matrix is sparse. CUR computes A ≈ CUR, where A is m x n, C is m x c, U is c x r, and R is r x n:
• C (R) holds c (r) sampled and rescaled columns (rows) of A
• U is computed using C and R
• rank(CUR) ≤ k, where k is a rank parameter

CUR Implementations
CUR algorithm in [2] by Drineas, Kannan, and Mahoney:
• Linear-time algorithm
• Modification: use ideas in [3] by Drineas, Mahoney, and Muthukrishnan (no longer linear time)
• Improvement: Compact Matrix Decomposition (CMD) in [5] by Sun, Xie, Zhang, and Faloutsos
• Other modifications: our ideas
Deterministic CUR: code by G. W. Stewart

Sampling
Column (row) norm sampling [2]:
• Prob(col j) = ‖A(:, j)‖^2 / ‖A‖_F^2 (similar for row i)
Subspace sampling [3]:
• Uses the rank-k SVD of A for column probabilities: Prob(col j) = ‖V_k(j, :)‖^2 / k
• Uses the "economy size" SVD of C for row probabilities: Prob(row i) = ‖U_C(i, :)‖^2 / c
Sampling without replacement

Sampling Comparison
A: 50 x 30 random sparse matrix; results averaged over 5 runs.
Legend: sampling method, U computation, scaling (scaling noted only for sampling without replacement).
[Figure: CUR validation — relative error vs. k and run time vs. k for variants CN,L; S,L; w/o R,L,w/o Sc; w/o R,L,Sc]

Computation of U
• Linear-time U [2]: approximately solves min over Û of ‖A − CÛ‖_F, where Û = UR
• Optimal U: solves min over U of ‖A − CUR‖_F^2

U Comparison
A: 50 x 30 random sparse matrix; results averaged over 5 runs.
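The sampling and U-computation steps compared above can be sketched as follows. This is an illustrative reading, not the code used in the experiments: it uses column/row norm sampling with replacement as in [2], but computes the optimal U via pseudoinverses rather than the linear-time U (the function name is ours):

```python
import numpy as np

def cur_column_norm(A, c, r, seed=0):
    """CUR sketch: sample and rescale c columns and r rows of A with
    probabilities proportional to squared norms, then form the optimal
    U = C^+ A R^+ minimizing ||A - CUR||_F."""
    rng = np.random.default_rng(seed)
    fro2 = np.linalg.norm(A, "fro") ** 2
    pc = np.linalg.norm(A, axis=0) ** 2 / fro2  # Prob(col j) = ||A(:,j)||^2 / ||A||_F^2
    pr = np.linalg.norm(A, axis=1) ** 2 / fro2  # Prob(row i) = ||A(i,:)||^2 / ||A||_F^2
    jc = rng.choice(A.shape[1], size=c, p=pc)   # sample columns with replacement
    ir = rng.choice(A.shape[0], size=r, p=pr)   # sample rows with replacement
    C = A[:, jc] / np.sqrt(c * pc[jc])              # rescaled sampled columns
    R = A[ir, :] / np.sqrt(r * pr[ir])[:, None]     # rescaled sampled rows
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)   # optimal U
    return C, U, R

# Demo on a random nonnegative matrix of the size used in the validation runs.
A = np.abs(np.random.default_rng(1).standard_normal((50, 30)))
C, U, R = cur_column_norm(A, c=20, r=20)
rel_err = np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A)
```

With the optimal U, C U R is the projection of A onto the column space of C and row space of R, so the relative error never exceeds 1 and shrinks as c and r grow.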
Legend: sampling method, U computation.
[Figure: CUR validation — relative error vs. k and run time vs. k for variants CN,L; CN,O; S,L; S,O]

Compact Matrix Decomposition (CMD) Improvement
Remove repeated columns (rows) in C (R). This decreases storage while still achieving the same relative error [5].

A: 50 x 30 random sparse matrix, k = 15; results averaged over 10 runs.

Algorithm        [2]        [2] with CMD
Runtime          0.008060   0.007153
Storage          880.5      550.5
Relative error   0.820035   0.820035

Deterministic CUR
Code by G. W. Stewart:
• Uses an RRQR algorithm that does not store Q; we only need the permutation vector, which gives us the columns (rows) for C (R)
• Uses the optimal U

CUR Comparison
A: 50 x 30 random sparse matrix; results averaged over 5 runs.
Legend: sampling method, U computation, scaling (scaling noted only for sampling without replacement); D = deterministic CUR.
[Figure: CUR validation — relative error vs. k and run time vs. k for variants CN,L; CN,O; S,L; S,O; w/o R,L,w/o Sc; w/o R,L,Sc; D]

Future Project Goals
• Finish the investigation of CUR improvements
• Validate NMF and CUR using term-document matrices
• Investigate storage, computation time, and relative error of NMF and CUR
• Test the performance of NMF and CUR in LSI, using average precision and recall, where the average is taken over all queries in the data set

Precision and Recall
Measurements of performance for document retrieval. Let Retrieved = number of documents retrieved, Relevant = total number of documents relevant to the query, and RetRel = number of retrieved documents that are relevant.
Precision: P(Retrieved) = RetRel / Retrieved
Recall: R(Retrieved) = RetRel / Relevant

Further Topics
Time-permitting investigations:
• Parallel implementations of matrix approximations
• Testing the performance of matrix approximations in forming a multidocument summary

References
[1] Michael W. Berry, Murray Browne, Amy N. Langville, V. Paul Pauca, and Robert J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics and Data Analysis, 52(1):155-173, September 2007.
[2] Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition.
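To make the precision and recall definitions concrete, a tiny sketch; the document IDs and counts here are invented purely for illustration:

```python
def precision_recall(retrieved_ids, relevant_ids):
    """Precision = RetRel / Retrieved, Recall = RetRel / Relevant."""
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    retrel = len(retrieved & relevant)  # retrieved documents that are relevant
    return retrel / len(retrieved), retrel / len(relevant)

# Hypothetical query: 4 documents retrieved, 5 relevant overall, 3 in common.
p, r = precision_recall([1, 2, 3, 4], [2, 3, 4, 8, 9])
# p = 3/4 = 0.75, r = 3/5 = 0.6
```

Averaging these quantities over all queries in a data set gives the average precision and recall used to compare the approximations in LSI.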