CMU SCS
15-826: Multimedia Databases and Data Mining
Text - part IV (LSI)
C. Faloutsos
15-826 Copyright: C. Faloutsos (2005)

Outline
Goal: ‘Find similar / interesting things’
• Intro to DB
• Indexing - similarity search
• Data Mining

Indexing - Detailed outline
• primary key indexing
• secondary key / multi-key indexing
• spatial access methods
• fractals
• text
• SVD: a powerful tool
• multimedia
• ...

Text - Detailed outline
• text
  – problem
  – full text scanning
  – inversion
  – signature files
  – clustering
  – information filtering and LSI

LSI - Detailed outline
• LSI
  – problem definition
  – main idea
  – experiments

Information Filtering + LSI
• [Foltz+,’92] Goal:
  – users specify interests (= keywords)
  – system alerts them, on suitable news-documents
• Major contribution: LSI = Latent Semantic Indexing
  – latent (‘hidden’) concepts

Information Filtering + LSI
Main idea
• map each document into some ‘concepts’
• map each term into some ‘concepts’
‘Concept’: ~ a set of terms, with weights, e.g.
  – “data” (0.8), “system” (0.5), “retrieval” (0.6) -> DBMS_concept

Information Filtering + LSI
Pictorially: term-document matrix (BEFORE)

        'data'   'system'   'retrieval'   'lung'   'ear'
TR1       1         1            1
TR2       1         1            1
TR3                                          1        1
TR4                                          1        1

Information Filtering + LSI
Pictorially: concept-document matrix and...

        'DBMS-concept'   'medical-concept'
TR1           1
TR2           1
TR3                              1
TR4                              1

Information Filtering + LSI
... and concept-term matrix

             'DBMS-concept'   'medical-concept'
data               1
system             1
retrieval          1
lung                                  1
ear                                   1

Information Filtering + LSI
Q: How to search, e.g., for ‘system’?

Information Filtering + LSI
A: find the corresponding concept(s); and the corresponding documents

             'DBMS-concept'   'medical-concept'
data               1
system             1
retrieval          1
lung                                  1
ear                                   1

        'DBMS-concept'   'medical-concept'
TR1           1
TR2           1
TR3                              1
TR4                              1

Information Filtering + LSI
Thus it works like an (automatically constructed) thesaurus:
we may retrieve documents that DON’T have the term ‘system’, but they contain almost everything else (‘data’, ‘retrieval’)

LSI - Detailed outline
• LSI
  – problem definition
  – main idea
  – experiments

LSI - Experiments
• 150 Tech Memos (TM) / month
• 34 users submitted ‘profiles’ (6-66 words per profile)
• 100-300 concepts

LSI - Experiments
• four methods, cross-product of:
  – vector-space or LSI, for similarity scoring
  – keywords or document-sample, for profile specification
• measured: precision/recall

LSI - Experiments
• LSI, with document-based profiles, was better
[Plot: precision vs. recall, with labeled points (0.25, 0.65), (0.50, 0.45), (0.75, 0.30)]

LSI - Discussion - Conclusions
• Great idea,
  – to derive ‘concepts’ from documents
  – to build a ‘statistical thesaurus’ automatically
  – to reduce dimensionality
• Often leads to better precision/recall
• but:
  – Needs ‘training’ set of documents
  – ‘concept’ vectors are not sparse anymore

LSI - Discussion - Conclusions
Observations
• Bellcore (-> Telcordia) has a patent
• used for multi-lingual retrieval
How exactly does SVD work?
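
As a rough illustration of the ‘thesaurus’ effect described above, the toy term-document matrix from the slides can be fed to a truncated SVD. This is only a sketch under several assumptions, not the procedure of [Foltz+,’92]: it uses NumPy, it assumes the 1s in the toy matrix sit under the DBMS terms for TR1 and TR2 and under the medical terms for TR3 and TR4, it adds a hypothetical document TR5 (not in the slides) that contains ‘data’ and ‘retrieval’ but not ‘system’, and the helper names lsi_scores and keyword_scores are made up for this example.

```python
import numpy as np

terms = ["data", "system", "retrieval", "lung", "ear"]
docs = ["TR1", "TR2", "TR3", "TR4", "TR5"]

# Rows = documents, columns = terms (binary occurrence).
# TR5 is a hypothetical extra document: it lacks 'system' on purpose.
A = np.array([
    [1, 1, 1, 0, 0],  # TR1
    [1, 1, 1, 0, 0],  # TR2
    [0, 0, 0, 1, 1],  # TR3
    [0, 0, 0, 1, 1],  # TR4
    [1, 0, 1, 0, 0],  # TR5 (assumed, for illustration only)
], dtype=float)

k = 2  # number of latent concepts to keep (assumed: DBMS and medical)

# Thin SVD: A = U * diag(s) * Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Each column is one concept expressed as weights over the terms.
term_to_concept = Vt[:k].T              # shape: (n_terms, k)
# Each row is one document expressed in concept space.
doc_concepts = A @ term_to_concept      # shape: (n_docs, k)


def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def keyword_scores(term):
    """Plain keyword match: 1.0 if the document contains the term, else 0.0."""
    col = terms.index(term)
    return {d: float(A[i, col]) for i, d in enumerate(docs)}


def lsi_scores(term):
    """Map a one-term query into concept space and score every document."""
    q = np.zeros(len(terms))
    q[terms.index(term)] = 1.0
    q_concepts = q @ term_to_concept
    return {d: cosine(q_concepts, doc_concepts[i]) for i, d in enumerate(docs)}


if __name__ == "__main__":
    print("keyword:", keyword_scores("system"))
    print("LSI:    ", lsi_scores("system"))
```

In this toy setup, a keyword match for ‘system’ gives TR5 a score of 0, while the concept-space score for TR5 is close to 1 and the scores for TR3 and TR4 are close to 0. Real collections are far noisier, and the choice of k (100-300 in the experiments above) matters much more there.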

Indexing - Detailed outline
• primary key indexing
• secondary key / multi-key indexing
• spatial access methods
• fractals
• text
• SVD: a powerful tool
• multimedia
• ...

References
• Foltz, P. W. and S. T. Dumais (Dec. 1992). "Personalized Information Delivery: An Analysis of Information Filtering Methods." Comm. of ACM (CACM) 35(12):

