UCI ICS 278 - Document Clustering and Topic Extraction


ICS 278: Data Mining
Lecture 14: Document Clustering and Topic Extraction

Note: many of the slides on topic models were adapted from the presentation by Griffiths and Steyvers at the National Academy of Sciences Symposium on "Mapping Knowledge Domains", Beckman Center, UC Irvine, May 2003.

Slide titles:
• Text Mining
• Document Clustering
• Mixture Model Clustering
• Mixtures of Documents
• A Document Cluster
• Another Document Cluster
• A topic is represented as a (multinomial) distribution over words
• The basic model
• A better model
• A better model for documents
• History of multi-topic models
• A generative model for documents
• Bayesian inference
• Bayesian sampling
• Gibbs sampling
• A visual example: Bars
• Interpretable decomposition
• Bayesian model selection
• Back to the bars data set
• PNAS corpus preprocessing
• Running the algorithm
• A selection of topics
• Scientific syntax and semantics
• Semantic topics
• Syntactic classes
• End of presentation on topic models
(switch now to the Author-Topic Model)
• Recent Results on Author-Topic Models

Data Mining Lectures Lecture 12: Text Mining Padhraic Smyth, UC Irvine

Padhraic Smyth
Department of Information and Computer Science
University of California, Irvine

Text Mining
• Information Retrieval
• Text Classification
• Text Clustering
• Information Extraction

Document Clustering
• Set of documents D in term-vector form
  – no class labels this time
  – want to group the documents into K groups or into a taxonomy
  – each cluster hypothetically corresponds to a "topic"
• Methods:
  – any of the well-known clustering methods
  – K-means, e.g., "spherical k-means" (normalize the document vectors)
  – hierarchical clustering
  – probabilistic model-based clustering methods, e.g., mixtures of multinomials
• Single-topic versus multiple-topic models
  – extensions to author-topic models

Mixture Model Clustering

  p(x) = Σ_{k=1}^{K} α_k p(x | c_k, θ_k)

with a conditional-independence model for each component (often quite useful to first order):

  p(x | c_k, θ_k) = Π_{j=1}^{d} p(x_j | c_k, θ_{k,j})

Mixtures of Documents
[Figure: binary term-by-document matrix; documents generated by Component 1 and Component 2 each occupy a distinct block of terms.]
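As a concrete illustration of the two formulas above, here is a minimal sketch (hypothetical function and variable names, not from the slides) that scores one document under a mixture of multinomials, assuming the document is a term-count vector and all θ entries are strictly positive:

```python
import math

def mixture_log_prob(x, alphas, thetas):
    """Log-likelihood of one document under a mixture of multinomials.

    x      : term-count vector of length d
    alphas : mixture weights alpha_k (sum to 1)
    thetas : thetas[k][j] = p(term j | component k)
    The multinomial coefficient is omitted: it is constant across components,
    so it does not affect component comparisons.
    """
    comp_logs = []
    for alpha, theta in zip(alphas, thetas):
        ll = math.log(alpha)
        for count, p in zip(x, theta):
            if count:
                ll += count * math.log(p)   # sum_j x_j log p(x_j | c_k)
        comp_logs.append(ll)
    m = max(comp_logs)                       # log-sum-exp for stability
    return m + math.log(sum(math.exp(l - m) for l in comp_logs))
```

For example, with x = [2, 0, 1], equal weights [0.5, 0.5], and components [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]], the result equals log(0.5·0.7²·0.1 + 0.5·0.1²·0.7) = log(0.028).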
[Figure repeated: the same term-by-document matrix, first without component labels, then with cluster labels C1/C2 attached to the documents; the labels are treated as missing data.]

E-step: estimate the component membership probabilities P(C1 | x_i) and P(C2 | x_i) for each document, given the current parameter estimates.
M-step: use the "fractionally" weighted data to get new estimates of the parameters.

A Document Cluster

Most likely terms in Component 5 (weight = 0.08):

  TERM      p(t|k)
  write     0.571
  drive     0.465
  problem   0.369
  mail      0.364
  articl    0.332
  hard      0.323
  work      0.319
  system    0.303
  good      0.296
  time      0.273

Highest-lift terms in Component 5 (weight = 0.08):

  TERM      LIFT   p(t|k)   p(t)
  scsi      7.7    0.13     0.02
  drive     5.7    0.47     0.08
  hard      4.9    0.32     0.07
  card      4.2    0.23     0.06
  format    4.0    0.12     0.03
  softwar   3.8    0.21     0.05
  memori    3.6    0.14     0.04
  install   3.6    0.14     0.04
  disk      3.5    0.12     0.03
  engin     3.3    0.21     0.06

Another Document Cluster

Most likely terms in
Component 1 (weight = 0.11):

  TERM       p(t|k)
  articl     0.684
  good       0.368
  dai        0.363
  fact       0.322
  god        0.320
  claim      0.294
  apr        0.279
  fbi        0.256
  christian  0.256
  group      0.239

Highest-lift terms in Component 1 (weight = 0.11):

  TERM   LIFT   p(t|k)   p(t)
  fbi    8.3    0.26
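The LIFT column in these tables is simply p(t|k) / p(t): how much more likely a term is inside the cluster than in the corpus overall. A one-line sketch (hypothetical function name, not from the slides):

```python
def lift(p_t_given_k, p_t):
    """Lift of term t in cluster k: p(t|k) / p(t).
    Values well above 1 mean the term is far more probable inside
    the cluster than in the corpus as a whole."""
    return p_t_given_k / p_t
```

For example, lift(0.13, 0.02) = 6.5 for "scsi" in Component 5; the table prints 7.7 because the displayed probabilities are rounded, so recomputing lift from them only approximately reproduces the printed values.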


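The E-step and M-step described in the slides above can be combined into one EM iteration for the mixture-of-multinomials clustering model. The sketch below uses hypothetical names and an assumed small smoothing constant in the M-step (the slides do not specify one) to keep term probabilities strictly positive:

```python
import math

def em_step(docs, alphas, thetas, smooth=0.01):
    """One EM iteration for a mixture of multinomials (sketch).

    docs   : list of term-count vectors (one per document)
    alphas : current mixture weights
    thetas : thetas[k][j] = p(term j | component k)
    Returns updated (alphas, thetas).
    """
    K, d, n = len(alphas), len(docs[0]), len(docs)

    # E-step: P(component k | document x) from current parameters
    resp = []
    for x in docs:
        logs = [math.log(alphas[k]) +
                sum(c * math.log(thetas[k][j]) for j, c in enumerate(x) if c)
                for k in range(K)]
        m = max(logs)
        ws = [math.exp(l - m) for l in logs]
        z = sum(ws)
        resp.append([w / z for w in ws])

    # M-step: re-estimate parameters from the fractionally weighted data
    new_alphas = [sum(r[k] for r in resp) / n for k in range(K)]
    new_thetas = []
    for k in range(K):
        counts = [smooth + sum(resp[i][k] * docs[i][j] for i in range(n))
                  for j in range(d)]
        tot = sum(counts)
        new_thetas.append([c / tot for c in counts])
    return new_alphas, new_thetas
```

On two well-separated documents such as [[5, 0], [0, 5]] with two components, one iteration keeps the weights near [0.5, 0.5] while sharpening each component toward its own block of terms, mirroring the two-block matrix pictured in the "Mixtures of Documents" slides.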