DOC PREVIEW
CMU CS 15826 - Text - part III

This preview shows page 1-2 out of 7 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 7 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 7 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 7 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

C. Faloutsos1CMU SCS15-826: Multimedia Databases and Data MiningText - part IIIC. Faloutsos15-826 Copyright: C. Faloutsos (2005) 2CMU SCSOutlineGoal: ‘Find similar / interesting things’• Intro to DB• Indexing - similarity search• Data Mining15-826 Copyright: C. Faloutsos (2005) 3CMU SCSIndexing - Detailed outline• primary key indexing• secondary key / multi-key indexing• spatial access methods• fractals• text• multimedia• ...15-826 Copyright: C. Faloutsos (2005) 4CMU SCSText - Detailed outline• text– problem– full text scanning– inversion– signature files– clustering – information filtering and LSI15-826 Copyright: C. Faloutsos (2005) 5CMU SCSVector Space Model and Clustering• keyword queries (vs Boolean)• each document: -> vector (HOW?)• each query: -> vector• search for ‘similar’ vectors15-826 Copyright: C. Faloutsos (2005) 6CMU SCSVector Space Model and Clustering• main idea:document...data...aaronzoodataV (= vocabulary size)‘indexing’C. Faloutsos215-826 Copyright: C. Faloutsos (2005) 7CMU SCSVector Space Model and ClusteringThen, group nearby vectors together• Q1: cluster search?• Q2: cluster generation?Two significant contributions• ranked output• relevance feedback15-826 Copyright: C. Faloutsos (2005) 8CMU SCSVector Space Model and Clustering• cluster search: visit the (k) closest superclusters; continue recursivelyCS TRsMD TRs15-826 Copyright: C. Faloutsos (2005) 9CMU SCSVector Space Model and Clustering• ranked output: easy!CS TRsMD TRs15-826 Copyright: C. Faloutsos (2005) 10CMU SCSVector Space Model and Clustering• relevance feedback (brilliant idea) [Roccio’73]CS TRsMD TRs15-826 Copyright: C. Faloutsos (2005) 11CMU SCSVector Space Model and Clustering• relevance feedback (brilliant idea) [Roccio’73]• How?CS TRsMD TRs15-826 Copyright: C. Faloutsos (2005) 12CMU SCSVector Space Model and Clustering• How? A: by adding the ‘good’ vectors and subtracting the ‘bad’ onesCS TRsMD TRsC. Faloutsos315-826 Copyright: C. Faloutsos (2005) 13CMU SCSOutline - detailed• main idea• cluster search• cluster generation• evaluation15-826 Copyright: C. Faloutsos (2005) 14CMU SCSCluster generation• Problem:– given N points in V dimensions,– group them15-826 Copyright: C. Faloutsos (2005) 15CMU SCSCluster generation• Problem:– given N points in V dimensions,– group them15-826 Copyright: C. Faloutsos (2005) 16CMU SCSCluster generationWe need• Q1: document-to-document similarity• Q2: document-to-cluster similarity15-826 Copyright: C. Faloutsos (2005) 17CMU SCSCluster generationQ1: document-to-document similarity(recall: ‘bag of words’ representation)• D1: {‘data’, ‘retrieval’, ‘system’}• D2: {‘lung’, ‘pulmonary’, ‘system’}• distance/similarity functions?15-826 Copyright: C. Faloutsos (2005) 18CMU SCSCluster generationA1: # of words in commonA2: ........ normalized by the vocabulary sizesA3: .... etcAbout the same performance - prevailing one:cosine similarityC. Faloutsos415-826 Copyright: C. Faloutsos (2005) 19CMU SCSCluster generationcosine similarity:similarity(D1, D2) = cos(θ) = sum(v1,i* v2,i) / len(v1)/ len(v2)θD1D215-826 Copyright: C. Faloutsos (2005) 20CMU SCSCluster generationcosine similarity - observations:• related to the Euclidean distance• weights vi,j : according to tf/idfθD1D215-826 Copyright: C. Faloutsos (2005) 21CMU SCSCluster generationtf (‘term frequency’)high, if the term appears very often in this document.idf (‘inverse document frequency’)penalizes ‘common’ words, that appear in almost every document15-826 Copyright: C. Faloutsos (2005) 22CMU SCSCluster generationWe need• Q1: document-to-document similarity• Q2: document-to-cluster similarity?15-826 Copyright: C. Faloutsos (2005) 23CMU SCSCluster generation• A1: min distance (‘single-link’)• A2: max distance (‘all-link’)• A3: avg distance• A4: distance to centroid?15-826 Copyright: C. Faloutsos (2005) 24CMU SCSCluster generation• A1: min distance (‘single-link’)– leads to elongated clusters• A2: max distance (‘all-link’)– many, small, tight clusters• A3: avg distance– in between the above• A4: distance to centroid– fast to computeC. Faloutsos515-826 Copyright: C. Faloutsos (2005) 25CMU SCSCluster generationWe have• document-to-document similarity• document-to-cluster similarityQ: How to group documents into ‘natural’clusters15-826 Copyright: C. Faloutsos (2005) 26CMU SCSCluster generationA: *many-many* algorithms - in two groups [VanRijsbergen]:• theoretically sound (O(N^2))– independent of the insertion order• iterative (O(N), O(N log(N))15-826 Copyright: C. Faloutsos (2005) 27CMU SCSCluster generation - ‘sound’methods• Approach#1: dendrograms - create a hierarchy (bottom up or top-down) - choose a cut-off (how?) and cutcattiger horse cow0.10.30.815-826 Copyright: C. Faloutsos (2005) 28CMU SCSCluster generation - ‘sound’methods• Approach#2: min. some statistical criterion (eg., sum of squares from cluster centers)– like ‘k-means’– but how to decide ‘k’?15-826 Copyright: C. Faloutsos (2005) 29CMU SCSCluster generation - ‘sound’methods• Approach#3: Graph theoretic [Zahn]:– build MST;– delete edges longer than 3* std of the local average15-826 Copyright: C. Faloutsos (2005) 30CMU SCSCluster generation - ‘sound’methods• Result:• why ‘3’?• variations• Complexity?C. Faloutsos615-826 Copyright: C. Faloutsos (2005) 31CMU SCSCluster generation - ‘iterative’methodsgeneral outline:• Choose ‘seeds’ (how?)• assign each vector to its closest seed (possibly adjusting cluster centroid)• possibly, re-assign some vectors to improve clustersFast and practical, but ‘unpredictable’15-826 Copyright: C. Faloutsos (2005) 32CMU SCSCluster generation - ‘iterative’methodsgeneral outline:• Choose ‘seeds’ (how?)• assign each vector to its closest seed (possibly adjusting cluster centroid)• possibly, re-assign some vectors to improve clustersFast and practical, but ‘unpredictable’15-826 Copyright: C. Faloutsos (2005) 33CMU SCSCluster generationone way to estimate # of clusters k: the ‘cover coefficient’ [Can+] ~ SVD15-826 Copyright: C. Faloutsos (2005) 34CMU SCSOutline - detailed• main idea• cluster search• cluster generation• evaluation15-826 Copyright: C. Faloutsos (2005) 35CMU SCSEvaluation• Q: how to measure ‘goodness’ of one


View Full Document

CMU CS 15826 - Text - part III

Download Text - part III
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Text - part III and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Text - part III 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?