4/21/09
Document Clustering
CISC489/689-010, Lecture #17
Monday, April 20th
Ben Carterette

Classification Review
• Items (documents, web pages, emails) are represented with features
• Some items are assigned a class from a fixed set
• Classification goal: use known class assignments to "learn" a general function f(x) for classifying new instances
• Naïve Bayes classifier: assign the class c that maximizes P(c) ∏_i P(w_i | c)

Clustering
• A set of algorithms that attempt to find latent (hidden) structure in a set of items
• Goal is to identify groups (clusters) of similar items
– Two items in the same group should be similar to one another
– An item in one group should be dissimilar to an item in another group

Clustering Example
• Suppose I gave you the shape, color, vitamin C content, and price of various fruits and asked you to cluster them
– What criteria would you use?
– How would you define similarity?
• Clustering is very sensitive to how items are represented and how similarity is defined!

Clustering in Two Dimensions
How would you cluster these points?

Classification vs Clustering
• Classification is supervised
– You are given a fixed set of classes
– You are given class labels for certain instances
– This is data you can use to learn the classification function
• Clustering is unsupervised
– You are not given any information about how documents should be grouped
– You don't even know how many groups there should be
– There is no training data to learn from
• One way to think of it: learning vs discovery

Clustering in IR
• Cluster hypothesis:
– "Closely associated documents tend to be relevant to the same requests" – van Rijsbergen '79
• Document clusters may capture relevance better than individual documents
• Clusters may capture "subtopics"

Cluster-Based Search

Yahoo! Hierarchy
[Figure: the Yahoo! directory hierarchy rooted at www.yahoo.com/Science, with top-level categories such as agriculture, biology, physics, CS, and space, and subcategories such as agronomy, forestry, botany, evolution, AI, HCI, and relativity]
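The point that clustering is sensitive to how similarity is defined can be made concrete with a small sketch. The term-weight vectors below are invented for illustration (the "jaguar" example echoes the cats-vs-cars ambiguity discussed later in these slides):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical term weights for three tiny "documents".
d1 = {"jaguar": 2.0, "cat": 1.0}
d2 = {"jaguar": 1.0, "car": 2.0}
d3 = {"cat": 2.0, "jaguar": 1.0}

print(cosine_similarity(d1, d2))  # shares only "jaguar" with d1
print(cosine_similarity(d1, d3))  # shares both terms, so scores higher
```

Any similarity defined this way depends entirely on the chosen representation: change the weights (or switch to Euclidean distance) and the relative ordering of document pairs can change.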
Not based on clustering approaches, but one possible use of clustering. Example from "Introduction to IR" slides by Hinrich Schutze

Clustering Algorithms
• General outline of clustering algorithms
1. Decide how items will be represented (e.g., feature vectors)
2. Define similarity measure between pairs or groups of items (e.g., cosine similarity)
3. Determine what makes a "good" clustering
4. Iteratively construct clusters that are increasingly "good"
5. Stop after a local/global optimum clustering is found
• Steps 3 and 4 differ the most across algorithms

Item Representation
• Typical representation for documents in IR:
– "Bag of words" – a vector of terms appearing in the document with associated weights
– N-grams
– etc.
• Any representation used in retrieval can (theoretically) be used in clustering or classification
– Though specialized representations may be better for particular tasks

Item Similarity
• Cluster hypothesis suggests that document similarity should be based on information content
– Ideally semantic content, but we have already seen how hard that is
• Instead, use the same idea as in query-based retrieval
– The score of a document for a query is based on how similar they are in the words they contain
• Cosine angle between vectors; P(R | Q, D); P(Q | D)
– The similarity of two documents will likewise be based on how similar they are in the words they contain

Document Similarity
[Figure: cosine similarity, Euclidean distance, and Manhattan distance between two document vectors D1 and D2]
"similarity" vs "distance": in practice, you can use either

What Makes a Good Cluster?
• Large vs small?
– Is it OK to have a cluster with one item?
– Is it OK to have a cluster with 10,000 items?
• Similarity between items?
– Is it OK for things in a cluster to be very far apart, as long as they are closer to each other than to things in other clusters?
– Is it OK for things to be so close together that other similar things are excluded from the cluster?
• Overlapping vs non-overlapping?
– Is it OK for two clusters to contain some items in common?
– Should clusters "nest" within one another?

Example Approaches
• "Hard" clustering
– Every item is in only one cluster
• "Soft" clustering
– Items can belong to more than one cluster
– Nested hierarchy: an item belongs to a cluster, as well as to the cluster's parent cluster, and so on
– Non-nested: an item belongs to two separate clusters
• E.g. a document about jaguar cats riding in Jaguar cars might belong to the "animal" cluster and the "car" cluster

Example Approaches
• Flat clustering:
– No overlap: every item in exactly one cluster
– K clusters total
– Start with random groups, then refine them until they are "good"
• Hierarchical clustering:
– Clusters are nested: a cluster can be made up of two or more smaller clusters
– No fixed number of clusters
– Start with one group and split it until there are good clusters
– Or start with N groups and agglomerate them until there are good clusters

Flat Clustering
• Goal: partition N documents into K clusters
• Given: N document feature vectors, a number K
• Optimal algorithm:
– Try every possible clustering and take whichever one is the "best"
– Computation time: O(K^N)
• Heuristic approach:
– Split documents into K clusters randomly
– Move documents from one cluster to another until the clusters seem "good"

K-Means Clustering
• K-means is a partitioning heuristic
• Documents are represented as vectors
• Clusters are represented as a centroid vector
• Basic algorithm:
– Step 0: Choose K docs to be initial cluster centroids
– Step 1: Assign points to closest centroid
– Step 2: Recompute cluster centroids
– Step 3: Goto 1

K-Means Clustering Algorithm
Input: N documents, a number K
• A[1], A[2], …, A[N] := 0
• C1, C2, …, CK := initial cluster assignment (pick K docs)
• do
– changed = false
– for each document Di, i = 1 to N
• k = argmin_k dist(Di, Ck)        (equivalently, k = argmax_k sim(Di, Ck))
• if A[i] ≠ k then A[i] := k and changed := true
– recompute each centroid Ck from the documents currently assigned to it
• while changed
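The K-means outline above can be sketched in Python. This is a minimal illustration with made-up 2-D "document" vectors and Euclidean distance; an IR system would instead use term vectors and a similarity such as cosine:

```python
import math
import random

def euclidean(u, v):
    """Euclidean distance between two dense vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(xs) / n for xs in zip(*vectors))

def kmeans(docs, k, seed=0):
    """K-means following the slide's algorithm: pick K docs as initial
    centroids, assign each doc to its closest centroid, recompute the
    centroids, and repeat until no assignment changes."""
    rng = random.Random(seed)
    centroids = rng.sample(docs, k)           # Step 0: K docs as initial centroids
    assign = [0] * len(docs)                  # A[1..N] := 0
    changed = True
    while changed:                            # do ... while changed
        changed = False
        for i, d in enumerate(docs):          # Step 1: assign to closest centroid
            best = min(range(k), key=lambda j: euclidean(d, centroids[j]))
            if assign[i] != best:
                assign[i] = best
                changed = True
        for j in range(k):                    # Step 2: recompute centroids
            members = [d for d, a in zip(docs, assign) if a == j]
            if members:                       # keep old centroid if cluster empties
                centroids[j] = centroid(members)
    return assign, centroids

# Toy "documents": two well-separated groups in two dimensions.
docs = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
        (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
assign, cents = kmeans(docs, k=2)
print(assign)
```

Because K-means is a heuristic, the final partition depends on which K documents happen to be chosen as initial centroids; on clearly separated data like the toy points above it converges to the obvious two groups, but in general it only reaches a local optimum.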