5/4/09

Document Clustering and Latent Semantic Indexing
CISC689/489-010, Lecture Wednesday, April 22nd
Ben Carterette

Clustering Review
• Cluster documents according to similarity in feature space
• Types of clustering:
  – Flat or hierarchical
  – "Hard" or "soft"
• Clustering algorithms:
  – K-means: flat, "hard" clusters
  – Agglomerative or divisive hierarchical: hierarchical, softer clusters

K-Nearest Neighbor Clustering
• Alternative idea: fixed-size soft clusters
• K-nearest neighbor: for each item j, cluster it with the K things most similar to it
• K-nearest neighbor clustering forms one cluster per item
  – Clusters overlap
  – Does not necessarily have to cluster everything

[Figure: 5-nearest neighbor clustering of points labeled A–D]

Evaluating Clustering
• Clustering will never be 100% accurate
  – Documents will be placed in clusters they don't belong in
  – Documents will be excluded from clusters they should be part of
  – A natural consequence of using term statistics to represent the information contained in documents
• Like retrieval and classification, clustering effectiveness must be evaluated
• Evaluating clustering is challenging, since it is an unsupervised learning task

Evaluating Clustering
• If labels exist, can use standard IR metrics, such as precision and recall
  – In this case we are evaluating the ability of our algorithm to discover the "true" latent information
• This only works if you have some way to "match" clusters to classes
• What if there are fewer or more clusters than classes?

              Class A   Class B   Class C   Class D
  Cluster 1     A1        B1        C1        D1
  Cluster 2     A2        B2        C2        D2
  Cluster 3     A3        B3        C3        D3
  Cluster 4     A4        B4        C4        D4

Evaluating Clusters
• "Purity": the ratio between the number of documents from the dominant class in Ci to the size of Ci

    purity(Ci) = max_j |Ci ∩ Kj| / |Ci|

  – Ci is a cluster; Kj is a class
• Not such a great measure
  – Does not take into account coherence of the class
  – Optimized by making N clusters, one for each document

Evaluating Clusters
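The purity measure above can be computed directly from cluster assignments and class labels. A minimal sketch (the function name and toy data are mine, not from the slides); note how it also demonstrates the weakness mentioned above, that singleton clusters trivially achieve purity 1.0:

```python
from collections import Counter

def purity(clusters, labels):
    """Purity: for each cluster, count the documents belonging to its
    dominant (most frequent) class; divide the total by the number of
    clustered documents.

    clusters -- list of clusters, each a list of document ids
    labels   -- dict mapping document id to its true class
    """
    n = sum(len(c) for c in clusters)
    dominant = sum(Counter(labels[d] for d in c).most_common(1)[0][1]
                   for c in clusters)
    return dominant / n

# Toy example: six labeled documents in two clusters.
labels = {1: "A", 2: "A", 3: "B", 4: "B", 5: "B", 6: "A"}
clusters = [[1, 2, 3], [4, 5, 6]]
print(purity(clusters, labels))   # 4 dominant docs out of 6

# Degenerate case: one cluster per document maximizes purity.
print(purity([[d] for d in labels], labels))   # 1.0
```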
• With no labeled data, evaluation is even more difficult
• Best approach:
  – Evaluate the system that the clustering is part of
  – E.g. if clustering is used to aid retrieval, evaluate the cluster-aided retrieval
• What kinds of systems use clustering?

Clusters in IR Systems
• Automatic clustering has several uses:
  – Improving efficiency
  – Improving effectiveness of index
  – Improving effectiveness of retrieval

Improving Efficiency
• Clusters can be a time- and space-saving device:
  – Cluster all documents in the corpus
  – Compare queries to cluster representations rather than document representations
  – Store cluster information in inverted lists rather than document information
• Important in the 70s and 80s, not so much today

[Figure: inverted list entries storing cluster term frequency and cluster frequency]

Improving Effectiveness
• Recall the cluster hypothesis:
  – "Closely associated documents tend to be relevant to the same requests"
• We may be able to improve retrieval by finding documents that are closely associated with documents that are likely to be relevant
  – Even if they don't contain query terms
  – E.g. perhaps they contain related terms the user didn't think of ("industry" → "companies", "businesses", "producers", …)

Improving Effectiveness by Ranking Clusters
• As with clustering for efficiency, cluster all documents before indexing
• Store cluster information in inverted lists (but keep document information too)
• When a user enters a query, score the clusters
• Then score documents within the top-scoring clusters

Improving Effectiveness by Ranking Clusters
• Two approaches:
  – Rank the clusters from highest scoring to lowest scoring; within each cluster rank documents from highest scoring to lowest scoring
  – Rank the documents in the K highest-scoring clusters from highest score to lowest score
• Both tend to find relevant documents that are not found with non-clustering methods
  – But also miss relevant documents that are found with non-clustering methods

Scoring Clusters
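The second approach above, scoring clusters and then ranking only the documents inside the K highest-scoring clusters, can be sketched with a toy vector-space implementation. The sparse-dict document representation and helper names are my assumptions, not from the slides:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(docs):
    """Mean vector of a cluster's document vectors."""
    c = {}
    for d in docs:
        for t, w in d.items():
            c[t] = c.get(t, 0.0) + w
    return {t: w / len(docs) for t, w in c.items()}

def cluster_aided_rank(query, clusters, k=2):
    """Score clusters by centroid similarity to the query, then rank
    only the documents inside the k highest-scoring clusters."""
    top = sorted(clusters, key=lambda c: cosine(query, centroid(c)),
                 reverse=True)[:k]
    docs = [d for c in top for d in c]
    return sorted(docs, key=lambda d: cosine(query, d), reverse=True)

# Toy example: two clusters, one about cars, one about flowers.
q  = {"car": 1.0, "engine": 1.0}
d1 = {"car": 1.0, "auto": 1.0}
d2 = {"flower": 1.0}
d3 = {"engine": 1.0, "car": 2.0}
d4 = {"petal": 1.0, "flower": 2.0}
ranking = cluster_aided_rank(q, [[d1, d3], [d2, d4]], k=1)
# Only the car/engine cluster survives the cluster-scoring stage,
# and within it d3 outranks d1.
```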
• A cluster can be scored the same way as a document
  – Vector space model: cosine similarity between query vector and cluster centroid
  – Language model: query likelihood under the cluster's language model, P(Q | C) = ∏i P(qi | C)
• Or other ways:
  – Smoothing with collection: P(w | C) = λ Pml(w | C) + (1 − λ) P(w | G)

Using Clusters to Adjust Document Language Models
• Language models require background to smooth with
• Clusters provide a background, perhaps more focused than using entire collection
• Smooth document language model with cluster background first, then with collection background
  – P(w | D) – frequency of term in document
  – P(w | C) – frequency of term in all documents in a cluster
  – P(w | G) – frequency of term in all documents in the collection

Smoothing Document Scores
• Basic approach: linear interpolation

    P(w | D) = λ1 Pml(w | D) + λ2 Pml(w | C) + λ3 Pml(w | G),  λ1 + λ2 + λ3 = 1

• Better approach: smooth cluster with collection, then smooth document with both

    P(w | D) = λ Pml(w | D) + (1 − λ) [ β Pml(w | C) + (1 − β) Pml(w | G) ]

Query-Time Clustering
• Clustering the entire collection takes a long time
• Can only use simple algorithms like K-means
• Instead, cluster the top documents ranked for a query
• Use those clusters to re-score and re-rank the documents

Does it Work?
• Retrieving clusters
  – Verdict: no apparent improvement
• Clustering at index time, use clusters for smoothing
  – Verdict: improvement over baselines

Does it Work?
• Clustering at query time, use clusters for smoothing
  – Verdict: improvement over baselines

Clusters in IR – Summary
• Index-time clustering
  – Saves space in inverted file
  – Build topical hierarchies
  – Retrieve clusters of documents rather than individual documents
• Query-time clustering
  – After query is submitted, cluster results
  – Possibly detect subtopics or different interpretations of query

Latent Semantic Indexing
• Clustering only puts documents together based on term similarity
• Can we do more than that?
  – We'd like synonyms and highly related terms to count for more when calculating document similarity
  – And words that have multiple senses to count for less
• How can we do this?

Linear Algebra Background
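The linear algebra behind LSI is the truncated singular value decomposition of the term-document matrix, which maps documents into a low-dimensional latent space where co-occurring terms (e.g. synonyms) reinforce each other. A minimal numpy sketch, with an invented toy matrix of my own (not from the slides):

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents.
# "car", "auto", and "engine" co-occur across the first three documents,
# so LSI places those documents close together in the latent space.
terms = ["car", "auto", "engine", "flower", "petal"]
A = np.array([[1., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 1., 0.],
              [0., 0., 0., 1.],
              [0., 0., 0., 1.]])

# Truncated SVD: A ~ U_k S_k V_k^T keeps the k strongest latent dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs_k = (np.diag(s[:k]) @ Vt[:k]).T   # documents in k-dim latent space

def cos(u, v):
    """Cosine similarity between two dense vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Docs 0 and 1 share only the term "car" in the raw matrix, yet land
# nearly on top of each other in the latent space; doc 3 (flower/petal)
# stays orthogonal to the car-related documents.
```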