1 Introduc)on*to*Informa)on*Retrieval* !! !!Introduc*on!to!Informa(on)Retrieval)CS276:!Informa*on!Retrieval!and!Web!Search!Pandu!Nayak!and!Prabhakar!Raghavan!Lecture!12:!Clustering!Introduc)on*to*Informa)on*Retrieval* !! !!Todayʼs!Topic:!Clustering!§ Document!clustering!§ Mo*va*ons!§ Document!representa*ons!§ Success!criteria!§ Clustering!algorithms!§ Par**onal!§ Hierarchical!Introduc)on*to*Informa)on*Retrieval* !! !!What!is!clustering?!§ Clustering:!the!process!of!grouping!a!set!of!objects!into!classes!of!similar!objects!§ Documents!within!a!cluster!should!be!similar.!§ Documents!from!different!clusters!should!be!dissimilar.!§ The!commonest!form!of!unsupervised*learning*§ Unsupervised!learning!=!learning!from!raw!data,!as!opposed!to!supervised!data!where!a!classifica*on!of!examples!is!given!§ A!common!and!important!task!that!finds!many!applica*ons!in!IR!and!other!places!Ch. 16 Introduc)on*to*Informa)on*Retrieval* !! !!A!data!set!with!clear!cluster!structure!§ How!would!you!design!an!algorithm!for!finding!the!three!clusters!in!this!case?!Ch. 16 Introduc)on*to*Informa)on*Retrieval* !! !!Applica*ons!of!clustering!in!IR!§ Whole!corpus!analysis/naviga*on!§ BeVer!user!interface:!search!without!typing!§ For!improving!recall!in!search!applica*ons!§ BeVer!search!results!(like!pseudo!RF)!§ For!beVer!naviga*on!of!search!results!§ Effec*ve!“user!recall”!will!be!higher!§ For!speeding!up!vector!space!retrieval!§ Cluster[based!retrieval!gives!faster!search!Sec. 16.1 Introduc)on*to*Informa)on*Retrieval* !! !!Yahoo!!Hierarchy!isnʼt*clustering!but!is*the!kind!of!output!you!want!from!clustering!dairy crops agronomy forestry AI HCI craft missions botany evolution cell magnetism relativity courses agriculture biology physics CS space ... ... ... … (30) www.yahoo.com/Science ... ...2 Introduc)on*to*Informa)on*Retrieval* !! !!Google!News:!automa*c!clustering!gives!an!effec*ve!news!presenta*on!metaphor!Introduc)on*to*Informa)on*Retrieval* !! !!ScaVer/Gather:!Cu_ng,!Karger,!and!Pedersen!Sec. 16.1 Introduc)on*to*Informa)on*Retrieval* !! !!For!visualizing!a!document!collec*on!and!its!themes!§ Wise!et!al,!“Visualizing!the!non[visual”!PNNL!§ ThemeScapes,!Car*a!§ [Mountain!height!=!cluster!size]!Introduc)on*to*Informa)on*Retrieval* !! !!For!improving!search!recall!§ Cluster*hypothesis![!Documents!in!the!same!cluster!behave!similarly!with!respect!to!relevance!to!informa*on!needs!§ Therefore,!to!improve!search!recall:!§ Cluster!docs!in!corpus!a!priori!§ When!a!query!matches!a!doc!D,!also!return!other!docs!in!the!cluster!containing!D*§ Hope!if!we!do!this:!The!query!“car”!will!also!return!docs!containing!automobile!§ Because!clustering!grouped!together!docs!containing!car!with!those!containing!automobile.*Why might this happen? Sec. 16.1 Introduc)on*to*Informa)on*Retrieval* !! !!11!yippy.com – grouping search results Introduc)on*to*Informa)on*Retrieval* !! !!Issues!for!clustering!§ Representa*on!for!clustering!§ Document!representa*on!§ Vector!space?!!Normaliza*on?!§ Centroids!arenʼt!length!normalized!§ Need!a!no*on!of!similarity/distance!§ How!many!clusters?!§ Fixed!a!priori?!§ Completely!data!driven?!§ Avoid!“trivial”!clusters![!too!large!or!small!§ If!a!cluster's!too!large,!then!for!naviga*on!purposes!you've!wasted!an!extra!user!click!without!whiVling!down!the!set!of!documents!much.!Sec. 16.23 Introduc)on*to*Informa)on*Retrieval* !! !!No*on!of!similarity/distance!§ Ideal:!seman*c!similarity.!§ Prac*cal:!term[sta*s*cal!similarity!§ We!will!use!cosine!similarity.!§ Docs!as!vectors.!§ For!many!algorithms,!easier!to!think!in!terms!of!a!distance!(rather!than!similarity)!between!docs.!§ We!will!mostly!speak!of!Euclidean!distance!§ But!real!implementa*ons!use!cosine!similarity!Introduc)on*to*Informa)on*Retrieval* !! !!Clustering!Algorithms!§ Flat!algorithms!§ Usually!start!with!a!random!(par*al)!par**oning!§ Refine!it!itera*vely!§ K*means!clustering!§ (Model!based!clustering)*§ Hierarchical!algorithms!§ BoVom[up,!agglomera*ve!§ (Top[down,!divisive)!Introduc)on*to*Informa)on*Retrieval* !! !!Hard!vs.!sog!clustering!§ Hard!clustering:!Each!document!belongs!to!exactly!one!cluster!§ More!common!and!easier!to!do!§ Sog!clustering:!A!document!can!belong!to!more!than!one!cluster.!§ Makes!more!sense!for!applica*ons!like!crea*ng!browsable!hierarchies!§ You!may!want!to!put!a!pair!of!sneakers!in!two!clusters:!(i)!sports!apparel!and!(ii)!shoes!§ You!can!only!do!that!with!a!sog!clustering!approach.!§ We!wonʼt!do!sog!clustering!today.!See!IIR!16.5,!!18!Introduc)on*to*Informa)on*Retrieval* !! !!Par**oning!Algorithms!§ Par**oning!method:!Construct!a!par**on!of!n!documents!into!a!set!of!K!clusters!§ Given:!a!set!of!documents!and!the!number!K!!§ Find:!a!par**on!of!K!clusters!that!op*mizes!the!chosen!par**oning!criterion!§ Globally!op*mal!§ Intractable!for!many!objec*ve!func*ons!§ Ergo,!exhaus*vely!enumerate!all!par**ons!§ Effec*ve!heuris*c!methods:!K[means!and!K[medoids!algorithms!See also Kleinberg NIPS 2002 – impossibility for natural clustering Introduc)on*to*Informa)on*Retrieval* !! !!K[Means!§ Assumes!documents!are!real[valued!vectors.!§ Clusters!based!on!centroids*(aka!the!center*of*gravity!or!mean)!of!points!in!a!cluster,!c:!§ Reassignment!of!instances!to!clusters!is!based!on!distance!to!the!current!cluster!centroids.!§ (Or!one!can!equivalently!phrase!it!in!terms!of!similari*es)!∑∈=cxxc||1(c)µSec. 16.4 Introduc)on*to*Informa)on*Retrieval* !! !!K[Means!Algorithm!Select!K!random!docs!{s1,!s2,…!sK}!as!seeds.!Un*l!clustering!converges!(or!other!stopping!criterion):!!!!!!!For!each!doc!di:!****** *Assign!di!to!the!cluster!cj!such!that!dist(xi,!sj)!is!minimal.!!!!!!!(Next,*update*the*seeds*to*the*centroid*of*each*cluster)!!!!!!!For!each!cluster!cj!*************sj!=!µ(cj)!!Sec. 16.44 Introduc)on*to*Informa)on*Retrieval* !! !!K!Means!Example!(K=2)!Pick seeds Reassign clusters Compute centroids x x Reassign clusters x x x x Compute centroids Reassign clusters Converged!
View Full Document