Princeton COS 435 - Overview and K-means algorithm


Clustering: Overview and K-means algorithm
(K-means illustrations thanks to 2006 student Martin Makowiecki)

Informal goal
• Given a set of objects and a measure of similarity between them, group similar objects together
• What do we mean by "similar"?
• What is a good grouping?
• Computation time / quality tradeoff

General types of clustering
• "Soft" versus "hard" clustering
  – hard: partition the objects; each object is in exactly one partition
  – soft: assign a degree to which each object belongs in each cluster; view it as a probability or score
• Flat versus hierarchical clustering
  – hierarchical = clusters within clusters

Applications: many
• biology
• astronomy
• computer-aided design of circuits
• information organization
• marketing
• …

Clustering in information search and analysis
• Group information objects ⇒ discover topics? other groupings may be desirable
• Clustering versus classifying
  – classifying: pre-determined classes with example members
  – clustering: get groups of similar objects, with the added problem of labeling the clusters by topic (e.g. by terms common within a cluster of documents)

Example applications in search
• Query evaluation: cluster pruning (§7.1.6)
  – cluster all documents
  – choose a representative for each cluster
  – evaluate the query with respect to the cluster representatives
  – evaluate the query for the documents in the cluster(s) having the most similar representative(s)
• Results presentation: labeled clusters
  – cluster only the query results
  – e.g. Clusty.com (metasearch)
• For each application: hard or soft clustering? flat or hierarchical?

Issues
• What attributes represent the items for clustering purposes?
• What is the measure of similarity between items?
  – general objects with a matrix of pairwise similarities, or
  – objects with specific properties that allow other specifications of the measure; most commonly the objects are d-dimensional vectors, with
    » Euclidean distance
    » cosine similarity
• What is the measure of similarity between clusters?

Issues, continued
• Cluster goals?
  – number of clusters?
  – flat or hierarchical clustering?
  – cohesiveness of clusters?
• How to evaluate the resulting clusters?
  – relates to the measure of closeness between clusters
• Efficiency of the clustering algorithms
  – large data sets ⇒ external storage
• Maintaining clusters in a dynamic setting?
• Clustering methods? MANY!

General types of clustering methods
• Agglomerative versus divisive algorithms
  – agglomerative = bottom-up: build up clusters from single objects
  – divisive = top-down: break the cluster containing all objects into smaller clusters
  – both agglomerative and divisive give hierarchies
  – the hierarchy can be trivial (a chain):
    1: (. .) . . .
    2: ((. .) .) . .
    3: (((. .) .) .) .
    4: ((((. .) .) .) .)

General types of clustering methods, cont.
• Constructive versus iterative-improvement algorithms
  – constructive: decide which cluster each object belongs in and don't change it; often faster
  – iterative improvement: start with a clustering and move objects around to see if the clustering can be improved; often slower but better

Quality of clustering
• In applications, the quality of a clustering depends on how well it solves the problem at hand
• An algorithm optimizes some measure of quality, but that measure may or may not capture the application's needs
• The underlying graph-theoretic problems are usually NP-complete
  – e.g. graph partitioning
• So usually the algorithm is not finding an optimal clustering

Distance between clusters
Possible definitions:
I. the distance between the closest pair of objects, one in each cluster
   – called single link
II. the distance between the furthest pair of objects, one from each cluster
   – called complete linkage

Distance between clusters, cont.
Possible definitions, continued:
III. the average of the pairwise distances between all pairs of objects, one from each cluster
   – more computation
• Generally there is no representative point for a cluster
• With Euclidean distance there are exceptions:
  – centroid
  – bounding box
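As a concrete illustration of the three definitions above, here is a short Python sketch (our own illustration, not from the slides), assuming clusters are given as lists of d-dimensional points represented as tuples of numbers; the function names are ours:

    import math

    def euclidean(a, b):
        # Euclidean distance between two d-dimensional points.
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    def pairwise(A, B, dist=euclidean):
        # All distances between pairs with one object in cluster A, one in B.
        return [dist(a, b) for a in A for b in B]

    def single_link(A, B):
        # I. Distance between the closest such pair.
        return min(pairwise(A, B))

    def complete_link(A, B):
        # II. Distance between the furthest such pair.
        return max(pairwise(A, B))

    def average_link(A, B):
        # III. Average over all |A|*|B| pairs -- the most computation.
        d = pairwise(A, B)
        return sum(d) / len(d)

As written, all three compute all |A|·|B| pairwise distances.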
Vector model: the K-means algorithm
• Well known, well used
• Flat clustering
• The number of clusters k is picked ahead of time
• Iterative improvement
• Uses the notion of a centroid
• Typically uses Euclidean distance

K-means overview
• Choose k points among the set to cluster
  – call them the k centroids
• Assign each point not selected to its closest centroid
  – all the assignments give the initial clustering
• Until "happy", do:
  – recompute the centroids of the clusters
    » the new centroids may not be points of the original set
  – reassign all points to their closest centroid
    » this updates the clusters
(A minimal code sketch of this loop appears at the end of these notes.)

An example (figure slides)
• start: choose centroids and cluster
• recompute centroids
• re-cluster around the new centroids
• 2nd recompute of centroids and re-cluster
• 3rd (final) recompute and re-cluster

Details for K-means
• Need a definition of the centroid: c_i = (1/|C_i|) \sum_{x \in C_i} x for the ith cluster C_i containing objects x
  – requires a notion of the sum of objects
• Need a definition of the distance to (similarity to) a centroid
• Typically the vector model with Euclidean distance
• Minimizing the sum of the squared distances of each point to its centroid minimizes the Residual Sum of Squares:
  RSS = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}(c_i, x)^2

K-means performance
• Can prove that RSS decreases with each iteration, so the algorithm converges
• Can achieve a local optimum
  – no change in the centroids
• Running time depends on how demanding the stopping criterion is
• Works well in practice
  – speed
  – quality

Time complexity of K-means
• Let t_dist be the time to calculate the distance between two objects, and n the number of objects
• Each iteration has time complexity O(K·n·t_dist)
• Bounding the number of iterations by I gives O(I·K·n·t_dist)
• For m-dimensional vectors: O(I·K·n·m)
  – m is large and the centroids are not sparse

Space complexity of K-means
• Store the points and the centroids
  – vector model: O((n + K)·m)
• External algorithm versus internal?
  – store the K centroids in memory
  – stream through the points on each iteration

Choosing initial centroids
• Bad initialization leads to poor results
  (figures: an optimal clustering versus a non-optimal one)

Choosing initial centroids, cont.
Many people have spent much time examining how to choose seeds.
• Random
  – fast and easy, but often poor results
• Run random seeding multiple times, take the best result
  – slower, and still no guarantee of good results
• Pre-conditioning
  – remove outliers
• Choose seeds algorithmically
  – run hierarchical clustering on a sample of the points and use the resulting centroids
  – works well on small samples and for few initial centroids

K-means weaknesses (figure slides)
• non-globular clusters
• the wrong number of clusters
• outliers and empty clusters

Real cases tend to be …
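The sketch promised above: a minimal K-means in Python under the slides' assumptions (vector model, Euclidean distance, seeds drawn at random from the data points). It is an illustration of the algorithm as described, not a reference implementation; points are assumed to be tuples of floats:

    import random

    def squared_dist(a, b):
        # Squared Euclidean distance; sufficient for assignment and for RSS.
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    def centroid(cluster):
        # c_i = (1/|C_i|) * sum of the objects in C_i: the coordinate-wise mean.
        n = len(cluster)
        return tuple(sum(coords) / n for coords in zip(*cluster))

    def kmeans(points, k, max_iters=100):
        # One run of K-means; returns (clusters, centroids, RSS).
        centroids = random.sample(points, k)        # random seeding
        clusters = [[] for _ in range(k)]
        for _ in range(max_iters):                  # bound I on the iterations
            # Assignment step: O(K * n * t_dist) per iteration.
            clusters = [[] for _ in range(k)]
            for p in points:
                i = min(range(k), key=lambda j: squared_dist(p, centroids[j]))
                clusters[i].append(p)
            # Update step: recompute centroids (they need not be original
            # points); an empty cluster keeps its old centroid.
            new_centroids = [centroid(c) if c else centroids[i]
                             for i, c in enumerate(clusters)]
            if new_centroids == centroids:          # local optimum: no change
                break
            centroids = new_centroids
        # RSS = sum of squared distances of each point to its centroid.
        rss = sum(squared_dist(p, centroids[i])
                  for i, c in enumerate(clusters) for p in c)
        return clusters, centroids, rss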
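And the "run random multiple times, take the best" strategy from the seeding slide, sketched with RSS from above as the quality measure (the helper name kmeans_best_of and the sample data are ours):

    def kmeans_best_of(points, k, runs=10):
        # Several random-seed runs; keep the result with the lowest RSS.
        best = None
        for _ in range(runs):
            result = kmeans(points, k)
            if best is None or result[2] < best[2]:
                best = result
        return best

    # Example usage: three well-separated 2-D groups, k = 3.
    pts = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
           (0.9, 1.0), (1.1, 0.9), (1.0, 1.2),
           (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
    clusters, centroids, rss = kmeans_best_of(pts, k=3)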

