Rutgers University CS 536 - Clustering and Unsupervised Learning

Clustering and Unsupervised Learning
CS 536: Machine Learning
Littman (Wu, TA)

Administration
CSGSS: attend grad student talks
Exam:
• open book exam
• 12/15 in SEC 217 from 1 to 4pm
Reviews:
• http://www.cs.rutgers.edu/~mlittman/courses/ml03/iCML03/review.html

Some Reading
Not sure what to suggest for K-Means and single-link hierarchical clustering.
Kleinberg (2002). "An impossibility theorem for clustering" in NIPS-15.
Landauer, Laham, and Foltz (1999). "Learning human-like knowledge by singular value decomposition: A progress report" in NIPS-10.

K-Means Clustering
Slides from Andrew Moore (CMU).

Some Data
This could easily be modeled by a Gaussian Mixture (with 5 components).
But let's look at a satisfying, friendly, and infinitely popular alternative…

Lossy Compression
Want to transmit points, lossily, with only two bits per point.
Loss = sum squared error between decoded and original coordinates.
What encoder/decoder will lose the least information?

First Idea
Break into a grid; decode each bit-pair as the middle of each grid-cell.
Any better ideas?

Second Idea
Break into a grid; decode each bit-pair as the centroid of all data in that grid-cell.
Other ideas?

K-Means
• Ask user how many clusters he'd like (say, k = 5).
• Randomly guess k cluster center locations.
• Each datapoint finds out which center it's closest to.
• Each center finds the centroid of the points it owns…
• …and jumps there.
• …Repeat until terminated!

K-Means Illustrated
[Figures: successive iterations of K-means on the example data.]
For a fast implementation, see: Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases 1999 (KDD99), available at www.autonlab.org/pap.html.

K-Means Questions
• What is it trying to optimize?
• Are we sure it will terminate?
• Are we sure it will find an optimal clustering?
• How should we start it?
• How could we automatically choose the number of centers?
…we'll deal with these questions over the next few slides.

Distortion
Given…
• an encoder function: ENCODE : R^m → [1..k]
• a decoder function: DECODE : [1..k] → R^m
Define
  Distortion = Σ_{i=1..R} (x_i − DECODE[ENCODE(x_i)])²
We may as well write DECODE[j] = c_j, so
  Distortion = Σ_{i=1..R} (x_i − c_{ENCODE(x_i)})²

The Minimal Distortion
What properties must centers c_1, c_2, …, c_k have when distortion is minimized?
  Distortion = Σ_{i=1..R} (x_i − c_{ENCODE(x_i)})²
(1) x_i must be encoded by its nearest center. …why?
  ENCODE(x_i) = argmin_{c_j ∈ {c_1, …, c_k}} ‖x_i − c_j‖²  …at the minimal distortion.
(2) The partial derivative of Distortion with respect to each center location must be zero (center at centroid). Writing OwnedBy(c_j) for the set of records owned by center c_j:
  Distortion = Σ_{i=1..R} (x_i − c_{ENCODE(x_i)})² = Σ_{j=1..k} Σ_{i ∈ OwnedBy(c_j)} (x_i − c_j)²
  ∂Distortion/∂c_j = −2 Σ_{i ∈ OwnedBy(c_j)} (x_i − c_j) = 0  (for a minimum)
  ⟹ c_j = (1/|OwnedBy(c_j)|) Σ_{i ∈ OwnedBy(c_j)} x_i

Improving the Configuration
What can be changed about centers c_1, c_2, …, c_k when distortion is not minimized?
(1) Change the encoding so that x_i is encoded by its nearest center.
(2) Set each center to the centroid of the points it owns.
There's no point applying either operation twice in succession, but it can be profitable to alternate.
…And that's K-means!
Easy to prove this procedure will terminate in a state at which neither (1) nor (2) changes the configuration. Why?
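To make the alternating procedure above concrete, here is a minimal NumPy sketch of the loop. It is not from the original slides; the function name kmeans and the parameters n_iters and seed are just illustrative choices, and the initialization from actual datapoints anticipates "Idea 1" on a later slide.

# Minimal K-means (Lloyd-style) sketch with NumPy.
# Alternates the two operations from the slides: (1) assign each point to
# its nearest center, (2) move each center to the centroid of the points
# it owns. Distortion never increases, so the loop terminates.
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k distinct datapoints (illustrative initialization).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # (1) Each datapoint finds the center it is closest to.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        owner = dists.argmin(axis=1)
        # (2) Each center jumps to the centroid of the points it owns.
        new_centers = np.array([
            X[owner == j].mean(axis=0) if np.any(owner == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # neither operation changes the configuration any more
        centers = new_centers
    distortion = ((X - centers[owner]) ** 2).sum()
    return centers, owner, distortion

Because each pass can only lower the distortion, restarting from several random seeds and keeping the lowest-distortion run (Idea 2 on a later slide) is a cheap guard against bad local optima.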
Is It Optimal?
• Not necessarily.
• Can you invent a configuration that has converged, but does not have the minimum distortion?

Seeking Good Optima
• Idea 1: Be careful about where you start. (Use datapoints, far apart if possible.)
• Idea 2: Do many runs of K-Means, each from a different random starting point.
• Many ideas floating around.

Common Uses of K-Means
• Often used as an exploratory data analysis tool
• In one dimension, a good way to quantize real-valued variables into k non-uniform buckets
• Used on acoustic data in speech understanding to convert waveforms into one of k categories (known as Vector Quantization)
• Also used for choosing color palettes on old-fashioned graphical display devices!
(Note: how do we choose the number of centers?)

Single Linkage Hierarchical Clustering
1. Say "every point is its own cluster."
2. Find the "most similar" pair of clusters.
3. Merge it into a parent cluster.
4. Repeat.

Some Details
How do we define similarity between clusters?
• Minimum distance between points in clusters (simply Euclidean Minimum Spanning Trees)
• Maximum distance between points in clusters
• Average distance between points in clusters
You're left with a nice dendrogram (taxonomy, hierarchy, tree) of datapoints.

Single Linkage Comments
• It's nice that you get a hierarchy instead of an amorphous collection of groups
• If you want k groups, just cut the (k−1) longest links
• There's no real statistical or information-theoretic foundation to this.
• Also known in the trade as Hierarchical Agglomerative Clustering (note the acronym)

Stuff to Know
• All the details of K-means
• The theory behind K-means as an optimization algorithm
• How K-means can get stuck
• The outline of hierarchical clustering

Clustering Desiderata
Abstractly, a clustering algorithm f maps a distance matrix D between objects to a partition P of the objects.
What is an ideal clustering algorithm?
1. Richness: for all P, there exists D such that f(D) = P.
2. Scale invariance: for all c > 0, f(D) = f(cD).
3. Consistency: for any D, if f(D) = P, then for any D' with smaller within-cluster distances and larger between-cluster distances, f(D') = P.

Any Pair of These Possible
Using single-link clustering, we can vary the stopping condition and get:
• Richness and consistency: stop when distances are more than r.
• Scale invariance and consistency: stop when the number of clusters is k.
• Richness and scale invariance: stop when distances, as a fraction of max_ij D_ij, exceed r.

All Three Impossible!
It is impossible to specify a clustering algorithm that satisfies all three conditions simultaneously (Kleinberg 02).
Essentially, the proof shows that consistency and scale invariance severely constrain what partitions can be created.

Why Clustering?
Which brings up the question: why did we want to do clustering anyway? What is the task?
Can we accomplish the same things by assigning distances or similarities to pairs of objects?

Alternatives to Clustering
An embedding algorithm e maps distances D to locations L in some vector space so that dist(L_i, L_j) approximates D_ij.
• multidimensional scaling (MDS)
• singular value decomposition (SVD)
• non-negative matrix factorization (NMF)
• non-linear embedding (LLE, Isomap)
• and more
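As one concrete instance of the embedding alternatives listed above, here is a rough NumPy sketch of classical MDS. The slides only name MDS; they do not prescribe an algorithm, so the function name classical_mds, the dim parameter, and the eigendecomposition route are assumptions for illustration.

# Classical multidimensional scaling (MDS) sketch: given a symmetric
# distance matrix D, find low-dimensional locations L whose Euclidean
# distances approximate D_ij.
import numpy as np

def classical_mds(D, dim=2):
    n = D.shape[0]
    # Double-center the squared distances to recover an inner-product matrix.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    # Embed using the top eigenvectors of the (symmetric) matrix B.
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:dim]
    L = eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))
    return L  # one row of coordinates per object

Given a pairwise distance matrix D, L = classical_mds(D, dim=2) yields 2-D coordinates that can be plotted and inspected directly, addressing the "Why Clustering?" question by assigning locations instead of a partition.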

