DOC PREVIEW
CMU CS 10701 - Clustering

This preview shows page 1-2-3-4-5-6 out of 19 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 19 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 19 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 19 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 19 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 19 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 19 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 19 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

1Eric Xing 1Machine LearningMachine Learning1010--701/15701/15--781, Spring 2008781, Spring 2008ClusteringClusteringEric XingEric XingLecture 15, March 17, 2008Reading: Chap. 9, C.B bookEric Xing 2What is clustering?z Clustering: the process of grouping a set of objects into classes of similar objectsz high intra-class similarityz low inter-class similarityz It is the commonest form of unsupervised learningz Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data where a classification of examples is givenz A common and important task that finds many applications in Science, Engineering, information Science, and other placesz Group genes that perform the same functionz Group individuals that has similar political viewz Categorize documents of similar topics z Ideality similar objects from pictures2Eric Xing 3Examplesz Peoplez Imagesz Languagez speciesEric Xing 4Issues for clusteringz What is a natural grouping among these objects?z Definition of "groupness"z What makes objects “related”?z Definition of "similarity/distance"z Representation for objectsz Vector space? Normalization?z How many clusters?z Fixed a priori?z Completely data driven?z Avoid “trivial” clusters - too large or smallz Clustering Algorithmsz Partitional algorithmsz Hierarchical algorithmsz Formal foundation and convergence3Eric Xing 5What is a natural grouping among these objects?Eric Xing 6What is Similarity?z The real meaning of similarity is a philosophical question. We will take a more pragmatic approachz Depends on representation and algorithm. For many rep./alg., easier to think in terms of a distance (rather than similarity) between vectors.Hard to define! Hard to define! ButButwe know it we know it when we see itwhen we see it4Eric Xing 7What properties should a distance measure have?z D(A,B) = D(B,A) Symmetryz D(A,A) = 0 Constancy of Self-Similarityz D(A,B) = 0 IIf A= B Positivity Separationz D(A,B) ≤ D(A,C) + D(B,C) Triangular InequalityEric Xing 8z D(A,B) = D(B,A) Symmetryz Otherwise you could claim "Alex looks like Bob, but Bob looks nothing like Alex"z D(A,A) = 0 Constancy of Self-Similarityz Otherwise you could claim "Alex looks more like Bob, than Bob does"z D(A,B) = 0 IIf A= B Positivity Separationz Otherwise there are objects in your world that are different, but you cannot tell apart.z D(A,B) ≤ D(A,C) + D(B,C) Triangular Inequalityz Otherwise you could claim "Alex is very like Bob, and Alex is very like Carl, but Bob is very unlike Carl"Intuitions behind desirable distance measure properties5Eric Xing 9Distance Measures: MinkowskiMetricz Suppose two object x and y both have p featuresz The Minkowski metric is defined byz Most Common Minkowski Metricsrpiiiryxyxd ||),(∑=−=1),,,(),,,(ppyyyyxxxxLL2121==||max),( ) distance sup"(" 3,||),( distance) (Manhattan 2,||),( ) distance (Euclidean 1,iipipiiipiiiyxyxd ryxyxd ryxyxd r−=+∞=−==−==≤≤==∑∑1122112Eric Xing 10.},{max :distance sup"" :3. :distanceManhattan :2. :distanceEuclidean :1434734534222==+=+An Example43xy6Eric Xing 1111011111100001110100111001001001101716151413121110987654321GeneBGeneA. :Distance Hamming 5141001 =+=+ )#()#(z Manhattan distance is called Hamming distance when all features are binary.z Gene Expression Levels Under 17 Conditions (1-High,0-Low)Hamming distanceEric Xing 12. and where)()())((),(∑∑∑∑∑=======−×−−−=piippiippipiiipiiiyyxxyyxxyyxxyxs1111112211≤),( yxsSimilarity Measures: Correlation Coefficientz Pearson correlation coefficientz Special case: cosine distanceyxyxyxsrrrr⋅⋅=),(7Eric Xing 13Similarity Measures: Correlation CoefficientTimeGene AGene BGene ATimeGene BExpression LevelExpression LevelExpression LevelTimeGene AGene BEric Xing 14A generic technique for measuring similarityz To measure the similarity between two objects, transform one of the objects into the other, and measure how much effort it took. The measure of effort becomes the distance measure.The distance between Patty and Selma.Change dress color, 1 pointChange earring shape, 1 pointChange hair part, 1 pointD(Patty,Selma) = 3The distance between Marge and Selma.Change dress color, 1 pointAdd earrings, 1 pointDecrease height, 1 pointTake up smoking, 1 pointLose weight, 1 pointDPMarge,Selma) = 5This is called the Edit distance or theTransformation distance8Eric Xing 15Clustering Algorithmsz Partitional algorithmsz Usually start with a random (partial) partitioningz Refine it iterativelyz K means clusteringz Mixture-Model based clusteringz Hierarchical algorithmsz Bottom-up, agglomerativez Top-down, divisiveEric Xing 16Hierarchical Clusteringz Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents.z Note that hierarchies are commonly used to organize information, for example in a web portal.z Yahoo! is hierarchy is manually created, we will focus on automatic creation of hierarchies in data mining.9Eric Xing 17Dendogramz A Useful Tool for Summarizing Similarity Measurementz The similarity between two objects in a dendrogram is represented as the height of the lowest internal node they share.z Clustering obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.Eric Xing 18Hierarchical Clusteringz Bottom-Up Agglomerative Clusteringz Starts with each obj in a separate cluster z then repeatedly joins the closest pair of clusters, z until there is only one cluster.The history of merging forms a binary tree or hierarchy.z Top-Down divisive z Starting with all the data in a single cluster, z Consider every possible way to divide the cluster into two. Choose the best division z And recursively operate on both sides.10Eric Xing 19Closest pair of clustersThe distance between two clusters is defined as the distance betweenz Single-Link z Nearest Neighbor: their closest members.z Complete-Link z Furthest Neighbor: their furthest members.z Centroid: z Clusters whose centroids (centers of gravity) are the most cosine-similarz Average: z average of all cross-cluster pairs.Eric Xing 20ba453652cbadcbDistance MatrixEuclidean Distance453,cbadc453652cbadcb4,, cbad(1)(2)(3)a,b,cccda,bd da,b,c,dSingle-Link Method11Eric Xing 21ba453652cbadcbDistance MatrixEuclidean Distance465,cbadc453652cbadcb6,,badc(1)(2)(3)a,bccda,bdc,da,b,c,dComplete-Link MethodEric Xing 22 a b c d a b c d2460Single-Link Complete-LinkDendrograms12Eric Xing 23Another ExampleEric Xing 24Computational Complexityz


View Full Document

CMU CS 10701 - Clustering

Documents in this Course
lecture

lecture

12 pages

lecture

lecture

17 pages

HMMs

HMMs

40 pages

lecture

lecture

15 pages

lecture

lecture

20 pages

Notes

Notes

10 pages

Notes

Notes

15 pages

Lecture

Lecture

22 pages

Lecture

Lecture

13 pages

Lecture

Lecture

24 pages

Lecture9

Lecture9

38 pages

lecture

lecture

26 pages

lecture

lecture

13 pages

Lecture

Lecture

5 pages

lecture

lecture

18 pages

lecture

lecture

22 pages

Boosting

Boosting

11 pages

lecture

lecture

16 pages

lecture

lecture

20 pages

Lecture

Lecture

20 pages

Lecture

Lecture

39 pages

Lecture

Lecture

14 pages

Lecture

Lecture

18 pages

Lecture

Lecture

13 pages

Exam

Exam

10 pages

Lecture

Lecture

27 pages

Lecture

Lecture

15 pages

Lecture

Lecture

24 pages

Lecture

Lecture

16 pages

Lecture

Lecture

23 pages

Lecture6

Lecture6

28 pages

Notes

Notes

34 pages

lecture

lecture

15 pages

Midterm

Midterm

11 pages

lecture

lecture

11 pages

lecture

lecture

23 pages

Boosting

Boosting

35 pages

Lecture

Lecture

49 pages

Lecture

Lecture

22 pages

Lecture

Lecture

16 pages

Lecture

Lecture

18 pages

Lecture

Lecture

35 pages

lecture

lecture

22 pages

lecture

lecture

24 pages

Midterm

Midterm

17 pages

exam

exam

15 pages

Lecture12

Lecture12

32 pages

lecture

lecture

19 pages

Lecture

Lecture

32 pages

boosting

boosting

11 pages

pca-mdps

pca-mdps

56 pages

bns

bns

45 pages

mdps

mdps

42 pages

svms

svms

10 pages

Notes

Notes

12 pages

lecture

lecture

42 pages

lecture

lecture

29 pages

lecture

lecture

15 pages

Lecture

Lecture

12 pages

Lecture

Lecture

24 pages

Lecture

Lecture

22 pages

Midterm

Midterm

5 pages

mdps-rl

mdps-rl

26 pages

Load more
Download Clustering
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Clustering and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Clustering 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?