DOC PREVIEW
CMU CS 15381 - Unsupervised Machine Learning Methods

This preview shows page 1-2-3-4-5-6 out of 19 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 19 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 19 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 19 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 19 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 19 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 19 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 19 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Artificial Intelligence 15-381 Unsupervised Machine Learning MethodsUnsupervised LearningSlide 3Similarity Measures in Data AnalysisSlide 5Similarity Measures (cont.)Similarity MatrixDocument Clustering TechniquesSlide 9Incremental Clustering MethodsIncremental Clustering (cont.)Slide 12Slide 13Non-hierarchical Clustering MethodsNon-hierarchical Methods (cont.)Hierarchical Agglomerative Clustering MethodsHierarchical Agglomerative Clustering Methods (cont.)Creating TaxonomiesTaxonomies (cont.)Artificial Intelligence 15-381Unsupervised Machine Learning MethodsJaime Carbonell1-November-2001OUTLINE:What is unsupervised learning?Similarity computationsClustering AlgorithmsOther kinds of unsupervised learningUnsupervised LearningDefinition of Unsupervised Learning:Learning useful structure without labeled classes, optimization criterion, feedback signal, or any other information beyond the raw dataUnsupervised LearningExamples:Find natural groupings of Xs (X=human languages, stocks, gene sequences, animal species,…) Prelude to discovery of underlying propertiesSummarize the news for the past monthCluster first, then report centroids.Sequence extrapolation: E.g. Predict cancer incidence next decade; predict rise in antibiotic-resistant bacteriaMethodsClustering (n-link, k-means, GAC,…)Taxonomy creation (hierarchical clustering)Novelty detection ("meaningful"outliers)Trend detection (extrapolation from multivariate partial derivatives)Similarity Measures in Data AnalysisGeneral AssumptionsEach data item is a tuple (vector)Values of tuple are nominal, ordinal or numericalSimilarity = (Distance)-1Pure Numerical TuplesSim(di,dj) = di,kdj,ksim (di,dj) = cos(didj)…and many more (slide after next)Similarity Measures in Data AnalysisFor Ordinal ValuesE.g. "small," "medium," "large," "X-large"Convert to numerical assuming constant …on a normalized [0,1] scale, where: max(v)=1, min(v)=0, others interpolateE.g. "small"=0, "medium"=0.33, etc.Then, use numerical similarity measuresOr, use similarity matrix (see next slide)Similarity Measures (cont.)For Nominal ValuesE.g. "Boston", "LA", "Pittsburgh", or "male", "female", or "diffuse", "globular", "spiral", "pinwheel"Binary rule: If di,k=dj,k, then sim=1, else 0Use underlying sematic property: E.g. Sim(Boston, LA)=dist(Boston, LA)-1, or Sim(Boston, LA)=(|size(Boston) – size(LA)| )-1Use similarity MatrixSimilarity Matrixtinylittle small medium large hugetiny1.0 0.8 0.7 0.5 0.2 0.0little 1.0 0.9 0.7 0.3 0.1small 1.0 0.7 0.3 0.2medium 1.0 0.5 0.3large 1.0 0.8huge 1.0Diagonal must be 1.0Monotonicity property must holdTriangle inequality must holdTransitive property need *not* holdDocument Clustering TechniquesSimilarity or Distance Measure:Alternative ChoicesCosine similarity Euclidean distanceKernel functions, e.g.,Language Modeling P(y|modelx) where x and y are documentsDocument Clustering TechniquesKullback Leibler distance ("relative entropy")Incremental Clustering MethodsGiven n data items: D: D1, D2,…Di,…DnAnd given minimal similarity threshold: SminCluster data incrementally as follows:Procedure Singlelink(D)Let CLUSTERS = {D1}For i=2 to n Let Dc =Argmax[Sim(Di,Dj]; j<iIf Dc>Smin, add Dj to Dc's clusterElse Append(CLUSTERS, {Dj};; new clusterIncremental Clustering (cont.)Procedure Averagelink(D)Let CLUSTERS = {D1}For i=2 to n Let Dc =Argmax[Sim(Di, centroid(C)] C in CLUSTERSIf Dc>Smin, add Dj to cluster CElse Append(CLUSTERS, {Dj};; new clusterObservationsSingle pass over the dataeasy to cluster new data incrementallyRequires arbitrary Smin thresholdO(N2) time, O(N) spaceDocument Clustering TechniquesExample. Group documents based on similaritySimilarity matrix:Thresholding at similarity value of .9 yields:complete graph C1 = {1,4,5}, namely Complete Linkageconnected graph C2={1,4,5,6}, namely Single LinkageFor clustering we need three things:A similarity measure for pairwise comparison between documentsA clustering criterion (complete Link, Single Ling,…)A clustering algorithmDocument Clustering TechniquesClustering Criterion: Alternative LinkagesSingle-link ('nearest neighbor"):Complete-link:Average-link ("group average clustering") or GAC):Non-hierarchical Clustering MethodsA Single-Pass Algorithm1. Treat the first document as the first cluster (singleton cluster).2. Compare each subsequent document to all the clusters processed so far.3. Add this new document to the closest cluster if the intercluster similarity is above the similarity threshold (predetermined); otherwise, leave the new document alone as a new cluster.4. Repeat Steps 2 and 3 until all the documents are processed. - O(n2) time and O(n) space (worst case complexity)Non-hierarchical Methods (cont.)Multi-pass K-means ("reallocation method")1. Select K initial centroids (the "seeds")2. Assign each document to the closeest centroid, resulting in K clusters.3. Recompute the centroid for each of the K clusters.4. Repeat Steps 2 and 3 until the centroids are stabilized.- O(nK) time and O(K) space per passHierarchical Agglomerative Clustering MethodsGeneric Agglomerative Procedure (Salton '89):-result in nested clusters via iterations1. Compute all pairwise document-document similarity coefficients2. Place each of n documents into a class of its own3. Merge the two most similar clusters into one; - replace the two clusters by the new cluster - compute intercluster similarity scores w.r.t. the new cluster4. Repeat the above step until only one cluster is leftHierarchical Agglomerative Clustering Methods (cont.)Heuristic Approaches to Speedy Clustering:Reallocation methods with k selected-seeds (O(kn) time)- k is the desired number of clusters; n is the number of documentsBuckshot: random sampling (of (k)n documents) puls global HACFractionation: Divide and ConquerCreating TaxonomiesHierarchical ClusteringGAC trace creates binary hierarchyIncremental-link Hierarchical version1. Cluster data with high Smin 1st hierarchical level2. Decrease Smin (stop at Smin=0)3. Treat cluster centroids as data tuples and recluster, creating next level of hierarchy, then repeat steps 2 and 3.K-means Hierarchical k-means1. Cluster data with large k2. Decrease k (stop at k=1)3. Treat cluster centroids as data


View Full Document

CMU CS 15381 - Unsupervised Machine Learning Methods

Documents in this Course
Planning

Planning

19 pages

Planning

Planning

19 pages

Lecture

Lecture

42 pages

Lecture

Lecture

27 pages

Lecture

Lecture

19 pages

FOL

FOL

41 pages

lecture

lecture

34 pages

Exam

Exam

7 pages

Lecture

Lecture

22 pages

Handout

Handout

11 pages

Midterm

Midterm

14 pages

lecture

lecture

83 pages

Handouts

Handouts

38 pages

mdp

mdp

37 pages

HW2

HW2

7 pages

nn

nn

25 pages

lecture

lecture

13 pages

Handout

Handout

5 pages

Lecture

Lecture

27 pages

Lecture

Lecture

62 pages

Lecture

Lecture

5 pages

Load more
Download Unsupervised Machine Learning Methods
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Unsupervised Machine Learning Methods and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Unsupervised Machine Learning Methods 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?