Unformatted text preview:

ICS 278: Data Mining Lecture 9,10: Clustering AlgorithmsProject Progress ReportList of Sections for your Progress ReportClusteringSlide 5Why is Clustering useful?General Issues in ClusteringGeneral Families of Clustering AlgorithmsPartition-Based ClusteringScore Function for Partition-Based ClusteringK-means ClusteringK-means ComplexityK-means example (courtesy of Andrew Moore, CMU)K-meansSlide 15Slide 16Slide 17Slide 18Slide 19Accelerated ComputationsK-means continues…Slide 22Slide 23Slide 24Slide 25Slide 26Slide 27Slide 28K-means terminatesSlide 30Issues in K-means clusteringProbabilistic Clustering: Mixture ModelsGaussian Mixture Models (GMM)Learning Mixture Models from DataThe E (Expectation) StepThe M (Maximization) StepComplexity of EM for mixturesComments on Mixtures and EM LearningSlide 39Slide 40Slide 41Slide 42Slide 43Slide 44Slide 45Slide 46Selecting K in mixture modelsExample of BIC Score for Red-Blood Cell DataSlide 49Hierarchical ClusteringSlide 51Agglomerative Methods: Bottom-UpDistances Between ClustersDendrogram Using Single-Link MethodDendogram Using Ward’s SSE DistanceDivisive Methods: Top-DownSpectral/Graph-based ClusteringClustering non-vector objectsSlide 59Slide 60Slide 61SummaryData Mining Lectures Lecture 9,10: Clustering Padhraic Smyth, UC IrvineICS 278: Data MiningLecture 9,10: Clustering AlgorithmsPadhraic SmythDepartment of Information and Computer ScienceUniversity of California, IrvineData Mining Lectures Lecture 9,10: Clustering Padhraic Smyth, UC IrvineProject Progress Report•Written Progress Report:–Due Tuesday May 18th in class–Expect at least 3 pages (should be typed not handwritten)–Hand in written document in class on Tuesday May 18th•1 Powerpoint slide:–1 slide that describes your project–Should contain:•Your name (top right corner)•Clear description of the main task•Some visual graphic of data relevant to your task•1 bullet or 2 on what methods you plan to use•Preliminary results or results of exploratory data analysis•Make it graphical (use text sparingly)–Submit by 12 noon Monday May 17thData Mining Lectures Lecture 9,10: Clustering Padhraic Smyth, UC IrvineList of Sections for your Progress Report•Clear description of task (reuse original proposal if needed)–Basic task + extended “bonus” tasks (if time allows)•Discussion of relevant literature –Discuss prior published/related work (if it exists)•Preliminary data evaluation–Exploratory data analysis relevant to your task–Include as many of plots/graphs as you think are useful/relevant•Preliminary algorithm work –Summary of your progress on algorithm implementation so far•If you are not at this point yet, say so–Relevant information about other code/algorithms you have downloaded, some preliminary testing on, etc.–Difficulties encountered so far•Plans for the remainder of the quarter–Algorithm implementation–Experimental methods–Evaluation, validation•Approximately ½ page to 1 page of text per section (graphs/plots don’t count – include as many of these as you like).Data Mining Lectures Lecture 9,10: Clustering Padhraic Smyth, UC IrvineClustering•“automated detection of group structure in data”–Typically: partition N data points into K groups (clusters) such that the points in each group are more similar to each other than to points in other groups–descriptive technique (contrast with predictive)–for real-valued vectors, clusters can be thought of as clouds of points in p-dimensional spaceData Mining Lectures Lecture 9,10: Clustering Padhraic Smyth, UC IrvineClusteringSometimes easySometimes impossibleand sometimes in betweenData Mining Lectures Lecture 9,10: Clustering Padhraic Smyth, UC IrvineWhy is Clustering useful?• “Discovery” of new knowledge from data–Contrast with supervised classification (where labels are known)–Long history in the sciences of categories, taxonomies, etc–Can be very useful for summarizing large data sets •For large n and/or high dimensionality•Applications of clustering–Discovery of new types of galaxies in astronomical data–Clustering of genes with similar expression profiles–Cluster pixels in an image into regions of similar intensity–Segmentation of customers for an e-commerce store–Clustering of documents produced by a search engine–…. many moreData Mining Lectures Lecture 9,10: Clustering Padhraic Smyth, UC IrvineGeneral Issues in Clustering•Representation:–What types of clusters are we looking for?•Score:–The criterion to compare one clustering to another•Optimization–Generally, finding the optimal clustering is NP-hard•Greedy algorithms to optimize score are widely used•Other issues–Distance function, D(x(i),x(j)) critical aspect of clustering, both•distance of pairs of objects•distance of objects from clusters –How is K selected?–Different types of data•Real-valued versus categorical•Attribute-valued vectors vs. n2 distance matrixData Mining Lectures Lecture 9,10: Clustering Padhraic Smyth, UC IrvineGeneral Families of Clustering Algorithms•partition-based clustering–e.g. K-means•probabilistic model-based clustering–e.g. mixture models [both of the above work with measurement data, e.g., feature vectors]•hierarchical clustering–e.g. hierarchical agglomerative clustering•graph-based clustering–E.g., min-cut algorithms[both of the above work with distance data, e.g., distance matrix]Data Mining Lectures Lecture 9,10: Clustering Padhraic Smyth, UC IrvinePartition-Based Clustering•given: n data points X={x(1) … x(n)}•output: k partitions C = {C1 … CK} such that–each x(i) is assigned to unique Cj (hard-assignment)–C implicitly represents a mapping from X to C•Optimization algorithm–require that score[C, X] is maximized•e.g., sum-of-squares of within cluster distances–exhaustive search intractable–combinatorial optimization to assign n objects to k classes–large search space: possible assignment choices ~ kn•so, use greedy interative method •will be subject to local maximaData Mining Lectures Lecture 9,10: Clustering Padhraic Smyth, UC IrvineScore Function for Partition-Based Clustering•want compact clusters–minimize within cluster distances wc(C)•want


View Full Document

UCI ICS 278 - Clustering Algorithms

Download Clustering Algorithms
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Clustering Algorithms and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Clustering Algorithms 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?