Unformatted text preview:

Slide 1Lecture outlineWhat is clustering?OutliersWhy do we cluster?Applications of clustering?The clustering taskObservations to clusterObservations to clusterDistance functionsData StructuresDistance functions for binary vectorsDistance functions for real-valued vectorsDistance functions for real-valued vectorsPartitioning algorithms: basic conceptThe k-means problemAlgorithmic properties of the k-means problemThe k-means algorithmProperties of the k-means algorithmTwo different K-means ClusteringsDiscussion k-means algorithmSlide 22The k-median problemThe k-medoids algorithmDiscussion of PAM algorithmCLARA (Clustering Large Applications)The k-center problemAlgorithmic properties of the k-centers problemThe farthest-first traversal algorithmThe farthest-first traversal is a 2-approximation algorithmThe farthest-first traversal is a 2-approximation algorithmThe farthest-first traversal is a 2-approximation algorithmWhat is the right number of clusters?Occam’s razor and the minimum description length principleClusteringLecture outline•Distance/Similarity between data objects•Data objects as geometric data points•Clustering problems and algorithms –K-means–K-median–K-centerWhat is clustering?•A grouping of data objects such that the objects within a group are similar (or related) to one another and different from (or unrelated to) the objects in other groupsInter-cluster distances are maximizedIntra-cluster distances are minimizedOutliers •Outliers are objects that do not belong to any cluster or form clusters of very small cardinality•In some applications we are interested in discovering outliers, not clusters (outlier analysis)clusteroutliersWhy do we cluster?•Clustering : given a collection of data objects group them so that–Similar to one another within the same cluster–Dissimilar to the objects in other clusters•Clustering results are used:–As a stand-alone tool to get insight into data distribution•Visualization of clusters may unveil important information–As a preprocessing step for other algorithms•Efficient indexing or compression often relies on clusteringApplications of clustering?•Image Processing–cluster images based on their visual content•Web–Cluster groups of users based on their access patterns on webpages–Cluster webpages based on their content•Bioinformatics–Cluster similar proteins together (similarity wrt chemical structure and/or functionality etc)•Many more…The clustering task•Group observations into groups so that the observations belonging in the same group are similar, whereas observations in different groups are different•Basic questions:–What does “similar” mean–What is a good partition of the objects? I.e., how is the quality of a solution measured–How to find a good partition of the observationsObservations to cluster•Real-value attributes/variables–e.g., salary, height•Binary attributes–e.g., gender (M/F), has_cancer(T/F)•Nominal (categorical) attributes–e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.)•Ordinal/Ranked attributes–e.g., military rank (soldier, sergeant, lutenant, captain, etc.)•Variables of mixed types–multiple attributes with various typesObservations to cluster•Usually data objects consist of a set of attributes (also known as dimensions)•J. Smith, 20, 200K•If all d dimensions are real-valued then we can visualize each data point as points in a d-dimensional space•If all d dimensions are binary then we can think of each data point as a binary vectorDistance functions•The distance d(x, y) between two objects xand y is a metric if–d(i, j)0 (non-negativity)–d(i, i)=0 (isolation)–d(i, j)= d(j, i) (symmetry)–d(i, j) ≤ d(i, h)+d(h, j) (triangular inequality) [Why do we need it?]•The definitions of distance functions are usually different for real, boolean, categorical, and ordinal variables.•Weights may be associated with different variables based on applications and data semantics.Data Structures•data matrix•Distance matrixndx...nx...n1x...............idx...ix...i1x...............1dx...1x...11x0...)2,()1,(:::)2,3()...ndnd0dd(3,10d(2,1)0attributes/dimensionstuples/objectsobjectsobjectsDistance functions for binary vectorsQ1 Q2 Q3 Q4 Q5 Q6X 1 0 0 1 1 1Y 0 1 1 0 1 0•Jaccard similarity between binary vectors X and Y•Jaccard distance between binary vectors X and YJdist(X,Y) = 1- JSim(X,Y)•Example:•JSim = 1/6•Jdist = 5/6YXYXYXJSim),(Distance functions for real-valued vectors•Lp norms or Minkowski distance:where p is a positive integer•If p = 1, L1 is the Manhattan (or city block) distance:pdiiyixppdxdxpyxpyxyxpL/11)(/1||...|22||11|),(diiyixdydxyxyxyxL1||...|22|||),(111Distance functions for real-valued vectors•If p = 2, L2 is the Euclidean distance:•Also one can use weighted distance:•Very often Lpp is used instead of Lp (why?))||...|22||11(|),(222dydxyxyxyxd )||...|22|2|11|1(),(222dydxdwxxwxxwyxd dydxdwyxwyxwyxd  ...222111),(Partitioning algorithms: basic concept•Construct a partition of a set of n objects into a set of k clusters–Each object belongs to exactly one cluster–The number of clusters k is given in advanceThe k-means problem•Given a set X of n points in a d-dimensional space and an integer k•Task: choose a set of k points {c1, c2,…,ck} in the d-dimensional space to form clusters {C1, C2,…,Ck} such thatis minimized•Some special cases: k = 1, k = n  ki CxiicxLCCost122)(Algorithmic properties of the k-means problem•NP-hard if the dimensionality of the data is at least 2 (d>=2)•Finding the best solution in polynomial time is infeasible•For d=1 the problem is solvable in polynomial time (how?)•A simple iterative algorithm works quite well in practiceThe k-means algorithm•One way of solving the k-means problem•Randomly pick k cluster centers {c1,…,ck}•For each i, set the cluster Ci to be the set of points in X that are closer to ci than they are to cj for all i≠j•For each i let ci be the center of cluster Ci (mean of the vectors in Ci)•Repeat until convergenceProperties of the k-means algorithm•Finds a local optimum•Converges often


View Full Document

BU CS 565 - Clustering

Documents in this Course
Load more
Download Clustering
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Clustering and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Clustering 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?