CSCI 5417: Information Retrieval Systems
Jim Martin
Lecture 15, 10/13/2011

Today (10/13)
- More clustering
  - Finish flat clustering
  - Hierarchical clustering

K-Means
- Assumes documents are real-valued vectors.
- Clusters are based on the centroids (aka the centers of gravity, or means) of the points in each cluster c:

    \mu(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}

- Instances are iteratively reassigned to clusters based on their distance to the current cluster centroids.
  - (Or one can equivalently phrase it in terms of similarities.)

K-Means Algorithm
- Select K random docs {s_1, s_2, ..., s_K} as seeds.
- Until the stopping criterion is met:
  - For each doc d_i: assign d_i to the cluster c_j such that dist(d_i, s_j) is minimal.
  - For each cluster c_j: s_j = \mu(c_j).

K-Means Example (K = 2)
[Figure: pick seeds; assign clusters; compute centroids; reassign clusters; recompute centroids; reassign clusters; converged.]

Termination conditions
- Several possibilities:
  - A fixed number of iterations
  - The doc partition is unchanged
  - The centroid positions don't change

Convergence (Sec. 16.4)
- Why should the K-means algorithm ever reach a fixed point, i.e., a state in which the clusters don't change?
- K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm, and EM is known to converge.
- The number of iterations could be large, but in practice it usually isn't.

Seed Choice (Sec. 16.4)
- Results can vary based on the random seed selection.
- Some seeds can result in a poor convergence rate, or in convergence to sub-optimal clusterings.
- Remedies:
  - Select good seeds using a heuristic (e.g., a doc least similar to any existing mean)
  - Try out multiple starting points
  - Initialize with the results of another method

Exercise
[Figure: a set of example points. Do this with K = 2.]
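To make the loop concrete, here is a minimal NumPy sketch of the K-means procedure above. This is an illustrative sketch, not code from the lecture: the random seeding, the Euclidean distance, the `max_iters` cap, and the `kmeans` name are all assumptions.

```python
# Minimal K-means sketch (illustrative; seeding and distance choices are assumptions).
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Cluster the rows of X (one doc vector per row) into k clusters."""
    rng = np.random.default_rng(seed)
    # Seed choice: K random docs as the initial centroids (cf. the Seed Choice slide).
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assignments = None
    for _ in range(max_iters):
        # Reassignment: each doc goes to the cluster with the nearest current centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        # Termination condition: stop when the doc partition is unchanged.
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # Centroid update: mu(c) = (1/|c|) * sum of the vectors in c.
        for j in range(k):
            members = X[assignments == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return assignments, centroids

# Usage: labels, centers = kmeans(np.random.rand(100, 10), k=2)
```

If both the document vectors and the centroids are length-normalized, minimizing Euclidean distance and maximizing cosine similarity produce the same assignments, which is one way to read the "equivalently in terms of similarities" remark above.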
Hierarchical Clustering
- Build a tree-based hierarchical taxonomy (a dendrogram) from a set of unlabeled examples.
[Figure: an example taxonomy. "animal" splits into "vertebrate" (fish, reptile, amphib., mammal) and "invertebrate" (worm, insect, crustacean).]

Dendrogram: Hierarchical Clustering
- A traditional clustering partition is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.

Break: Past HW
- Best score on part 2 is .437
- Best approaches:
  - Multifield indexing of title/keywords/abstract
  - Snowball (English), Porter
  - Tuning the stop list
  - Ensemble (voting)
- Mixed results:
  - Boosts
  - Relevance feedback

Descriptions
- For the most part, your approaches were pretty weak (or your descriptions were):
  - Failed to report R-Precision
  - No systematic approach: "X didn't work", with no attention to interactions between approaches
  - Lack of details: "used relevance feedback and it gave me Z", "I changed the stop list", "boosted the title field", etc.

Next HW
- Due 10/25
- I have a new, untainted test set, so don't worry about checking for the test document; it won't be there.

Hierarchical Clustering Algorithms
- Agglomerative (bottom-up):
  - Start with each document being a single cluster.
  - Eventually all documents belong to the same cluster.
- Divisive (top-down):
  - Start with all documents belonging to the same cluster.
  - Eventually each document forms a cluster of its own.
- Neither requires the number of clusters k to be known in advance, but both need a cutoff or threshold condition to stop.

Hierarchical -> Partition
- Run the algorithm to completion, then take a slice across the tree at some level. This produces a partition.
- Or insert an early stopping condition into either the top-down or the bottom-up algorithm.

Hierarchical Agglomerative Clustering (HAC)
- Assumes a similarity function for determining the similarity of two instances and of two clusters.
- Starts with all instances in separate clusters and then repeatedly joins the two clusters that are most similar, until there is only one cluster.
- The history of merging forms a binary tree or hierarchy.

Hierarchical Clustering
- Key problem: as you build clusters, how do you represent each cluster so you can tell which pair of clusters is closest?

"Closest Pair" in Clustering
- Many variants for defining the closest pair of clusters:
  - Single-link: similarity of the most cosine-similar pair of points
  - Complete-link: similarity of the "furthest" points, i.e., the least cosine-similar pair
  - "Center of gravity": clusters whose centroids (centers of gravity) are the most cosine-similar
  - Average-link: average cosine between all pairs of elements

Single-Link Agglomerative Clustering
- Use the maximum similarity over pairs:

    sim(c_i, c_j) = \max_{x \in c_i, y \in c_j} sim(x, y)

- Can result in "straggly" (long and thin) clusters due to the chaining effect.
- After merging c_i and c_j, the similarity of the resulting cluster to another cluster c_k is:

    sim(c_i \cup c_j, c_k) = \max(sim(c_i, c_k), sim(c_j, c_k))

- (Both this and the complete-link rule are implemented in the sketch at the end of these notes.)

Single-Link Example
[Figure: a single-link clustering example.]

Complete-Link Agglomerative Clustering
- Use the minimum similarity over pairs:

    sim(c_i, c_j) = \min_{x \in c_i, y \in c_j} sim(x, y)

- Makes "tighter," more spherical clusters that are typically preferable.
- After merging c_i and c_j, the similarity of the resulting cluster to another cluster c_k is:

    sim(c_i \cup c_j, c_k) = \min(sim(c_i, c_k), sim(c_j, c_k))

Complete-Link Example
[Figure: a complete-link clustering example.]

Misc. Clustering Topics
- Clustering terms
- Clustering people
- Feature selection
- Labeling clusters

Term vs. Document Space
- So far, we have clustered docs based on their similarities in term space.
- For some applications, e.g., topic analysis for inducing navigation structures, you can "dualize":
  - Use docs as axes
  - Represent (some) terms as vectors
  - Cluster terms, not docs
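As a companion to the single-link and complete-link slides, here is a deliberately naive HAC sketch over a precomputed similarity matrix. It is an illustrative sketch, not the lecture's code: the `hac` name, the pluggable `merge_rule`, and the merge-history return value are all assumptions, and library implementations such as scipy.cluster.hierarchy.linkage are far more efficient.

```python
# Naive HAC sketch: merge_rule=max gives single-link, merge_rule=min gives complete-link.

def hac(sim, merge_rule=max):
    """Agglomerate n items given an n x n similarity matrix `sim`.

    Returns the merge history as (members_of_a, members_of_b, similarity) triples;
    the history forms the binary tree / dendrogram described above.
    """
    clusters = {i: [i] for i in range(len(sim))}

    def cluster_sim(a, b):
        # Single-link: max over cross-cluster pairs; complete-link: min.
        return merge_rule(sim[i][j] for i in clusters[a] for j in clusters[b])

    history = []
    while len(clusters) > 1:
        # Find the closest pair of current clusters under the chosen rule.
        ids = list(clusters)
        a, b = max(((x, y) for i, x in enumerate(ids) for y in ids[i + 1:]),
                   key=lambda pair: cluster_sim(*pair))
        history.append((clusters[a][:], clusters[b][:], cluster_sim(a, b)))
        clusters[a] += clusters[b]   # merge b into a
        del clusters[b]
    return history

# Usage (cosine similarities if the rows of `docs` are unit-normalized):
#   import numpy as np
#   docs = np.random.rand(6, 4); docs /= np.linalg.norm(docs, axis=1, keepdims=True)
#   hac(docs @ docs.T, merge_rule=max)   # single-link
#   hac(docs @ docs.T, merge_rule=min)   # complete-link
```

Recomputing over all cross-cluster pairs agrees with the merge formulas above: after merging c_i and c_j, the max (resp. min) over pairs with c_k is exactly max(sim(c_i, c_k), sim(c_j, c_k)) (resp. min). Production code would update a similarity matrix incrementally rather than recomputing it.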

