BU CS 565 - Density-based clustering (DB-Scan)

Lecture outline

• Density-based clustering (DBSCAN)
  – Reference: Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD 1996.
• Co-clustering (or bi-clustering)
  – References:
    A. Anagnostopoulos, A. Dasgupta and R. Kumar: Approximation Algorithms for Co-clustering. PODS 2008.
    K. Puolamäki, S. Hanhijärvi and G. Garriga: An Approximation Ratio for Biclustering. Information Processing Letters 2008.

Density-based clustering methods

• Clustering based on density (a local cluster criterion), such as density-connected points
• Major features:
  – Discovers clusters of arbitrary shape
  – Handles noise

Classification of points in density-based clustering

• Core points: interior points of a density-based cluster. A point p is a core point if, for a given distance Eps,
    |N_Eps(p)| = |{q | dist(p, q) ≤ Eps}| ≥ MinPts
• Border points: not core points themselves, but within the neighborhood of a core point (a border point can lie in the neighborhoods of several core points)
• Noise points: neither core nor border points

[Figure: core, border and noise points illustrated with their Eps-neighborhoods; example with Eps = 10, MinPts = 4]

Clusters output by DBSCAN

[Figure: original points and the clusters DBSCAN finds]

• Resistant to noise
• Can handle clusters of different shapes and sizes

DBSCAN: the algorithm

– Label all points as core, border, or noise points
– Eliminate the noise points
– Put an edge between every pair of core points that are within Eps of each other
– Make each group of connected core points into a separate cluster
– Assign each border point to one of the clusters of its associated core points

Time and space complexity of DBSCAN

• For a dataset X of n points, the time complexity of DBSCAN is O(n × time to find the points in an Eps-neighborhood)
• Worst case: O(n²)
• In low-dimensional spaces: O(n log n), since efficient data structures (e.g., kd-trees) allow retrieval of all points within a given distance of a query point

When DBSCAN does NOT work well

[Figure: original points and the DBSCAN clusterings for (MinPts = 4, Eps = 9.75) and (MinPts = 4, Eps = 9.92)]

• Varying densities: DBSCAN can fail to identify clusters of varying density
• High-dimensional data

Determining Eps and MinPts

• The idea is that for points in a cluster, their k-th nearest neighbors are at roughly the same distance
• Noise points have their k-th nearest neighbor at a farther distance
• So, plot the sorted distance of every point to its k-th nearestest neighbor

Strengths and weaknesses of DBSCAN

• Resistant to noise
• Finds clusters of arbitrary shapes and sizes
• Has difficulty identifying clusters of varying densities
• Has problems in high-dimensional spaces, where the notion of density is unclear
• Can be computationally expensive when the nearest-neighbor computation is expensive

Co-clustering (or bi-clustering)

Clustering of the rows

     3  0  6  8  9  7
     2  3  4 12  8 10
     1  2  3 10  9  8
     0  8  4  8  7  9
     2  4  3 11  9 10
    16 10 13  6  7  5
    10  8  9  2  3  7

• m points in Rⁿ, represented as a matrix A ∈ R^{m×n}; a point corresponds to a row of A
• Group them into k clusters
• Clustering: partition the rows into k groups

Clustering of the columns

• n points in R^m; a point corresponds to a column of A
• Group them into ℓ clusters
• Clustering: partition the columns into ℓ groups

Co-clustering

• Co-clustering: cluster the rows and columns of A simultaneously, e.g., into k = 2 row groups and ℓ = 2 column groups
• A co-cluster is the submatrix at the intersection of a row group and a column group
• For the matrix A above, row groups {1,…,5}, {6,7} and column groups {1,2,3}, {4,5,6} give the co-cluster representation R, in which every entry is replaced by its block's representative:

    A                        R
     3  0  6  8  9  7        3  3  3  9  9  9
     2  3  4 12  8 10        3  3  3  9  9  9
     1  2  3 10  9  8        3  3  3  9  9  9
     0  8  4  8  7  9        3  3  3  9  9  9
     2  4  3 11  9 10        3  3  3  9  9  9
    16 10 13  6  7  5       11 11 11  5  5  5
    10  8  9  2  3  7       11 11 11  5  5  5

Motivation: sponsored search

• Sponsored search is the main source of revenue for search engines:
  – Advertisers bid on keywords
  – A user makes a query
  – The engine shows the ads of advertisers that are relevant and have high bids
  – The user clicks (or not) on an ad
• For every (advertiser, keyword) pair we have:
  – the bid amount
  – the number of impressions
  – the number of clicks
• Goal: mine this information at query time to maximize the number of clicks / revenue

Co-clusters in sponsored search

[Figure: advertiser × keyword matrix; example advertisers Vancouver, Air France, Skis.com; the entry for (Skis.com, "ski boots") is the bid of Skis.com for "ski boots"]

• Markets = co-clusters: a set of keywords that are all relevant to a set of advertisers

Co-clustering in sponsored search: applications

• Keyword suggestion: recommend other relevant keywords to advertisers
• Broad matching / market expansion: include more advertisers for a query
• Isolating submarkets: important for economists; allows different advertising approaches
• Building taxonomies of advertisers / keywords

Cost of clustering

    Original data A          Data representation A′
     3  0  6  8  9  7        1.6  3.4  4.0  9.8  8.4  8.8
     2  3  4 12  8 10        1.6  3.4  4.0  9.8  8.4  8.8
     1  2  3 10  9  8        1.6  3.4  4.0  9.8  8.4  8.8
     0  8  4  8  7  9        1.6  3.4  4.0  9.8  8.4  8.8
     2  4  3 11  9 10        1.6  3.4  4.0  9.8  8.4  8.8
    16 10 13  6  7  5       13.0  9.0 11.0  4.0  5.0  6.0
    10  8  9  2  3  7       13.0  9.0 11.0  4.0  5.0  6.0

• In A′, every point of A (row or column) is replaced by the corresponding representative (row or column)
• The quality of the clustering is measured by the distances between the entries of A and A′
• k-means clustering: cost = ∑_{i=1…m} ∑_{j=1…n} (A(i,j) − A′(i,j))²
• k-median clustering: cost = ∑_{i=1…m} ∑_{j=1…n} |A(i,j) − A′(i,j)|
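The DBSCAN steps above (label core/border/noise points, drop noise, connect core points within Eps, grow clusters, attach border points) can be sketched in Python. This is a minimal brute-force sketch, not the lecture's code: Euclidean distance and the names `dbscan`, `eps`, `min_pts` are my own choices.

```python
from collections import deque

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    n = len(points)
    # Eps-neighborhoods by brute force: the O(n^2) worst case noted above.
    neighbors = [[j for j in range(n)
                  if euclidean(points[i], points[j]) <= eps]
                 for i in range(n)]
    # Core points: Eps-neighborhood (the point itself included) has >= MinPts points.
    core = {i for i in range(n) if len(neighbors[i]) >= min_pts}
    labels = [-1] * n
    cluster_id = 0
    for i in core:
        if labels[i] != -1:
            continue
        # Grow one cluster: BFS over core points connected within Eps.
        labels[i] = cluster_id
        queue = deque([i])
        while queue:
            p = queue.popleft()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster_id  # border points join a neighboring core's cluster
                    if q in core:
                        queue.append(q)     # only core points extend the cluster
        cluster_id += 1
    return labels
```

Replacing the brute-force neighborhood scan with a spatial index (e.g., a kd-tree) gives the O(n log n) behavior mentioned above for low-dimensional data.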
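The Eps/MinPts heuristic above (sorted distance of every point to its k-th nearest neighbor) can be sketched as follows; the function name is illustrative. One would plot the returned list and pick Eps near the knee of the curve, with MinPts = k.

```python
def kth_nn_distances(points, k):
    """Sorted list of each point's distance to its k-th nearest neighbor."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    result = []
    for i, p in enumerate(points):
        # Distances to all other points, ascending; the (k-1)-th entry is the k-th NN.
        ds = sorted(dist(p, points[j]) for j in range(len(points)) if j != i)
        result.append(ds[k - 1])
    return sorted(result)
```

Points inside a cluster contribute small, similar values; noise points stand out as the large values at the end of the list.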
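The row-clustering representation and k-means cost described above can be sketched directly: replace each row of A by its row-cluster mean and sum the squared entry-wise differences. This sketch (names are mine) reproduces the A′ matrix in the example, given the row grouping {1,…,5}, {6,7}.

```python
def row_cluster_representation(A, row_labels):
    """A' in which each row of A is replaced by the mean of its row cluster."""
    centroid = {}
    for c in set(row_labels):
        rows = [A[i] for i, r in enumerate(row_labels) if r == c]
        # Column-wise mean of the cluster's rows.
        centroid[c] = [sum(col) / len(rows) for col in zip(*rows)]
    return [centroid[r] for r in row_labels]

def kmeans_cost(A, A_repr):
    """Sum over i,j of (A(i,j) - A'(i,j))^2."""
    return sum((a - b) ** 2
               for row, rrow in zip(A, A_repr)
               for a, b in zip(row, rrow))
```

The same cost with absolute instead of squared differences gives the k-median objective.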
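For a co-clustering, a natural representative (under the k-means-style cost above) is the mean of each (row-group, column-group) block. This sketch, with names of my own choosing, recovers the block representatives 3, 9, 11, 5 of the example matrix:

```python
def cocluster_means(A, row_labels, col_labels):
    """Mean of each (row-group, column-group) block of A."""
    sums, counts = {}, {}
    for i, r in enumerate(row_labels):
        for j, c in enumerate(col_labels):
            sums[(r, c)] = sums.get((r, c), 0.0) + A[i][j]
            counts[(r, c)] = counts.get((r, c), 0) + 1
    return {key: sums[key] / counts[key] for key in sums}

def coclustering_cost(A, row_labels, col_labels):
    """Squared error of A against its co-cluster block means."""
    means = cocluster_means(A, row_labels, col_labels)
    return sum((A[i][j] - means[(r, c)]) ** 2
               for i, r in enumerate(row_labels)
               for j, c in enumerate(col_labels))
```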

