Slide titles in this lecture:
Lecture outline • Density-Based Clustering Methods • Classification of points in density-based clustering • Core, border and noise points • Clusters output by DBScan • DBSCAN: The Algorithm • Time and space complexity of DBSCAN • When DBSCAN Does NOT Work Well • Determining EPS and MinPts • Strengths and weaknesses of DBSCAN • Clustering • Co-Clustering • Motivation: Sponsored Search • Co-Clusters in Sponsored Search • Co-Clustering in Sponsored Search • Clustering of the rows • Clustering of the columns • Cost of clustering • Co-Clustering Objective Function • Some Background • Algorithm • Properties of the algorithm • Algorithm--details • From distance to points: Multi-dimensional scaling • Multi-Dimensional Scaling (MDS) • Problem • How can we do that? (Algorithm) • High-level view of the MDS algorithm • The MDS algorithm • Questions about MDS

Lecture outline
• Density-based clustering (DBSCAN)
  – Reference: Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD 1996
• Co-clustering (or bi-clustering)
  – References:
  – A. Anagnostopoulos, A. Dasgupta and R. Kumar: Approximation Algorithms for Co-Clustering. PODS 2008
  – K. Puolamäki, S. Hanhijärvi and G. Garriga: An Approximation Ratio for Biclustering. Information Processing Letters 2008

Density-Based Clustering Methods
• Clustering based on density (a local cluster criterion), such as density-connected points
• Major features:
  – Discovers clusters of arbitrary shape
  – Handles noise

Classification of points in density-based clustering
• Core points: interior points of a density-based cluster.
  A point p is a core point if, for a distance Eps:
  – |N_Eps(p)| = |{q : dist(p, q) <= Eps}| >= MinPts
• Border points: not core points, but within the neighborhood of a core point (a border point can lie in the neighborhoods of several core points)
• Noise points: neither core nor border points

Core, border and noise points
[Figure: a core, a border and a noise point, each shown with its Eps-neighborhood]

Core, Border and Noise points
[Figure: original points and their point types (core, border and noise) for Eps = 10, MinPts = 4]

Clusters output by DBScan
[Figure: original points and the clusters DBSCAN finds]
• Resistant to noise
• Can handle clusters of different shapes and sizes

DBSCAN: The Algorithm
– Label all points as core, border, or noise points
– Eliminate the noise points
– Put an edge between every pair of core points that are within Eps of each other
– Make each group of connected core points into a separate cluster
– Assign each border point to one of the clusters of its associated core points

Time and space complexity of DBSCAN
• For a dataset X consisting of n points, the time complexity of DBSCAN is O(n × time to find the points in an Eps-neighborhood)
• Worst case: O(n^2)
• In low-dimensional spaces: O(n log n); efficient data structures (e.g., kd-trees) allow efficient retrieval of all points within a given distance of a specified point

When DBSCAN Does NOT Work Well
[Figure: original points and DBSCAN output for (MinPts = 4, Eps = 9.75) and (MinPts = 4, Eps = 9.92)]
• Varying densities: DBSCAN can fail to identify clusters of varying densities
• High-dimensional data

Determining EPS and MinPts
• Idea: for points in a cluster, their k-th nearest neighbors are at roughly the same distance
• Noise points have their k-th nearest neighbor at a farther distance
• So, plot the sorted distance of every point to its k-th nearest neighbor; Eps can be read off where that curve bends sharply upward
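The five steps in "DBSCAN: The Algorithm" above translate almost directly into code. Below is a minimal pure-Python sketch; the function name, the brute-force O(n^2) neighborhood computation, and the use of -1 as the noise label are illustrative choices, not from the slides:

```python
from math import dist  # Euclidean distance, Python 3.8+

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch following the slides' steps: label points,
    connect core points within Eps, form clusters from connected core
    points, then attach border points. Noise is labeled -1."""
    n = len(points)
    # Eps-neighborhoods (each point counts in its own neighborhood)
    neigh = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
             for i in range(n)]
    # Core points: at least MinPts points in the Eps-neighborhood
    core = [i for i in range(n) if len(neigh[i]) >= min_pts]
    core_set = set(core)

    # Connected components over core points that are within Eps of each other
    labels = {}
    cluster_id = 0
    for c in core:
        if c in labels:
            continue
        stack = [c]
        while stack:
            p = stack.pop()
            if p in labels:
                continue
            labels[p] = cluster_id
            stack.extend(q for q in neigh[p]
                         if q in core_set and q not in labels)
        cluster_id += 1

    # Border points: non-core points within Eps of some core point
    for i in range(n):
        if i in labels:
            continue
        for q in neigh[i]:
            if q in core_set:
                labels[i] = labels[q]
                break
    # Anything still unlabeled is noise
    return [labels.get(i, -1) for i in range(n)]
```

On two tight groups of four points each plus one far-away point, the two groups come out as clusters 0 and 1 and the isolated point is labeled noise. Replacing the brute-force neighborhood scan with a kd-tree query is what brings the running time down to the O(n log n) case mentioned above.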
Strengths and weaknesses of DBSCAN
• Resistant to noise
• Finds clusters of arbitrary shapes and sizes
• Has difficulty identifying clusters with varying densities
• Problems in high-dimensional spaces: the notion of density is unclear
• Can be computationally expensive when the computation of nearest neighbors is expensive

Lecture outline
• Density-based clustering
• Co-clustering (or bi-clustering)

Clustering
[Figure: an m × n matrix A; each row is a point, and the rows are partitioned into k groups]
• m points in R^n
• Group them into k clusters
• Represent them by a matrix A ∈ R^{m×n}
  – A point corresponds to a row of A
• Cluster: partition the rows into k groups

Co-Clustering
[Figure: the same matrix A with both its rows and its columns partitioned; each co-cluster is the intersection of a row group and a column group (k = 2, ℓ = 2)]
• Co-clustering: cluster the rows and columns of A simultaneously

Motivation: Sponsored Search
• Sponsored search ads are the main revenue source for search engines
• Advertisers bid on keywords
• A user makes a query
• The engine shows ads of advertisers that are relevant and have high bids
• The user clicks an ad, or not

Motivation: Sponsored Search
• For every (advertiser, keyword) pair we have:
  – Bid amount
  – Impressions
  – Number of clicks
• Mine this information at query time
• Maximize the number of clicks / revenue

Co-Clusters in Sponsored Search
[Figure: advertiser × keyword bid matrix with advertisers such as Vancouver, Air France and Skis.com, and the bid of skis.com for "ski boots" highlighted]
• Markets = co-clusters: all the keywords in a co-cluster are relevant to a set of advertisers

Co-Clustering in Sponsored Search
Applications:
• Keyword suggestion: recommend to
advertisers other relevant keywords
• Broad matching / market expansion: include more advertisers for a query
• Isolate submarkets
  – Important for economists
  – Apply different advertising approaches to different submarkets
• Build taxonomies of advertisers / keywords

Clustering of the rows
[Figure: the matrix A with its rows partitioned into k groups]
• m points in R^n
• Group them into k clusters
• Represent them by a matrix A ∈ R^{m×n}
  – A point corresponds to a row of A
• Clustering: partitioning of the rows into k groups

Clustering of the
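The extract is cut off before the "Cost of clustering" and "Co-Clustering Objective Function" slides, so the deck's exact objective is not shown here. As an illustration only, one standard sum-of-squares style cost charges each matrix entry its squared deviation from the mean of its co-cluster; the helper below is a sketch under that assumption, not the slides' definition:

```python
def cocluster_cost(A, row_labels, col_labels):
    """Sum of squared deviations of each entry of A from the mean of
    its co-cluster, where entry (i, j) belongs to the co-cluster
    (row_labels[i], col_labels[j]). (Assumed objective, for illustration.)"""
    # Gather the entries of each (row-cluster, column-cluster) block
    blocks = {}
    for i, row in enumerate(A):
        for j, a in enumerate(row):
            blocks.setdefault((row_labels[i], col_labels[j]), []).append(a)
    # Charge each block its within-block sum of squared deviations
    cost = 0.0
    for vals in blocks.values():
        mu = sum(vals) / len(vals)
        cost += sum((v - mu) ** 2 for v in vals)
    return cost
```

Under this cost, a matrix with a clean 2 × 2 block structure (as in the k = 2, ℓ = 2 figure above) has zero cost for the correct row and column labels, and any misassignment makes the cost strictly positive.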


BU CS 565 - Lecture notes
