Lecture outline
• Nearest-neighbor search in low dimensions
  – kd-trees
• Nearest-neighbor search in high dimensions
  – Locality-sensitive hashing (LSH)
• Applications to data mining

Definition
• Given: a set X of n points in R^d
• Nearest neighbor: for any query point q ∈ R^d, return the point x ∈ X minimizing D(x, q)
• Intuition: find the point in X that is closest to q

Motivation
• Learning: the nearest-neighbor rule
• Databases: retrieval
• Data mining: clustering
• Donald Knuth, in Vol. 3 of The Art of Computer Programming, called it the post-office problem, referring to the application of assigning a resident to the nearest post office

Nearest-neighbor rule / MNIST dataset "2"
(figure slides: the nearest-neighbor rule illustrated on MNIST handwritten-digit images)

Methods for computing NN
• Linear scan: O(nd) time
• This is pretty much all that is known for exact algorithms with theoretical guarantees
• In practice:
  – kd-trees work "well" in "low-to-medium" dimensions

2-dimensional kd-trees
• A data structure to support range queries in R^2
  – Not the most efficient solution in theory
  – Everyone uses it in practice
• Preprocessing time: O(n log n)
• Space complexity: O(n)
• Query time: O(n^{1/2} + k), where k is the number of reported points

2-dimensional kd-trees (Cont'd)
• Algorithm:
  – Choose the x or y coordinate (alternating between levels)
  – Choose the median of that coordinate; this defines a horizontal or vertical splitting line
  – Recurse on both sides
• We get a binary tree:
  – Size O(n)
  – Depth O(log n)
  – Construction time O(n log n)

Construction of kd-trees / The complete kd-tree
(a sequence of figure slides illustrating the construction step by step, ending with the complete kd-tree)

Region of node v
• Region(v): the region of the plane associated with node v; the subtree rooted at v stores the points shown as black dots (figure)

Searching in kd-trees
• Range searching in 2-d
  – Given a set of n points, build a data structure that, for any query rectangle R, reports all points in R

kd-tree: range queries
• Recursive procedure starting from v = root
• Search(v, R):
  – If v is a leaf, report the point stored in v if it lies in R
  – Otherwise, if Region(v) is contained in R, report all points in subtree(v)
  – Otherwise:
    • If Region(left(v)) intersects R, then Search(left(v), R)
    • If Region(right(v)) intersects R, then Search(right(v), R)
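The range-search procedure above translates almost directly into code. Below is a minimal Python sketch, assuming points are (x, y) tuples and rectangles are (xlo, xhi, ylo, yhi) 4-tuples; all identifiers (KDNode, build, search, and the helpers) are illustrative names of mine, not from the slides.

    # Sketch of a 2-d kd-tree: median construction plus the Search(v, R)
    # procedure described above. Points: (x, y); rectangles: (xlo, xhi, ylo, yhi).
    INF = float("inf")

    class KDNode:
        def __init__(self):
            self.point = None    # set only at leaves
            self.axis = None     # 0: split on x (vertical line), 1: split on y
            self.split = None    # median coordinate of the splitting line
            self.left = None
            self.right = None

    def build(points, depth=0):
        node = KDNode()
        if len(points) == 1:                      # one point left: store as a leaf
            node.point = points[0]
            return node
        node.axis = depth % 2                     # alternate x / y between levels
        pts = sorted(points, key=lambda p: p[node.axis])
        mid = len(pts) // 2
        node.split = pts[mid - 1][node.axis]      # left half: coordinate <= split
        node.left = build(pts[:mid], depth + 1)
        node.right = build(pts[mid:], depth + 1)
        return node

    def contains(outer, inner):
        """Is rectangle `inner` fully contained in rectangle `outer`?"""
        return (outer[0] <= inner[0] and inner[1] <= outer[1] and
                outer[2] <= inner[2] and inner[3] <= outer[3])

    def intersects(a, b):
        return a[0] <= b[1] and b[0] <= a[1] and a[2] <= b[3] and b[2] <= a[3]

    def report_subtree(node, out):
        if node.point is not None:
            out.append(node.point)
        else:
            report_subtree(node.left, out)
            report_subtree(node.right, out)

    def search(node, R, region=(-INF, INF, -INF, INF), out=None):
        """Search(v, R): report all points of the tree that lie in R."""
        if out is None:
            out = []
        if node.point is not None:                # v is a leaf
            x, y = node.point
            if R[0] <= x <= R[1] and R[2] <= y <= R[3]:
                out.append(node.point)
        elif contains(R, region):                 # Region(v) contained in R
            report_subtree(node, out)
        else:
            # Split Region(v) along the node's median line and recurse on
            # each child whose region intersects the query rectangle.
            if node.axis == 0:
                lreg = (region[0], node.split, region[2], region[3])
                rreg = (node.split, region[1], region[2], region[3])
            else:
                lreg = (region[0], region[1], region[2], node.split)
                rreg = (region[0], region[1], node.split, region[3])
            if intersects(R, lreg):
                search(node.left, R, lreg, out)
            if intersects(R, rreg):
                search(node.right, R, rreg, out)
        return out

    pts = [(1, 3), (2, 7), (4, 1), (5, 6), (7, 2), (8, 5)]
    tree = build(pts)
    print(search(tree, (2, 7, 1, 6)))   # -> [(4, 1), (7, 2), (5, 6)]

Note that Region(v) is maintained on the way down rather than stored in each node, which keeps the space at O(n).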
Query time analysis
• We will show that Search takes at most O(n^{1/2} + P) time, where P is the number of reported points
  – The total time needed to report all points in all subtrees is O(P)
  – We just need to bound the number of nodes v such that Region(v) intersects R but is not contained in R (i.e., the boundary of R intersects the boundary of Region(v))
  – Gross overestimation: bound the number of regions Region(v) that are crossed by any of the 4 horizontal/vertical lines through the sides of R

Query time (Cont'd)
• Q(n): the maximum number of regions in an n-point kd-tree intersecting a (say, vertical) line ℓ
• If ℓ intersects Region(v) (split by a vertical line), then after two levels it intersects 2 regions (split by 2 vertical splitting lines), each containing n/4 of the points
• The number of regions intersecting ℓ therefore satisfies Q(n) = 2 + 2Q(n/4), which solves to Q(n) = Θ(n^{1/2})

d-dimensional kd-trees
• A data structure to support range queries in R^d
• Preprocessing time: O(n log n)
• Space complexity: O(n)
• Query time: O(n^{1-1/d} + k)

Construction of the d-dimensional kd-trees
• The construction algorithm is similar to the 2-d case
• At the root we split the set of points into two subsets of the same size by a hyperplane perpendicular to the x1-axis
• At the children of the root, the partition is based on the second coordinate: the x2-coordinate
• At depth d, we start all over again, partitioning on the first coordinate
• The recursion stops when there is only one point left, which is stored as a leaf

Locality-sensitive hashing (LSH)
• Idea: construct hash functions h: R^d → U such that for any pair of points p, q:
  – If D(p, q) ≤ r, then Pr[h(p) = h(q)] is high
  – If D(p, q) ≥ cr, then Pr[h(p) = h(q)] is small
• Then we can solve the "approximate NN" problem by hashing
• LSH is a general framework; for a given distance D we need to find the right h

Approximate Nearest Neighbor
• Given a set of points X in R^d and a query point q ∈ R^d, c-approximate r-nearest-neighbor search:
  – Returns a point p ∈ X with D(p, q) ≤ r
  – Returns NO if there is no p' ∈ X with D(p', q) ≤ cr

Locality-Sensitive Hashing (LSH)
• A family H of functions h: R^d → U is called (P1, P2, r, cr)-sensitive if for any p, q:
  – If D(p, q) ≤ r, then Pr[h(p) = h(q)] ≥ P1
  – If D(p, q) ≥ cr, then Pr[h(p) = h(q)] ≤ P2
• We require P1 > P2
• Example: Hamming distance
  – LSH functions: h(p) = p_i, i.e., the i-th bit of p
  – Probabilities: Pr[h(p) = h(q)] = 1 − D(p, q)/d

Algorithm -- preprocessing
• g(p) = <h1(p), h2(p), …, hk(p)>
• Preprocessing:
  – Select g1, g2, …, gL
  – For all p ∈ X, hash p to the buckets g1(p), …, gL(p)
  – Since the number of possible buckets might be large, we maintain only the non-empty ones
• Running time?

Algorithm -- query
• Query q:
  – Retrieve the points from buckets g1(q), g2(q), …, gL(q), and let the retrieved points be x1, …, xL
    • If D(xi, q) ≤ r, report it
    • Otherwise, report that no such near neighbor exists
  – Answer the query based on the retrieved points
  – Time: O(dL)

Applications of LSH in data mining
• Numerous…

Applications
• Find pages with similar sets of words (for clustering or classification)
• Find users in Netflix data who watch similar movies
• Find movies with similar sets of users
• Find images of related things

How would you do it?
• Finding very similar items might be a computationally demanding task
• We can relax our requirement to finding somewhat similar items
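To make the LSH preprocessing and query steps described above concrete, here is a minimal Python sketch for Hamming distance over d-bit vectors, using h(p) = p_i and g(p) = <h1(p), …, hk(p)>. The parameter choices (k, L, the seed) and all identifiers are illustrative assumptions of mine, not from the slides.

    import random

    # Sketch of the LSH scheme above for Hamming distance on d-bit vectors.
    def hamming(p, q):
        return sum(a != b for a, b in zip(p, q))

    def make_g(d, k, rng):
        """One g: a tuple of k random bit positions (k concatenated h's)."""
        return tuple(rng.randrange(d) for _ in range(k))

    def project(g, p):
        """g(p) = <p_{i1}, ..., p_{ik}>, used as the bucket key of p."""
        return tuple(p[i] for i in g)

    def preprocess(X, d, k, L, seed=0):
        """Hash every p in X into L tables; only non-empty buckets are kept."""
        rng = random.Random(seed)
        gs = [make_g(d, k, rng) for _ in range(L)]
        tables = [dict() for _ in range(L)]   # bucket key -> list of points
        for p in X:
            for g, table in zip(gs, tables):
                table.setdefault(project(g, p), []).append(p)
        return gs, tables

    def query(q, r, gs, tables):
        """Return some retrieved point within distance r of q, else None."""
        for g, table in zip(gs, tables):
            for x in table.get(project(g, q), []):
                if hamming(x, q) <= r:
                    return x
        return None   # report that no such near neighbor was found

    # Example: 0/1 vectors of length d = 8.
    X = [(0,0,1,1,0,1,0,1), (1,1,1,1,0,0,0,0), (0,0,1,1,0,1,1,1)]
    gs, tables = preprocess(X, d=8, k=4, L=6)
    # Most likely prints (0, 0, 1, 1, 0, 1, 0, 1); the scheme is
    # probabilistic, so it may occasionally print None.
    print(query((0,0,1,1,0,1,0,0), r=2, gs=gs, tables=tables))

Increasing k makes each bucket more selective (fewer far points collide with q), while increasing L raises the probability that a true near neighbor collides with q in at least one of the L tables.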

