Unformatted text preview:

Clustering Billions of Images with Large Scale Nearest Neighbor Search Ting Liu Charles Rosenberg Henry A Rowley IEEE Workshop on Applications of Computer Vision February 2007 Presented by Dafna Bitton on May 6th 2008 for CSE 291 1 Problem Statement Goal 1 Find approximate nearest neighbors for a repository of over one billion images Goal 2 Perform clustering based on the results 2 Context of the Task Billions of images on the web Modern image search is text based largely due to so many images Scale makes most computer vision tasks infeasible in real time 3 Nearest Neighbor Search NNS Applications First step for Image clustering Object recognition and classification Useful for Organizing the images on the web by finding near duplicate images of items such as CD covers 4 Outline Background Brute force nearest neighbor search k D trees Metric Trees Spill Trees Hybrid Spill Trees Image preprocessing Parallel computing framework and data partition MapReduce Using MapReduce for parallel version of Hybrid Spill Trees Results 5 NNS Math Framework Assume a d dimensional space S Assume a set of points T S Assume a distance measure Given a new point p S we want to find the point v T that is most similar to p 6 Brute force NNS Given a new point p S compute the distance between p and every point v T Whichever point in T has the smallest distance is the nearest neighbor 7 k D Trees Axis parallel partitions of the data Root of the tree represents the entire space Invariant the union of each level of the tree represents the entire space 8 Example of k D Trees Status of k D tree 9 Example of k D Trees cont Status of k D tree 10 Example of k D Trees cont Status of k D tree 11 Example of k D Trees cont Ideal case when searching nearest neighbor falls into the same node as the query NN query point 12 Example k D Trees cont Unfortunate case when searching nearest neighbor falls into a different node as the query query point NN Must do backtracking 13 Metric ball Trees Same as k D trees except we use hyperspheres to partition the data 14 Example of Metric Trees Status of metric tree 15 Example of Metric Trees cont Status of metric tree Child ownership cannot overlap but spheres can 16 Example of Metric Trees cont Status of metric tree Invariant x sphere d center of sphere x radius of sphere 17 Example of Metric Trees cont query point 18 Searching with Metric Trees Guided depth first search DFS with pruning Descend the tree to reach the hypersphere leaf node where the query lies Assign a candidate NN x with distance r from the query If DFS is about to visit a node v but no member of v can be within distance r from the query prune this node do not visit it or any of its children This is whenever v center q v radius r 19 Spill Trees Similar to Metric Trees except that the children of a single node can share data points 20 Metric vs Spill Let N v denote the set of points represented by node v Let v lc and v rc denote the left and right children of v In Metric Trees N v N v lc N v rc N v lc N v rc In Spill Trees N v N v lc N v rc N v lc N v rc 21 Constructing a Spill Tree Given a node v we choose two pivot points v lpv N v and v rpv N v ideally such that they are maximally separated Specifically v lpv v rpv max p1 p 2 N v p1 p 2 22 Constructing a Spill Tree cont Project all the data points down to the vector u v rpv v lpv Find the midpoint A along u L denotes the d 1 dimensional plane orthogonal to u which goes through A L is known as the decision boundary 23 Constructing a Spill Tree cont We define two separating planes LL and LR both parallel to and at distance from L LL and LR define a stripe also known as the overlap buffer Metric Trees have empty stripes All data points to the right of LL belong in v rc All data points to the left of LR belong in v lc All data points in the stripe are shared by v lc and v rc 24 25 Spill Tree NN Search Use defeatist search which descends the tree according the the decision boundary L at each node without backtracking outputting the point x in the first leaf node visited Not guaranteed to find the correct NN Wider stripe means slower search but more accurate 26 Drawbacks of Spill Trees The depth of Spill Trees varies considerably depending on where 2 is the overlap buffer size If 0 the Spill Tree acts as a Metric Tree If v rpv v lpv 2 then N v lc N v rc N v and construction of a Spill Tree does not even terminate giving it a depth of To address this we use Hybrid Spill Trees 27 Hybrid Spill Trees Define a balance threshold 1 usually set to 70 For each node v we first split the data points using the overlapping buffer If either of its children contains more than fraction of the total data points in v we undo the overlapping splitting instead use a conventional metric tree partition and mark v as a non overlapping node This ensures that each split reduces the number of data points of a node by a constant factor maintaining logarithmic depth of the tree 28 Hybrid Spill Tree Search Hybrid of Metric Tree DFS and defeatist search Only do defeatist search on overlapping nodes For non overlapping nodes we still do backtracking as Metric Tree DFS 29 The Drawbacks All of these algorithms were designed to run on a single machine In our case our data cannot all fit on a single machine and disk access is too slow Noise affects distance metric Curse of dimensionality Authors will address these drawbacks using a new variant of spill trees 30 Outline Background Brute force nearest neighbor search k D trees Metric Trees Spill Trees Hybrid Spill Trees Image preprocessing Parallel computing framework and data partition MapReduce Using MapReduce for parallel version of Hybrid Spill Trees Results 31 Image Preprocessing Normalize each image Scale the image to a fixed size of 64x64 pixels each pixel is 3 bytes Convert image to Haar wavelet domain All but the largest 60 magnitude coefficients are set to 0 and the remaining ones are quantized to 1 So far the feature vector is 64x64x3 which is still fairly large 32 Image Preprocessing cont Random projection using random unitlength vectors is used to reduce the dimensionality to 100 dimensions 4 additional features are added The average of each color channel The aspect ratio w w h Now the feature vectors are of dimensionality 104 33 Outline Background Brute force nearest neighbor search k D trees Metric Trees Spill Trees …


View Full Document

UCSD CSE 291 - Clustering Billions of Images

Documents in this Course
Bluegene

Bluegene

23 pages

TinyECC

TinyECC

19 pages

MultiNet

MultiNet

18 pages

Lecture 2

Lecture 2

23 pages

AdaBoost

AdaBoost

25 pages

Lecture 9

Lecture 9

46 pages

Lecture

Lecture

5 pages

GPSR

GPSR

18 pages

Load more
Loading Unlocking...
Login

Join to view Clustering Billions of Images and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Clustering Billions of Images and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?