UW-Madison CS 766 - Video Google - A Text Retrieval Approach to Object Matching in Videos - D1338831

Home> Schools> University of Wisconsin, Madison> (CS) > CS 766> Video Google - A Text Retrieval Approach to Object Matching in Videos

DOC PREVIEW

UW-Madison CS 766 - Video Google - A Text Retrieval Approach to Object Matching in Videos

School name University of Wisconsin, Madison

Course Cs 766- Computer Vision

Pages 37

This preview shows page 1-2-17-18-19-36-37 out of 37 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 37 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 37 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 37 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 37 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 37 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 37 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 37 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 37 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Video Google: A Text Retrieval Approach to Object Matching in VideosJosef Sivic and Andrew ZissermanGoal• Google search for videos• Query is an portion of a frame of a video selected by the userGoogle Text Search• Web pages are parsed into words• Words are replaced by their root word• Stop list to filter common words• Remaining words represent that web page as a vector weighted based on word frequencyText Retrieval• Efficient retrieval for with an index• Text is retrieved by computing its vector of word frequencies, return documents with the closest vectors• Consider order and location of wordsApproach• Apply text search properties to image searchVideo Google: Descriptors• Compute two types of covariant regions: Shape Adapted and Maximally Stable• Regions computed in grayscaleDescriptorsDescriptors• Each elliptical region is then represented by a SIFT descriptor• Descriptor is averaged over the frames the region exists in• Reduce noise: filter regions which do not exist in more than 3 frames• Reject 10% of the regions with the largest diagonal covariance matrixBuild “Visual Words”• Quantize the descriptors into visual words for text retrieval• 1000 regions per frame and 128-vector descriptor• Select 48 scenes containing 10,000 frames• 200K descriptorsClustering descriptors• K-means clustering• Run several times with random initial assignments• D(x1, x2) = sqrt((x1 - x2)T∑-1(x1 - x2))• MS and SA regions are clustered separatelyIndexing using text retrieval methods• Term frequency - inverse document frequency used for weighting the words of a document• Retrieval: documents are ranked by their normalized scalar product between the query vector and all the document vectorsImage Retrieval• Video google: The visual words of the query are the visual words in the user-specified portion of a frame• Search the index with the visual words to find all the frames which contain the same word• Rank all the results, return the most relevant resultsStop List• Visual words in the top 5% and bottom 10% are stoppedSpatial Consistency• Google increases the ranking of documents where the query words appear close together in the searched text• In video: 15 nearest neighbors defines search area• Regions in this area by the query region vote on each match• Re-ranked on the number of votesEvaluation• Tested on feature length movies with 100K - 150K frames• Use one frame per second• Ground truth determined by hand• Retrieval performance measured by averaged rank of relevant imagesExampleQuestions?Scalable Recognition with a Vocabulary TreeDavid Nister andHenrik SteweniusVocabulary Tree• Continuation of Video google• 10,000 visual words in the database• Offline crawling stage to index video takes 10 seconds per frameVocabulary Tree• Too slow for a large database• Larger databases result in better retrieval quality• More words utilizes the power of the index: less database images must be considered• On the fly insertion of new objects into the databaseTraining• Training with hierarchical k-means• More efficient than k-means• 35,000 training frames instead of 400 with video googleFeature Extraction• Maximally Stable regions used only• Build SIFT descriptor from the regionBuilding Vocab Tree• Hierarchical k-means, with k being the number of children nodes• First run k-means to find k clusters• Recursively apply to each cluster L times• Visual words become the nodesPerformance• Increasing the size of the vocabulary is logarithmic• K = 10, L = 6: one million leaf nodesRetrieval• Determine the visual words from the query• Propagate the region descriptor down the tree selecting the closest cluster at each levelScoring• Determine the relevance of a query image to a database image based on the similarity of their paths down the tree• Use TD-IDF to assign weights to the query and database image vectorScoring• Use TD-IDF for weights of descriptor vectors• Normalized relevance score:• L1-normalization is the most effectiveResults• Tested on a ground truth database of 6,376 images• Groups of four images of the same objectResultsResults• Tested on a database of 1 million images of CD covers• Sub-second retrieval times for a database of a million images• Performance increases with the number of leaf

View Full Document