Berkeley ELENG 225B - Efficient Visual Search for Objects in Videos

Efficient Visual Search for Objects in Videos
Josef Sivic and Andrew Zisserman
Presenters: Ilge Akkaya & Jeannette Chang
March 1, 2011

Slide outline: Introduction · State-of-the-Art before this paper · Overview of the Talk · Pre-Processing (Offline): detection of affine covariant regions, tracking regions through video and rejecting unstable regions, visual indexing using text-retrieval methods, visual vocabulary, K-means clustering, examples of clusters of regions, removal of stop-listed words, inverted file indexing · Real-Time Query and Spatial Consistency Voting · Implementation Details and an Example Visual Search · Retrieval Examples and Performance of the Algorithm · Examples of Missed Shots · Qualitative Assessment of Performance · Quality of Individual Visual Words · Searching for Objects From Outside of the Movie · Challenges I-III: visual vocabularies for very large scale retrieval, retrieval of 3D objects, verification using spatial structure · Conclusion and Looking Forward · Demo of the Retrieval Algorithm · Main References

Introduction
- A text query returns documents as results; here, an image query returns video frames as results.
- The goal is to generalize text retrieval methods to non-textual (visual) information.

State-of-the-Art before this paper
- Text-based search for images (Google Images).
- Object recognition:
  - Barnard et al. (2003), "Matching words and pictures"
  - Sivic et al. (2005), "Discovering objects and their location in images"
  - Sudderth et al. (2005), "Learning hierarchical models of scenes, objects, and parts"
- Scene classification:
  - Fei-Fei and Perona (2005), "A Bayesian hierarchical model for learning natural scene categories"
  - Quelhas et al. (2005), "Modeling scenes with local descriptors and latent aspects"
  - Lazebnik et al. (2006), "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories"

Introduction (cont.)
- Retrieve specific objects rather than categories of objects or scenes (a "Camry" logo vs. cars in general).
- Employ text retrieval techniques for visual search, with images as both queries and results.
- Why a text retrieval approach? Matches are essentially precomputed, so there is no delay at run time, and any object in the video can be retrieved without modifying the descriptors originally built for it.

Overview of the Talk
- Visual search algorithm: offline pre-processing and real-time query
- A few implementation details
- Performance: general results, testing individual words, using external images as queries
- A few challenges and future directions
- Concluding remarks
- Demo of the algorithm

Pre-Processing (Offline)
1. For each frame, detect affine covariant regions.
2. Track the regions through the video and reject unstable regions.
3. Build a visual vocabulary.
4. Remove stop-listed visual words.
5. Compute tf-idf weighted document frequency vectors.
6. Build an inverted file indexing structure.

Step 1: Detection of Affine Covariant Regions
- Typically ~1200 elliptical regions per frame (720x576).
- Each region is represented by a 128-dimensional SIFT vector; combined with the affine covariant detection, this provides invariance to affine transformations.
- Two types of affine covariant regions are used:
  1. Shape-Adapted (SA), Mikolajczyk et al.: elliptical shape adaptation about a Harris interest point, often centered on corner-like features.
  2. Maximally Stable (MS), proposed by Matas et al.: intensity watershed image segmentation, yielding high-contrast blobs.
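Neither detector is spelled out in the preview, but the MS regions correspond to MSER, which OpenCV ships alongside SIFT. Below is a minimal sketch of the detect-and-describe step for the MS half, assuming OpenCV (cv2); the Shape-Adapted (Harris-affine) detector is not in stock OpenCV, and the ellipse-to-keypoint conversion is an illustrative approximation rather than the authors' exact pipeline.

```python
import cv2

def detect_ms_regions(frame_bgr):
    """Detect MS-style regions with MSER and describe each with a
    128-dimensional SIFT vector (an approximation of step 1)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)

    # Maximally stable regions: high-contrast blobs found by an
    # intensity-watershed style analysis of the grayscale image.
    regions, _ = cv2.MSER_create().detectRegions(gray)

    # Approximate each pixel region by its fitted ellipse and wrap it
    # in a keypoint so SIFT computes a descriptor over that region.
    keypoints = []
    for pts in regions:
        if len(pts) < 5:              # fitEllipse needs >= 5 points
            continue
        (cx, cy), (w, h), angle = cv2.fitEllipse(pts)
        keypoints.append(cv2.KeyPoint(float(cx), float(cy),
                                      float(max(w, h)), float(angle)))

    keypoints, descriptors = cv2.SIFT_create().compute(gray, keypoints)
    return keypoints, descriptors     # descriptors: M x 128
```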
cars)Employ text retrieval techniques for visual search, with images as queries and resultsWhy Text Retrieval Approach?Matches essentially precomputed so that no delay at run timeAny object in video can be retrieved without modification of descriptors originally built for videoOverview of the TalkVisual Search Algorithm Offline Pre-Processing Real-Time QueryA Few Implementation DetailsPerformanceGeneral ResultsTesting Individual Words Using External Images As QueriesA Few Challenges and Future DirectionsConcluding RemarksDemo of the AlgorithmOverview of the TalkVisual Search Algorithm Offline Pre-Processing Real-Time QueryA Few Implementation DetailsPerformanceGeneral ResultsTesting Individual Words Using External Images As QueriesA Few Challenges and Future DirectionsConcluding RemarksDemo of the AlgorithmPre-Processing (Offline)1. For each frame, detect affine covariant regions. 2. Track the regions through video and reject unstable regions3. Build visual vocabulary4. Remove stop-listed visual words5. Compute tf-idf weighted document frequency vectors6. Built inverted file-indexing structureTypically ~1200 regions / frame (720x576)Elliptical regionsEach region represented by 128-dimensional SIFT vectorSIFT features provide invariance against affine transformationsDetection of Affine Covariant RegionsTwo types of affine covariant regions:1. Shape-Adapted(SA): Mikolajczyk et al.Elliptical Shape adaptation about a Harris interest pointOften centered on corner-like features2. Maximally-Stable(MS):Proposed by Matas et al.Intensity watershed image segmentationHigh-contrast blobsPre-Processing (Offline)1. For each frame, detect affine covariant regions. 2. Track the regions through video and reject unstable regions3. Build visual vocabulary4. Remove stop-listed visual words5. Compute tf-idf weighted document frequency vectors6. Built inverted file-indexing structureTracking regions through video and rejecting unstable regionsAny region that does not survive for 3+ frames is rejectedThese regions are not potentially interestingReduces number of regions/frame to approx. 50% (~600/frame)Pre-Processing (Offline)1. For each frame, detect affine covariant regions. 2. Track the regions through video and reject unstable regions3. Build visual vocabulary4. Remove stop-listed visual words5. Compute tf-idf weighted document frequency vectors6. Built inverted file-indexing structureVisual Indexing Using Text-Retrieval MethodsTEXT IMAGERepresent words by the “stems”‘write’‘writing’ ‘write’‘written’ mapped to Cluster similar regions into ‘visual words’Stop-list common words ‘a/an/the’Stop-list common visual wordsRank search results according to how close the query words occur within retrieved documentUse spatial information to check retrieval consistencyVisual VocabularyPurpose: Cluster regions from multiple frames into fewer groups called ‘visual words’Each descriptor: 128-vectorK-means clustering (explain more)~300K descriptors mapped into 16K visual words(600 regions/frame x ~500 frames)(6000 SA, 10000 MS regions used)K-Means ClusteringPurpose: Cluster N data points (SIFT descriptors) into K clusters (visual words)K = desired number of cluster centers (mean points)Step 1: Randomly guess K mean pointsStep 2: Calculate nearest mean point to assign each data point to a cluster centerIn this paper, Mahalanobis distance is used to determine ‘nearest cluster center’. where ∑ is the covariance matrix for all descriptors,x2 is the length 128 mean

