Using the Forest to See the Trees: Context-based Object Recognition
Bill Freeman
Computer Science and Artificial Intelligence Laboratory, MIT
Joint work with Antonio Torralba and Kevin Murphy

A computer vision goal
• Recognize many different objects under many viewing conditions in unconstrained settings.
• There has been progress on restricted cases:
  – one object and one pose (frontal view faces)
  – isolated objects on uniform backgrounds.
• But the general problem is difficult and unsolved.

How we hope to make progress on this hard problem
• Various technical improvements
• Exploit scene context:
  – “if this is a forest, these must be trees”.

Local (bottom-up) approach to object detection
• Classify image patches/features at each location and scale.
[Figure: local features $V_L$ feed a classifier $p(\text{car} \mid V_L)$; output: "No car"]

Problem 1: Local features can be ambiguous
Solution 1: Context can disambiguate local features

Effect of context on object detection
[Figure: car and pedestrian. Identical local image features!]

Even high-resolution images can be locally ambiguous
Images by Antonio Torralba
[Figure: isolated object vs. object in context. (Courtesy of Fredo Durand and William Freeman. Used with permission.)]

Problem 2: search space is HUGE
“Like finding needles in a haystack”
• Need to search over x, y locations and scales s
  – Slow (many patches to examine)
  – Error prone (classifier must have very low false positive rate)
• 10,000 patches/object/image x 1,000,000 images/day
• Plus, we want to do this for ~ 1000 objects

Solution 2: context can provide a prior on what to look for, and where to look for it
[Figure: image region marked "People most likely here"; bar chart from 0.0 to 1.0 for cars, pedestrian, computer, desk. Computers/desks unlikely outdoors.]
Torralba, IJCV 2003

Talk outline
• Context-based vision
• Feature-based object detection
• Graphical model to combine both sources

Context-based vision
• Measure overall scene context or “gist”
• Use that scene context for:
  – Location identification
  – Location categorization
  – Top-down info for object recognition
• Combine with bottom-up object detection
• Future focus: training set acquisition.

Contextual machine-vision system: the “Visual Gist” system
• Low-dimensional representation of overall scene:
  – Gabor-filter outputs at multiple scales, orientations, locations
  – Dimensionality reduction via PCA

Feature vector for an image: the “gist” of the scene
  – Compute 12 x 30 = 360 dim. feature vector
  – Or use steerable filter bank, 6 orientations, 4 scales, averaged over 4x4 regions = 384 dim. feature vector
  – Reduce to ~ 80 dimensions using PCA
Oliva & Torralba, IJCV 2001

Low-dimensional representation for image context
[Figure: images and their 80-dimensional representation]

Hardware set-up
• Computer: Sony laptop
• Wearable system
  – Gives immediate feedback to the user
  – Must handle general camera view
  – Capable of wireless link for audience display
Designed for utility, not fashion…

Our mobile rig, version 1
[Photo: Kevin Murphy. (Courtesy of Kevin Murphy. Used with permission.)]

Our mobile rig, version 2
[Photo: Antonio Torralba. (Courtesy of Antonio Torralba. Used with permission.)]

Experiments
• Train:
  – Rooms and halls on 9th floor of 200 Tech. Square
  – Outdoors
• Test:
  – Interior of 200 Tech. Square, 9th floor (seen in training)
  – Interior of 400 Tech. Square (unseen)
  – Outdoors (unseen places)
• Goals:
  – Identify previously seen locations
  – Identify category of previously unseen locations

Location recognition for mobile vision system
[Figure: ground truth vs. system estimate for specific location, location category, and indoor/outdoor]

Classifying isolated scenes can be hard
[Figure: corridors and offices, showing correct recognitions and misses]

Scene recognition over time
[Figure: graphical model over $C_{t-1}$, $C_t$, object nodes O, P and feature nodes v, including $v^G$]
• $P(C_t \mid C_{t-1})$ is a transition matrix, $P(v^G \mid C)$ is a mixture of Gaussians
• Cf. topological localization in robotics
Torralba, Murphy, Freeman, Rubin, ICCV 2003

Benefit of using temporal integration
Place recognition demo
• $p(q_t \mid v_t^G)$: instantaneous detection
• $P(q_t \mid v_{1:t}^G)$: using HMM over time

Categorization of new places
[Figure: per-frame estimates of specific location, location category, and indoor/outdoor]

Top-down information for object detection

Bottom-up object recognition
• Use labelled training set
• Use local features to categorize each object (each view of an object)

Training data
• Hand-annotated 1200 frames of video from a wearable webcam
• Trained detectors for 9 types of objects: bookshelf, desk, screen (frontal), steps, building facade, etc.
• 100-200 positive patches, > 10,000 negative patches

Feature vector for a patch: step 1
• Convolve with a bank of 12 filters: Gaussian derivatives, Laplacian, corner, long edges

Feature vector for a patch: step 2
• Exponentiate: γ = 2 (variance) or 4 (4th moment)
• Kurtosis: useful for texture analysis

Feature vector for a patch: step 3
• Multiply (.*) by a mask from a dictionary of 30 spatial masks
• Mask characterizes shape of filter response

Feature vector for a patch: step 4
• Average response (example value in figure: 57.3)

Summary: Features
• 12 x 30 x 2 = 720 features.
• k’th feature of i’th patch, with $I_i$ the i’th patch, $g_k$ drawn from the dictionary of 12 filters and $w_k$ from the dictionary of 30 masks:
  $f_i(k) = \sum_x w_k(x)\,\lvert I_i(x) * g_k(x) \rvert^{\gamma_k}$
• Special cases include:
  – $g_k$ = delta function, $w_k$ = Haar wavelets (Viola & Jones, Poggio et al.). Rectangular masks support integral image trick for fast computation
  – $f_i(\gamma = 4) / f_i(\gamma = 2)$ gives kurtosis for texture analysis
  – $w_k$ mask to capture spatial arrangement of parts

Classifier: boosted features
• Output is a weighted sum of weak classifiers, $\sum_t \alpha_t h_t(f)$, where
  – f = feature vector for patch
  – $h_t(f)$ = output of weak classifier at round t
  – $\alpha_t$ = weight assigned by boosting
• Weak learners are single features: $h_t(f)$ picks best feature and threshold
• ~500 rounds of boosting
• ~200 positive patches, ~ 10,000 negative patches
• No cascade (yet)
Viola & Jones, IJCV 2001
Boosting demo

Examples of learned features

Example detections
[Figure: desk, screen]

Example detections
[Figure: desk, screen, bookshelf]

Bottom-up detection: ROC curves

Probabilistic models: graphical models
• Build up complex models from simple components describing conditional independence assumptions.
• combine evidence from different parts of the
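
The temporal integration described in the "Scene recognition over time" and "Benefit of using temporal integration" slides can be illustrated with a standard HMM forward filter. The sketch below is only an illustration under assumptions: the function name `forward_filter`, the two-place toy transition matrix, and the per-frame likelihood values are all hypothetical, and the likelihoods are passed in directly rather than computed from a mixture of Gaussians over gist features as in the talk.

```python
def forward_filter(prior, transition, likelihoods):
    """Recursively compute P(q_t | v_1:t) for each frame t.

    prior       -- belief over places before the first frame
    transition  -- transition[i][j] = P(q_t = j | q_{t-1} = i)
    likelihoods -- likelihoods[t][j] = P(v_t | q_t = j), one row per frame
    """
    n = len(prior)
    belief = list(prior)
    history = []
    for frame_lik in likelihoods:
        # Predict: push the previous belief through the transition matrix.
        predicted = [sum(belief[i] * transition[i][j] for i in range(n))
                     for j in range(n)]
        # Update: weight by the evidence from the current frame, then normalize.
        unnormalized = [predicted[j] * frame_lik[j] for j in range(n)]
        total = sum(unnormalized)
        belief = [u / total for u in unnormalized]
        history.append(belief)
    return history


# Toy example (hypothetical numbers): two places with "sticky" transitions.
toy_prior = [0.5, 0.5]
toy_transition = [[0.9, 0.1],
                  [0.1, 0.9]]
# Frame 1 favors place 0, frame 2 is ambiguous, frame 3 weakly favors place 1.
toy_likelihoods = [[0.8, 0.2], [0.5, 0.5], [0.4, 0.6]]

for t, b in enumerate(forward_filter(toy_prior, toy_transition, toy_likelihoods), 1):
    print(f"frame {t}: filtered belief = {[round(p, 3) for p in b]}")
```

In this toy run the third frame, taken alone, weakly favors place 1, but the filtered belief still favors place 0 because the transition matrix makes sudden jumps between places unlikely. That is the kind of smoothing the instantaneous versus HMM comparison in the demo refers to.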
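
The four "Feature vector for a patch" steps (convolve, exponentiate, multiply by a mask, average) and the 12 x 30 x 2 = 720 count from the "Summary: Features" slide can be sketched as below. This is a minimal pure-Python illustration, not the authors' implementation: it assumes each feature is the sum over pixels of mask times |filter response|^γ, the helper names `convolve_same` and `patch_features` are hypothetical, and the toy filters and masks merely stand in for the talk's dictionaries of 12 filters and 30 spatial masks. A practical version would use an array library for the convolution.

```python
def convolve_same(patch, kernel):
    """2-D convolution with zero padding; output has the same size as the patch."""
    rows, cols = len(patch), len(patch[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    off_r, off_c = k_rows // 2, k_cols // 2
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = 0.0
            for i in range(k_rows):
                for j in range(k_cols):
                    rr, cc = r - (i - off_r), c - (j - off_c)
                    if 0 <= rr < rows and 0 <= cc < cols:
                        total += kernel[i][j] * patch[rr][cc]
            out[r][c] = total
    return out


def patch_features(patch, filters, masks, exponents=(2, 4)):
    """Return one feature per (filter, exponent, mask) combination."""
    rows, cols = len(patch), len(patch[0])
    features = []
    for g in filters:
        response = convolve_same(patch, g)                 # step 1: convolve
        for gamma in exponents:
            powered = [[abs(v) ** gamma for v in row]      # step 2: exponentiate
                       for row in response]
            for w in masks:
                # steps 3 and 4: weight by the spatial mask and sum,
                # which is a weighted average when the mask sums to 1
                features.append(sum(w[r][c] * powered[r][c]
                                    for r in range(rows)
                                    for c in range(cols)))
    return features


# Toy inputs (hypothetical): a 4x4 patch, 2 filters, 2 masks.
patch = [[1.0, 2.0, 3.0, 4.0],
         [2.0, 3.0, 4.0, 5.0],
         [3.0, 4.0, 5.0, 6.0],
         [4.0, 5.0, 6.0, 7.0]]
delta = [[1.0]]                      # delta function: response equals the patch
horizontal_diff = [[-1.0, 0.0, 1.0]]  # simple derivative-like filter
uniform_mask = [[1.0 / 16] * 4 for _ in range(4)]
left_half_mask = [[1.0 / 8, 1.0 / 8, 0.0, 0.0] for _ in range(4)]

toy = patch_features(patch, [delta, horizontal_diff], [uniform_mask, left_half_mask])
print(len(toy), "features from 2 filters x 2 masks x 2 exponents")
```

With dictionaries of 12 filters and 30 masks and the two exponents, the same loop yields 12 x 30 x 2 = 720 features per patch, matching the count on the summary slide.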