Using the Forest to see the Trees: A computational model relating features, objects and scenes
Antonio Torralba, CSAIL-MIT
Joint work with Aude Oliva, Kevin Murphy, William Freeman, Monica Castelhano, John Henderson

From objects to scenes
Object localization from local features.
[Graphical model: image I, local features L, objects O1, O2, scene type S ∈ {street, office, …}]
Riesenhuber & Poggio (99); Vidal-Naquet & Ullman (03); Serre & Poggio (05); Agarwal & Roth (02); Moghaddam & Pentland (97); Turk & Pentland (91); Heisele et al. (01); Krempp, Geman & Amit (02); Dorko & Schmid (03); Fergus, Perona & Zisserman (03); Fei-Fei, Fergus & Perona (03); Schneiderman & Kanade (00); Lowe (99)

From scenes to objects
Object localization guided by the scene.
[Graphical model: image I, local features L, global gist features G, objects O1, O2, scene type S ∈ {street, office, …}]

The context challenge
What do you think are the hidden objects? Chance ~ 1/30000.
Biederman et al. 82; Bar & Ullman 93; Palmer 75
Answering this question does not require knowing what the objects look like. It is all about context.
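The two graphical models above differ in direction: inferring the scene from objects, versus priming objects from the scene. The scene-to-objects direction amounts to reweighting local detector evidence by a gist-driven prior, p(O | L, G) ∝ p(L | O) · Σ_S p(O | S) p(S | G). A toy numeric sketch, with every scene, object and probability below invented for illustration:

```python
import numpy as np

# Toy illustration of p(O | L, G) ∝ p(L | O) · Σ_S p(O | S) p(S | G).
# Every number below is invented; the scenes and objects are hypothetical.
scenes = ["street", "office"]
objects = ["car", "mug", "pedestrian"]

p_scene = np.array([0.9, 0.1])          # p(S | G): gist strongly says "street"

p_obj_given_scene = np.array([          # p(O | S): object priors per scene type
    [0.80, 0.05, 0.70],                 # street: cars and pedestrians likely
    [0.05, 0.80, 0.10],                 # office: mugs likely
])

local_evidence = np.array([1.2, 1.1, 0.9])   # p(L | O): weak detector scores

context_prior = p_scene @ p_obj_given_scene  # Σ_S p(O | S) p(S | G)
posterior = local_evidence * context_prior
posterior /= posterior.sum()
print({o: float(p) for o, p in zip(objects, posterior.round(3))})
# {'car': 0.549, 'mug': 0.087, 'pedestrian': 0.364}
```

Even with nearly flat local evidence, the scene prior pulls the posterior strongly toward scene-consistent objects.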
From scenes to objects
[Graphical model: image I, local features L, global gist features G, scene type S ∈ {street, office, …}]

Scene categorization
Office / Corridor / Street
Oliva & Torralba, IJCV'01; Torralba, Murphy, Freeman & Rubin, ICCV'03

Place identification
Office 610, Office 615, Draper street, 59 other places…
Scenes are categories; places are instances.

Supervised learning
Training pairs {Vg, label}: {Vg, Office}, {Vg, Office}, {Vg, Corridor}, {Vg, Street}, … → Classifier
Which feature vector should represent a whole image?

Global features (gist)
First, we propose a set of features that do not encode specific object information.
V = {energy at each orientation and scale} = 6 orientations × 4 scales; PCA reduces |vt| to G, an 80-feature vector.
Oliva & Torralba, IJCV'01; Torralba, Murphy, Freeman & Rubin, ICCV'03

Example visual gists
Global features(I) ~ global features(I')
Cf. "Pyramid Based Texture Analysis/Synthesis", Heeger & Bergen, SIGGRAPH 1995.
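The gist slide above pools oriented filter energies (6 orientations × 4 scales) and compresses them with PCA. A simplified sketch, assuming FFT-domain oriented band-pass filters with 4×4 spatial pooling; the filter constants and pooling grid are illustrative rather than the original implementation, and the PCA step down to 80 features is omitted:

```python
import numpy as np

def gist_features(img, n_scales=4, n_orient=6, grid=4):
    """Mean oriented band-pass energy on a grid x grid spatial partition.

    Simplified stand-in for the gist descriptor: one FFT-domain filter per
    (scale, orientation) pair; constants are illustrative, not the original.
    """
    h, w = img.shape
    fy, fx = np.meshgrid(np.fft.fftfreq(h), np.fft.fftfreq(w), indexing="ij")
    radius = np.hypot(fx, fy) + 1e-8          # avoid log(0) at the DC term
    angle = np.arctan2(fy, fx)
    F = np.fft.fft2(img)

    feats = []
    for s in range(n_scales):
        f0 = 0.25 / (2 ** s)                  # log-spaced center frequencies
        radial = np.exp(-np.log(radius / f0) ** 2 / 0.5)
        for o in range(n_orient):
            theta = np.pi * o / n_orient
            d = np.angle(np.exp(1j * (angle - theta)))   # wrapped difference
            orient = np.exp(-d ** 2 / 0.5)
            energy = np.abs(np.fft.ifft2(F * radial * orient)) ** 2
            # Pool: average energy in each cell of a grid x grid partition.
            feats += [c.mean() for row in np.array_split(energy, grid)
                      for c in np.array_split(row, grid, axis=1)]
    return np.asarray(feats)

v = gist_features(np.random.rand(128, 128))
print(v.shape)   # (384,) = 4 scales x 6 orientations x 16 cells, before PCA
```

Because the features are pooled energies rather than localized detections, two images with similar spatial layout yield similar vectors even when their objects differ, which is exactly the property the "example visual gists" slide illustrates.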
Learning to recognize places
We use annotated sequences for training: Office 610, Corridor 6b, Corridor 6c, Office 617, …
• Hidden states = location (63 values)
• Observations = vGt (80 dimensions)
• Transition matrix encodes the topology of the environment
• Observation model is a mixture of Gaussians centered on prototypes (100 views per place)

Wearable test-bed v1
Wearable test-bed v2
Place/scene recognition demo

From scenes to objects
[Graphical model: image I, local features L, global gist features G, objects O1, O2, scene type S ∈ {street, office, …}]

Global scene features predict object location
Given a new image, its gist vg selects the image regions likely to contain the target.
Training set (cars): {Vg1, X1}, {Vg2, X2}, {Vg3, X3}, {Vg4, X4}, …
The goal of training is to learn the association between the location of the target and the global scene features.
[Results: true vs. estimated vertical (Y) and horizontal (X) location of people]

The layered structure of scenes
p(x), p(x2|x1): in a display with multiple targets present, the location of one target constrains the 'y' coordinate of the remaining targets, but not the 'x' coordinate. Stronger contextual constraints can be obtained using other objects.

Attentional guidance
Saliency models: Koch & Ullman 85; Wolfe 94; Itti, Koch & Niebur 98; Rosenholtz 99
[Model: local features → saliency; global features → scene prior; combined with an object model and modulated by the TASK]
Torralba, 2003; Oliva, Torralba, Castelhano & Henderson, ICIP 2003
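The location-prediction slides above train on pairs {Vg_i, X_i} to associate global features with target location. A sketch on synthetic data, using plain ridge regression as a stand-in for the regression model of the original work; the data dimensions follow the slides (80-D gist), but all values are generated:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented training data: 200 images, an 80-D gist vector Vg per image, and
# the (normalized) vertical location X of the target in that image.
n, d = 200, 80
Vg = rng.standard_normal((n, d))
w_true = rng.standard_normal(d) / np.sqrt(d)
X = Vg @ w_true + 0.1 * rng.standard_normal(n)   # synthetic gist-location link

# Learn the association {Vg_i, X_i} -> target location with ridge regression.
lam = 1.0
w = np.linalg.solve(Vg.T @ Vg + lam * np.eye(d), Vg.T @ X)

# Predict the likely vertical location of the target in a "new image".
x_new = (rng.standard_normal((1, d)) @ w).item()

# Agreement between predicted and true locations on the training images.
r = np.corrcoef(Vg @ w, X)[0, 1]
print(round(r, 2))   # close to 1: location is predictable from the gist here
```

This also mirrors the slides' asymmetry: in real scenes the gist constrains the vertical coordinate far more than the horizontal one, so the fit for Y would be much tighter than for X.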
Comparison of regions of interest
Saliency predictions vs. saliency combined with global scene priors, for regions covering 10%, 20% and 30% of the image; dots correspond to fixations 1-4.
Torralba, 2003; Oliva, Torralba, Castelhano & Henderson, ICIP 2003

Results
[Plot: % of fixations 1-4 falling inside the region, for scenes with and without people; contextual region vs. saliency region; chance level: 33%]

Task modulation
Mug search vs. painting search: saliency predictions vs. saliency with global scene priors.
Torralba, 2003; Oliva, Torralba, Castelhano & Henderson, ICIP 2003

Discussion
• From the computational perspective, scene context can be derived from global image properties and predicts where objects are most likely to be.
• Scene context considerably improves predictions of fixation locations. A complete model of attention guidance in natural scenes requires both saliency and contextual information.
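The task-modulation slides show the same bottom-up saliency combined with a different contextual prior per search task, so the attended region moves with the task. A closing toy sketch; the saliency map, the Gaussian row-band priors, their locations, and the top-20% threshold are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
h, w = 64, 64
ys = np.arange(h)[:, None] * np.ones((1, w))   # row index of every pixel

saliency = rng.random((h, w))      # fixed bottom-up saliency map (invented)

def task_prior(center_row, sigma=5.0):
    """Hypothetical contextual prior p(X | G, task): a band of likely rows."""
    return np.exp(-0.5 * ((ys - center_row) / sigma) ** 2)

def region_of_interest(prior, frac=0.2):
    combined = saliency * prior            # saliency modulated by the prior
    return combined >= np.quantile(combined, 1 - frac)

mug_roi = region_of_interest(task_prior(45))       # mugs: low (desk height)
painting_roi = region_of_interest(task_prior(15))  # paintings: high (walls)

overlap = (mug_roi & painting_roi).mean()
print(float(overlap))   # near zero: the task moves the attended region
```

Swapping the prior while holding saliency fixed is the point of the slides: the bottom-up map alone cannot explain why mug search and painting search produce different fixation patterns in the same image.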