MIT 9 459 - Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes


Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes

Kevin Murphy, Antonio Torralba, William T. Freeman

Abstract

Standard approaches to object detection focus on local patches of the image and try to classify them as background or not. We propose to use the scene context (the image as a whole) as an extra source of global information, to help resolve local ambiguities. We present a conditional random field for jointly solving the tasks of object detection and scene classification.

1 Introduction

Standard approaches to object detection (e.g., [24, 15]) usually look at local pieces of the image in isolation when deciding whether the object is present at a particular location/scale. However, this approach may fail if the image is of low quality (e.g., [23]), the object is too small, the object is partly occluded, etc. In this paper we propose to use the image as a whole as an extra global feature, to help overcome local ambiguities. There is some psychological evidence that people perform rapid global scene analysis before conducting more detailed local object analysis [4, 2].

The key computational question is how to represent the whole image in a compact yet informative form. Torralba [21] suggests a representation, called the "gist" of the image, based on PCA of a set of spatially averaged filter-bank outputs. The gist acts as a holistic, low-dimensional representation of the whole image. Torralba shows that this is sufficient to provide a useful prior for what types of objects may appear in the image, and at which locations/scales. We extend [21] by combining the prior suggested by the gist with the outputs of bottom-up, local object detectors, which are trained using boosting (see Section 2). Note that this is quite different from approaches that use joint spatial constraints between the locations of objects, such as [11, 20, 19, 8].
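The gist construction described above (filter-bank responses, spatial averaging, PCA projection) can be sketched as follows. This is an illustrative reading, not the paper's implementation: the three filters, the 4x4 averaging grid, and the function names are stand-ins (the paper uses 13 zero-mean filters and its own PCA basis).

```python
import numpy as np

def conv2_valid(img, k):
    """Naive 'valid' 2D convolution, adequate for small kernels."""
    kh, kw = k.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for dy in range(kh):
        for dx in range(kw):
            out += k[dy, dx] * img[dy:dy + h - kh + 1, dx:dx + w - kw + 1]
    return out

def gist_descriptor(img, pca_basis=None, grid=4):
    """Spatially averaged filter-bank energies, optionally PCA-projected.

    Illustrative filter bank: horizontal/vertical edge filters and a
    Laplacian (stand-ins for the paper's 13 zero-mean filters).
    """
    filters = [
        np.array([[-1.0, 0.0, 1.0]]),                             # horizontal edge
        np.array([[-1.0], [0.0], [1.0]]),                         # vertical edge
        np.array([[0, -1, 0], [-1, 4, -1], [0, -1, 0]], float),   # Laplacian
    ]
    feats = []
    for k in filters:
        energy = np.abs(conv2_valid(img, k))
        # Average the rectified response over a coarse grid x grid layout,
        # preserving rough spatial structure while discarding detail.
        for rows in np.array_split(energy, grid, axis=0):
            for cell in np.array_split(rows, grid, axis=1):
                feats.append(cell.mean())
    v = np.asarray(feats)
    # Project onto a low-dimensional PCA basis (rows = components) if given.
    return v if pca_basis is None else pca_basis @ v
```

With a real PCA basis learned from training images, the resulting vector is the low-dimensional "gist" used as a prior over object types and locations.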
In our case, the spatial constraints come from the image as a whole, not from other objects. This is computationally much simpler (see Section 3).

Another task of interest is detecting whether the object is present anywhere in the image, regardless of location. (This can be useful for object-based image retrieval.) In principle, this is straightforward: we declare the object present iff the detector fires (at least once) at any location/scale. However, this means that a single false positive at the patch level can cause a 100% error rate at the image level. As we will see in Section 4, even very good detectors can perform poorly at this task. The gist, however, is able to perform quite well at suggesting the presence of types of objects, without using a detector at all. In fact, we can use the gist to decide whether it is even "worth" running a detector, although we do not explore this here.

Often, the presence of certain types of objects is correlated, e.g., if you see a keyboard, you expect to see a screen. Rather than model this correlation directly, we introduce a hidden common cause/factor, which we call the "scene". In Section 5, we show how we can reliably determine the type of scene (e.g., office, corridor or street) using the gist. Scenes can also be defined in terms of the objects present in the image. Hence we combine the tasks of scene classification and object-presence detection using a tree-structured graphical model: see Section 6. We perform top-down inference (scenes to objects) and bottom-up inference (objects to scenes) in this model. Finally, we conclude in Section 7.
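The image-level rule above (declare the object present iff any patch detector fires) is a simple OR over patches, and it makes the fragility concrete: one confident false positive among thousands of background patches flips the whole image. A minimal sketch, with made-up detector scores:

```python
import numpy as np

def object_present(patch_probs, threshold=0.5):
    """Image-level presence: fire iff ANY patch score exceeds the threshold.

    This is the OR rule from the text; a single patch-level false positive
    above threshold makes the whole image an image-level false positive.
    """
    patch_probs = np.asarray(patch_probs)
    return bool((patch_probs > threshold).any())

# Hypothetical detector outputs for an image that contains NO object:
background_scores = [0.02, 0.10, 0.04, 0.91, 0.03]  # one spurious firing
print(object_present(background_scores))  # prints True: a false positive
```

Raising the threshold trades these image-level false positives against missed detections, which is exactly why the gist-based presence prediction of Section 4 is attractive.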
2 Object detection and localization

For object detection there are at least three families of approaches: parts-based (an object is defined as a specific spatial arrangement of small parts, e.g., [6]); patch-based (we classify each rectangular image region, of a fixed aspect ratio but at multiple sizes, as object or background); and region-based (a region of the image is segmented from the background and described by a set of features that provide texture and shape information, e.g., [5]). Here we use a patch-based approach. For objects with rigid, well-defined shapes (screens, keyboards, people, cars), a patch usually contains the full object and a small portion of the background. For the rest of the objects (desks, bookshelves, buildings), rectangular patches may contain only a piece of the object; in that case, the region covered by a number of patches defines the object, and the object detector will rely mostly on the textural properties of the patch.

The main advantage of the patch-based approach is that object detection can be reduced to a binary classification problem. Specifically, we compute P(O_i^c = 1 | v_i^c) for each class c and patch i (ranging over location and scale), where O_i^c = 1 if patch i contains (part of) an instance of class c, and O_i^c = 0 otherwise; v_i^c is the feature vector (to be described below) for patch i computed for class c.

To detect an object, we slide our detector across the image pyramid and classify all the patches at each location and scale (20% increments of size and every other pixel in location). After performing non-maximal suppression [1], we report as detections all locations for which P(O_i^c = 1 | v_i^c) is above a threshold, chosen to give a desired trade-off between false positives and missed detections.

2.1 Features for objects and scenes

We would like to use the same set of features for detecting a variety of object types, as well as for classifying scenes.
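The scan described above (20% size increments, a stride of two pixels, then greedy suppression of overlapping detections) can be sketched as follows. The patch size, overlap threshold, and greedy IoU-based suppression are illustrative choices, not the paper's exact settings ([1] is only cited, not specified, in the text).

```python
import numpy as np

def pyramid_scales(min_size, image_size, step=1.2):
    """Window sizes in 20% increments, as described in the text."""
    scales, s = [], float(min_size)
    while s <= image_size:
        scales.append(s)
        s *= step
    return scales

def candidate_windows(image_shape, patch=24, step=1.2, stride=2):
    """Slide a square window over every other pixel at each scale."""
    h, w = image_shape
    for size in pyramid_scales(patch, min(h, w), step):
        sz = int(round(size))
        for y in range(0, h - sz + 1, stride):
            for x in range(0, w - sz + 1, stride):
                yield (x, y, sz)

def non_max_suppression(dets, overlap=0.5):
    """Greedy NMS: keep high-scoring windows, drop ones overlapping a keeper.

    dets: list of (score, x, y, size); a simplified stand-in for the
    suppression step cited as [1] in the text.
    """
    keep = []
    for score, x, y, sz in sorted(dets, reverse=True):
        a = np.array([x, y, x + sz, y + sz])
        ok = True
        for _, b in keep:
            ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
            iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
            inter = ix * iy
            union = sz * sz + (b[2] - b[0]) * (b[3] - b[1]) - inter
            if inter / union > overlap:
                ok = False
                break
        if ok:
            keep.append((score, a))
    return [(s, int(b[0]), int(b[1]), int(b[2] - b[0])) for s, b in keep]
```

Each surviving window would then be thresholded on P(O_i^c = 1 | v_i^c) to produce the final detections.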
Hence we will create a large set of features and use a feature selection algorithm (Section 2.2) to select the most discriminative subset. We compute component k of the feature vector for image patch i in three steps, as follows. First we convolve the (monochrome) patch I_i(x) with a filter g_k(x), chosen from the set of 13 (zero-mean) filters shown in Figure 1(a). This set includes oriented edges, a Laplacian filter, corner detectors and long edge detectors. These features can be computed efficiently: the filters used can be obtained by convolution of 1D filters (for instance, the long edge filters are obtained by the convolution of the two filters [-1 0 1]^T and [1 1 1 1 1 1]), or as linear combinations of the other filter outputs (e.g., the first six filters are steerable). We can summarize the response of the patch convolved with the filter, |I_i(x) * g_k(x)|, using a histogram. As shown in [7], we can further summarize this
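The first two feature steps (convolve with a zero-mean filter, rectify, summarize with a histogram) can be sketched directly. The 8-bin normalized histogram is an illustrative choice; the separable long-edge filter is the one named in the text, built as the outer product of [-1 0 1]^T and [1 1 1 1 1 1] (equivalent to convolving the two 1D filters).

```python
import numpy as np

def conv2_valid(img, k):
    """Naive 'valid' 2D convolution for small kernels."""
    kh, kw = k.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for dy in range(kh):
        for dx in range(kw):
            out += k[dy, dx] * img[dy:dy + h - kh + 1, dx:dx + w - kw + 1]
    return out

# Long-edge filter from the text: vertical derivative [-1 0 1]^T convolved
# with the horizontal averager [1 1 1 1 1 1] (outer product of the two).
long_edge = np.outer([-1.0, 0.0, 1.0], np.ones(6))

def filter_histogram(patch, kernel, bins=8):
    """Rectified filter response |I_i(x) * g_k(x)| summarized as a
    normalized histogram -- one family of patch features."""
    resp = np.abs(conv2_valid(patch, kernel))
    hist, _ = np.histogram(resp, bins=bins, range=(0.0, resp.max() + 1e-9))
    return hist / hist.sum()
```

The further summarization mentioned at the end of the excerpt (following [7]) would reduce each histogram to a few statistics, but that step is cut off in this preview.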

