MIT 9 459 - Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes


Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes

Kevin Murphy, Antonio Torralba, William T. Freeman

Abstract

Standard approaches to object detection focus on local patches of the image and try to classify them as background or not. We propose to use the scene context (the image as a whole) as an extra source of global information, to help resolve local ambiguities. We present a conditional random field for jointly solving the tasks of object detection and scene classification.

1 Introduction

Standard approaches to object detection (e.g., [24, 15]) usually look at local pieces of the image in isolation when deciding whether the object is present at a particular location/scale. However, this approach may fail if the image is of low quality (e.g., [23]), the object is too small, the object is partly occluded, etc. In this paper we propose to use the image as a whole as an extra global feature, to help overcome local ambiguities. There is some psychological evidence that people perform rapid global scene analysis before conducting more detailed local object analysis [4, 2].

The key computational question is how to represent the whole image in a compact yet informative form. Torralba [21] suggests a representation, called the "gist" of the image, based on PCA of a set of spatially averaged filter-bank outputs. The gist acts as a holistic, low-dimensional representation of the whole image. Torralba shows that this is sufficient to provide a useful prior for what types of objects may appear in the image, and at which locations/scales. We extend [21] by combining the prior suggested by the gist with the outputs of bottom-up, local object detectors, which are trained using boosting (see Section 2). Note that this is quite different from approaches that use joint spatial constraints between the locations of objects, such as [11, 20, 19, 8].
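The gist construction described above (filter-bank responses, spatial averaging, PCA projection) can be sketched as follows. This is an illustrative reading, not the paper's implementation: the three filters, the 4x4 averaging grid, and the function names are stand-ins (the paper uses 13 zero-mean filters and its own PCA basis).

```python
import numpy as np

def conv2_valid(img, k):
    """Naive 'valid' 2D convolution, adequate for small kernels."""
    kh, kw = k.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for dy in range(kh):
        for dx in range(kw):
            out += k[dy, dx] * img[dy:dy + h - kh + 1, dx:dx + w - kw + 1]
    return out

def gist_descriptor(img, pca_basis=None, grid=4):
    """Spatially averaged filter-bank energies, optionally PCA-projected.

    Illustrative filter bank: horizontal/vertical edge filters and a
    Laplacian (stand-ins for the paper's 13 zero-mean filters).
    """
    filters = [
        np.array([[-1.0, 0.0, 1.0]]),                             # horizontal edge
        np.array([[-1.0], [0.0], [1.0]]),                         # vertical edge
        np.array([[0, -1, 0], [-1, 4, -1], [0, -1, 0]], float),   # Laplacian
    ]
    feats = []
    for k in filters:
        energy = np.abs(conv2_valid(img, k))
        # Average the rectified response over a coarse grid x grid layout,
        # preserving rough spatial structure while discarding detail.
        for rows in np.array_split(energy, grid, axis=0):
            for cell in np.array_split(rows, grid, axis=1):
                feats.append(cell.mean())
    v = np.asarray(feats)
    # Project onto a low-dimensional PCA basis (rows = components) if given.
    return v if pca_basis is None else pca_basis @ v
```

With a real PCA basis learned from training images, the resulting vector is the low-dimensional "gist" used as a prior over object types and locations.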
In our case, the spatial constraints come from the image as a whole, not from other objects. This is computationally much simpler (see Section 3).

Another task of interest is detecting whether the object is present anywhere in the image, regardless of location. (This can be useful for object-based image retrieval.) In principle, this is straightforward: we declare the object present iff the detector fires (at least once) at any location/scale. However, this means that a single false positive at the patch level can cause a 100% error rate at the image level. As we will see in Section 4, even very good detectors can perform poorly at this task. The gist, however, is able to perform quite well at suggesting the presence of types of objects, without using a detector at all. In fact, we can use the gist to decide whether it is even "worth" running a detector, although we do not explore this here.

Often, the presence of certain types of objects is correlated, e.g., if you see a keyboard, you expect to see a screen. Rather than model this correlation directly, we introduce a hidden common cause/factor, which we call the "scene". In Section 5, we show how we can reliably determine the type of scene (e.g., office, corridor or street) using the gist. Scenes can also be defined in terms of the objects present in the image. Hence we combine the tasks of scene classification and object-presence detection using a tree-structured graphical model: see Section 6. We perform top-down inference (scenes to objects) and bottom-up inference (objects to scenes) in this model. Finally, we conclude in Section 7.
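The image-level rule above (declare the object present iff any patch detector fires) is a simple OR over patches, and it makes the fragility concrete: one confident false positive among thousands of background patches flips the whole image. A minimal sketch, with made-up detector scores:

```python
import numpy as np

def object_present(patch_probs, threshold=0.5):
    """Image-level presence: fire iff ANY patch score exceeds the threshold.

    This is the OR rule from the text; a single patch-level false positive
    above threshold makes the whole image an image-level false positive.
    """
    patch_probs = np.asarray(patch_probs)
    return bool((patch_probs > threshold).any())

# Hypothetical detector outputs for an image that contains NO object:
background_scores = [0.02, 0.10, 0.04, 0.91, 0.03]  # one spurious firing
print(object_present(background_scores))  # prints True: a false positive
```

Raising the threshold trades these image-level false positives against missed detections, which is exactly why the gist-based presence prediction of Section 4 is attractive.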
2 Object detection and localization

For object detection there are at least three families of approaches: parts-based (an object is defined as a specific spatial arrangement of small parts, e.g., [6]); patch-based (we classify each rectangular image region, of a fixed aspect ratio but at multiple sizes, as object or background); and region-based (a region of the image is segmented from the background and described by a set of features that provide texture and shape information, e.g., [5]). Here we use a patch-based approach. For objects with rigid, well-defined shapes (screens, keyboards, people, cars), a patch usually contains the full object and a small portion of the background. For the rest of the objects (desks, bookshelves, buildings), rectangular patches may contain only a piece of the object; in that case, the region covered by a number of patches defines the object, and the object detector will rely mostly on the textural properties of the patch.

The main advantage of the patch-based approach is that object detection can be reduced to a binary classification problem. Specifically, we compute P(O_i^c = 1 | v_i^c) for each class c and patch i (ranging over location and scale), where O_i^c = 1 if patch i contains (part of) an instance of class c, and O_i^c = 0 otherwise; v_i^c is the feature vector (to be described below) for patch i computed for class c.

To detect an object, we slide our detector across the image pyramid and classify all the patches at each location and scale (20% increments of size and every other pixel in location). After performing non-maximal suppression [1], we report as detections all locations for which P(O_i^c = 1 | v_i^c) is above a threshold, chosen to give a desired trade-off between false positives and missed detections.

2.1 Features for objects and scenes

We would like to use the same set of features for detecting a variety of object types, as well as for classifying scenes.
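The scan described above (20% size increments, a stride of two pixels, then greedy suppression of overlapping detections) can be sketched as follows. The patch size, overlap threshold, and greedy IoU-based suppression are illustrative choices, not the paper's exact settings ([1] is only cited, not specified, in the text).

```python
import numpy as np

def pyramid_scales(min_size, image_size, step=1.2):
    """Window sizes in 20% increments, as described in the text."""
    scales, s = [], float(min_size)
    while s <= image_size:
        scales.append(s)
        s *= step
    return scales

def candidate_windows(image_shape, patch=24, step=1.2, stride=2):
    """Slide a square window over every other pixel at each scale."""
    h, w = image_shape
    for size in pyramid_scales(patch, min(h, w), step):
        sz = int(round(size))
        for y in range(0, h - sz + 1, stride):
            for x in range(0, w - sz + 1, stride):
                yield (x, y, sz)

def non_max_suppression(dets, overlap=0.5):
    """Greedy NMS: keep high-scoring windows, drop ones overlapping a keeper.

    dets: list of (score, x, y, size); a simplified stand-in for the
    suppression step cited as [1] in the text.
    """
    keep = []
    for score, x, y, sz in sorted(dets, reverse=True):
        a = np.array([x, y, x + sz, y + sz])
        ok = True
        for _, b in keep:
            ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
            iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
            inter = ix * iy
            union = sz * sz + (b[2] - b[0]) * (b[3] - b[1]) - inter
            if inter / union > overlap:
                ok = False
                break
        if ok:
            keep.append((score, a))
    return [(s, int(b[0]), int(b[1]), int(b[2] - b[0])) for s, b in keep]
```

Each surviving window would then be thresholded on P(O_i^c = 1 | v_i^c) to produce the final detections.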
Hence we will create a large set of features and use a feature selection algorithm (Section 2.2) to select the most discriminative subset. We compute component k of the feature vector for image patch i in three steps, as follows. First we convolve the (monochrome) patch I_i(x) with a filter g_k(x), chosen from the set of 13 (zero-mean) filters shown in Figure 1(a). This set includes oriented edges, a Laplacian filter, corner detectors and long edge detectors. These features can be computed efficiently: the filters used can be obtained by convolution of 1D filters (for instance, the long edge filters are obtained by the convolution of the two filters [-1 0 1]^T and [1 1 1 1 1 1]), or as linear combinations of the other filter outputs (e.g., the first six filters are steerable). We can summarize the response of the patch convolved with the filter, |I_i(x) * g_k(x)|, using a histogram. As shown in [7], we can further summarize this
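The first two feature steps (convolve with a zero-mean filter, rectify, summarize with a histogram) can be sketched directly. The 8-bin normalized histogram is an illustrative choice; the separable long-edge filter is the one named in the text, built as the outer product of [-1 0 1]^T and [1 1 1 1 1 1] (equivalent to convolving the two 1D filters).

```python
import numpy as np

def conv2_valid(img, k):
    """Naive 'valid' 2D convolution for small kernels."""
    kh, kw = k.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for dy in range(kh):
        for dx in range(kw):
            out += k[dy, dx] * img[dy:dy + h - kh + 1, dx:dx + w - kw + 1]
    return out

# Long-edge filter from the text: vertical derivative [-1 0 1]^T convolved
# with the horizontal averager [1 1 1 1 1 1] (outer product of the two).
long_edge = np.outer([-1.0, 0.0, 1.0], np.ones(6))

def filter_histogram(patch, kernel, bins=8):
    """Rectified filter response |I_i(x) * g_k(x)| summarized as a
    normalized histogram -- one family of patch features."""
    resp = np.abs(conv2_valid(patch, kernel))
    hist, _ = np.histogram(resp, bins=bins, range=(0.0, resp.max() + 1e-9))
    return hist / hist.sum()
```

The further summarization mentioned at the end of the excerpt (following [7]) would reduce each histogram to a few statistics, but that step is cut off in this preview.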

