MIT 9 459 - TOP-DOWN CONTROL OF VISUAL ATTENTION

TOP-DOWN CONTROL OF VISUAL ATTENTION IN OBJECT DETECTION

ABSTRACT

Current computational models of visual attention focus on bottom-up information and ignore scene context. However, studies in visual cognition show that humans use context to facilitate object detection in natural scenes by directing their attention or eyes to diagnostic regions. Here we propose a model of attention guidance based on global scene configuration. We show that the statistics of low-level features across the scene image determine where a specific object (e.g. a person) should be located. Human eye movements show that regions chosen by the top-down model agree with regions scrutinized by human observers performing a visual search task for people. The results validate the proposition that top-down information from visual context modulates the saliency of image regions during the task of object detection. Contextual information provides a shortcut for efficient object detection systems.

1. INTRODUCTION

While looking for a specific object in a complex and cluttered scene, human observers use visual context information to facilitate the search by directing their attention or eyes to relevant regions of the image (e.g. to the street when searching for cars, or to a table when searching for a plate). This strategy is not considered by current computational models of visual attention [3, 7], which focus on the salient zones of the image independently of the meaning of the scene.

In this paper, we describe a computational model of attention guidance that takes into account the visual context (e.g. the scene) in which objects are embedded [10, 11]. We show that the statistics of low-level features across a natural scene are strongly correlated with the location of a specific object. In the current study, the scheme is tested on the task of locating probable locations of people in scenes, and these selected regions are compared to human eye-movement scan patterns.
Models that integrate attention mechanisms are relevant for computer vision because they can suggest shortcuts for object detection and recognition. These shortcuts can be used to select a set of candidate locations for target objects within an image; computationally more expensive object recognition procedures can then be applied in those regions [5]. In this paper, we propose a simple attentional mechanism that does not use specific information about the appearance of the target. Instead, we use a simple model of image saliency based on the distribution of local features in the image, together with a model of contextual priors (which learns the relationship between context features and the location of the target from past experience), in order to select interesting regions of the image. The paper shows that there could exist pre-attentive heuristics, based on the context within which an object is embedded, that provide a low-cost object detection shortcut by pre-selecting relevant image regions.

2. SALIENCY AND OBJECT DETECTION

In bottom-up models of attention allocation, regions with properties different from those of their neighboring regions are considered more informative and are supposed to attract attention. These models provide a measure of the 'saliency' of each location in the image across various low-level features (contrast, color, orientation, texture, motion; [3, 13]). Saliency measures are interesting in the framework of object detection because, when looking for a target object, frequent features in the image are more likely to belong to the background and are therefore poor predictors of the presence of the target. In saliency models, a saliency map is computed using a hardwired scheme (e.g., [3]): the local image features are processed by center-surround inhibition, and then a winner-take-all strategy is used to select the most salient regions.
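The hardwired center-surround-plus-winner-take-all scheme described above can be illustrated with a minimal sketch. This is not the implementation of [3]; the difference-of-Gaussians surround, the filter widths, and the inhibition radius are illustrative assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def saliency_map(feature_maps, center_sigma=1.0, surround_sigma=5.0):
    """Center-surround response per feature map (difference of Gaussians),
    rectified and summed across features. Sigmas are illustrative."""
    total = np.zeros_like(feature_maps[0], dtype=float)
    for fm in feature_maps:
        center = gaussian_filter(fm.astype(float), center_sigma)
        surround = gaussian_filter(fm.astype(float), surround_sigma)
        total += np.abs(center - surround)  # rectified center-surround contrast
    return total

def winner_take_all(sal, n_winners=3, inhibit_radius=8):
    """Iteratively select the most salient location, then suppress a disc
    around it (inhibition of return) before selecting the next winner."""
    sal = sal.copy()
    winners = []
    yy, xx = np.mgrid[0:sal.shape[0], 0:sal.shape[1]]
    for _ in range(n_winners):
        y, x = np.unravel_index(np.argmax(sal), sal.shape)
        winners.append((int(y), int(x)))
        sal[(yy - y) ** 2 + (xx - x) ** 2 <= inhibit_radius ** 2] = -np.inf
    return winners
```

A location that differs sharply from its neighborhood (e.g. an isolated bright point) yields a large center-surround difference and is picked first by the winner-take-all stage.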
The image features most commonly used for describing local image structure (orientation, scale and texture) are the outputs of multiscale oriented band-pass filters. Here, we decompose each color subband using a steerable pyramid [9] with 4 scales and 4 orientations (fig. 1). Each location has a feature vector v_l(x) = {v_l(x,k)}, k = 1..48, with 48 dimensions (fig. 1).

Fig. 1. Attentional system for object detection integrating local saliency and contextual priors about target location.

Here, we define the saliency in terms of the likelihood of finding a set of local features in the image. We use a probabilistic definition of saliency that, as we show later, fits more naturally with object detection and recognition formulations:

S(x) = p(v_l)^(-1)    (1)

In this definition, the saliency of a location is large when the image features at that location are unexpected in the image. We approximate this probability by fitting a Gaussian to the distribution of local features in the image [8]:

p(v_l) = (2π)^(-N/2) |X|^(-1/2) exp( -(1/2) (v_l - µ)^T X^(-1) (v_l - µ) )    (2)

Although a mixture of Gaussians produces a better fit of the distribution, it did not significantly change the selected salient points. As discussed later (fig. 2), the accuracy of this model in predicting the points fixated by human subjects did not differ from the performance of a more complex saliency-map model [3].

3. CONTEXTUAL OBJECT PRIMING

However, when looking for an object, the use of saliency S(x) as defined in eq. (1) is insufficient to explain human performance or to build interesting object detection procedures. During the first glance at a scene (about 200 msec), the attention of the observer is driven towards a region in the image and the first saccade is programmed. This process is task-dependent. When subjects are asked to search for a specific target object, that object is fixated (and so located) faster when it is consistent with the scene context than when it is inconsistent [2].
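Equations (1) and (2) can be sketched as follows. The sketch assumes the 48-dimensional feature maps have already been computed (the steerable-pyramid decomposition itself is not shown), works with log-likelihoods for numerical stability, and adds a small regularization term to the covariance that is our assumption, not part of the paper:

```python
import numpy as np

def gaussian_saliency(features):
    """Fit a single Gaussian to the distribution of local feature vectors
    across the image (eq. 2) and return the saliency of each location as
    the negative log-likelihood, i.e. log of eq. (1).
    `features` has shape (H, W, N), one N-dim feature vector per location."""
    H, W, N = features.shape
    v = features.reshape(-1, N)
    mu = v.mean(axis=0)
    cov = np.cov(v, rowvar=False) + 1e-6 * np.eye(N)  # regularized covariance X
    inv_cov = np.linalg.inv(cov)
    diff = v - mu
    # Squared Mahalanobis distance (v_l - mu)^T X^{-1} (v_l - mu) per location
    maha = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)
    log_p = -0.5 * (maha + N * np.log(2 * np.pi) + np.linalg.slogdet(cov)[1])
    return (-log_p).reshape(H, W)  # large where p(v_l) is small
```

A location whose feature vector is a statistical outlier with respect to the rest of the image gets a low p(v_l) and hence a high saliency, which is exactly the behavior eq. (1) encodes.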
Human observers clearly use a top-down mechanism to find regions of interest where an object should be located, independent of the presence of the physical features of the object [2, 1].

3.1. Contextual modulation of saliency

The role of the visual context is to provide information about past search experiences in similar environments and about strategies that were successful in finding the target. In a statistical framework, object detection is formulated as the evaluation of the probability function p(o | v_l). This is the

