MIT 9 459 - TOP-DOWN CONTROL OF VISUAL ATTENTION

TOP-DOWN CONTROL OF VISUAL ATTENTION IN OBJECT DETECTION

ABSTRACT

Current computational models of visual attention focus on bottom-up information and ignore scene context. However, studies in visual cognition show that humans use context to facilitate object detection in natural scenes by directing their attention or eyes to diagnostic regions. Here we propose a model of attention guidance based on global scene configuration. We show that the statistics of low-level features across the scene image determine where a specific object (e.g. a person) should be located. Human eye movements show that regions chosen by the top-down model agree with regions scrutinized by human observers performing a visual search task for people. The results validate the proposition that top-down information from visual context modulates the saliency of image regions during the task of object detection. Contextual information provides a shortcut for efficient object detection systems.

1. INTRODUCTION

While looking for a specific object in a complex and cluttered scene, human observers use visual context information to facilitate the search by directing their attention or eyes to relevant regions of the image (e.g. to the street when searching for cars, or to a table when searching for a plate). This strategy is not considered by current computational models of visual attention [3, 7], which focus on the salient zones of the image independently of the meaning of the scene.

In this paper, we describe a computational model of attention guidance that takes into account the visual context (e.g. the scene) in which objects are embedded [10, 11]. We show that the statistics of low-level features across a natural scene are strongly correlated with the location of a specific object. In the current study, the scheme is tested on the task of locating probable locations of people in scenes, and these selected regions are compared to human eye-movement scan patterns.
Models that integrate attention mechanisms are relevant for computer vision because they can suggest shortcuts for object detection and recognition. These shortcuts can be used to select a set of candidate locations for target objects within an image; computationally more expensive object recognition procedures can then be applied in those regions [5]. In this paper, we propose a simple attentional mechanism that does not use specific information about the appearance of the target. Instead, we use a simple model of image saliency based on the distribution of local features in the image, together with a model of contextual priors (which learns the relationship between context features and the location of the target from past experience), in order to select interesting regions of the image. The paper shows that there could exist pre-attentive heuristics, based on the context within which an object is embedded, that provide a low-cost object detection shortcut by pre-selecting relevant image regions.

2. SALIENCY AND OBJECT DETECTION

In bottom-up models of attention allocation, regions with properties different from those of their neighboring regions are considered more informative and are supposed to attract attention. These models provide a measure of the 'saliency' of each location in the image across various low-level features (contrast, color, orientation, texture, motion; [3, 13]). Saliency measures are interesting in the framework of object detection because, when looking for a target object, frequent features in the image are more likely to belong to the background and are therefore poor predictors of the presence of the target. In saliency models, a saliency map is computed using a hardwired scheme (e.g., [3]): the local image features are processed by center-surround inhibition, and then a winner-take-all strategy is used to select the most salient regions.
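The hardwired center-surround-plus-winner-take-all scheme described above can be illustrated with a minimal sketch. This is not the implementation of [3]; the difference-of-Gaussians surround, the filter widths, and the inhibition radius are illustrative assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def saliency_map(feature_maps, center_sigma=1.0, surround_sigma=5.0):
    """Center-surround response per feature map (difference of Gaussians),
    rectified and summed across features. Sigmas are illustrative."""
    total = np.zeros_like(feature_maps[0], dtype=float)
    for fm in feature_maps:
        center = gaussian_filter(fm.astype(float), center_sigma)
        surround = gaussian_filter(fm.astype(float), surround_sigma)
        total += np.abs(center - surround)  # rectified center-surround contrast
    return total

def winner_take_all(sal, n_winners=3, inhibit_radius=8):
    """Iteratively select the most salient location, then suppress a disc
    around it (inhibition of return) before selecting the next winner."""
    sal = sal.copy()
    winners = []
    yy, xx = np.mgrid[0:sal.shape[0], 0:sal.shape[1]]
    for _ in range(n_winners):
        y, x = np.unravel_index(np.argmax(sal), sal.shape)
        winners.append((int(y), int(x)))
        sal[(yy - y) ** 2 + (xx - x) ** 2 <= inhibit_radius ** 2] = -np.inf
    return winners
```

A location that differs sharply from its neighborhood (e.g. an isolated bright point) yields a large center-surround difference and is picked first by the winner-take-all stage.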
The image features most commonly used for describing local image structure (orientation, scale and texture) are the outputs of multiscale oriented band-pass filters. Here, we decompose each color subband using a steerable pyramid [9] with 4 scales and 4 orientations (fig. 1). Each location has a feature vector v_l(x) = {v_l(x,k)}, k = 1..48, with 48 dimensions (fig. 1).

Fig. 1. Attentional system for object detection integrating local saliency and contextual priors about target location.

Here, we define the saliency in terms of the likelihood of finding a set of local features in the image. We use a probabilistic definition of saliency that, as we show later, fits more naturally with object detection and recognition formulations:

S(x) = p(v_l)^(-1)    (1)

In this definition, the saliency of a location is large when the image features at that location are unexpected in the image. We approximate this probability by fitting a Gaussian to the distribution of local features in the image [8]:

p(v_l) = (2π)^(-N/2) |X|^(-1/2) exp( -(1/2) (v_l - µ)^T X^(-1) (v_l - µ) )    (2)

Although a mixture of Gaussians produces a better fit of the distribution, it did not significantly change the selected salient points. As discussed later (fig. 2), the accuracy of this model in predicting the points fixated by human subjects did not differ from the performance of a more complex saliency-map model [3].

3. CONTEXTUAL OBJECT PRIMING

However, when looking for an object, the use of saliency S(x) as defined in eq. (1) is insufficient to explain human performance or to build interesting object detection procedures. During the first glance at a scene (about 200 msec), the attention of the observer is driven towards a region in the image and the first saccade is programmed. This process is task-dependent. When subjects are asked to search for a specific target object, that object is fixated (and so located) faster when it is consistent with the scene context than when it is inconsistent [2].
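Equations (1) and (2) can be sketched as follows. The sketch assumes the 48-dimensional feature maps have already been computed (the steerable-pyramid decomposition itself is not shown), works with log-likelihoods for numerical stability, and adds a small regularization term to the covariance that is our assumption, not part of the paper:

```python
import numpy as np

def gaussian_saliency(features):
    """Fit a single Gaussian to the distribution of local feature vectors
    across the image (eq. 2) and return the saliency of each location as
    the negative log-likelihood, i.e. log of eq. (1).
    `features` has shape (H, W, N), one N-dim feature vector per location."""
    H, W, N = features.shape
    v = features.reshape(-1, N)
    mu = v.mean(axis=0)
    cov = np.cov(v, rowvar=False) + 1e-6 * np.eye(N)  # regularized covariance X
    inv_cov = np.linalg.inv(cov)
    diff = v - mu
    # Squared Mahalanobis distance (v_l - mu)^T X^{-1} (v_l - mu) per location
    maha = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)
    log_p = -0.5 * (maha + N * np.log(2 * np.pi) + np.linalg.slogdet(cov)[1])
    return (-log_p).reshape(H, W)  # large where p(v_l) is small
```

A location whose feature vector is a statistical outlier with respect to the rest of the image gets a low p(v_l) and hence a high saliency, which is exactly the behavior eq. (1) encodes.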
Human observers clearly use a top-down mechanism to find regions of interest where an object should be located, independent of the presence of the physical features of the object [2, 1].

3.1. Contextual modulation of saliency

The role of the visual context is to provide information about past search experiences in similar environments and about strategies that were successful in finding the target. In a statistical framework, object detection is formulated as the evaluation of the probability function p(o | v_l). This is the

