HOP: Hierarchical Object Parsing

Iasonas Kokkinos
Laboratoire MAS, Ecole Centrale de Paris
Equipe GALEN, INRIA Saclay - Ile-de-France, Orsay

Alan Yuille
University of California at Los Angeles∗

Abstract

In this paper we consider the problem of object parsing, namely detecting an object and its components by composing them from image observations. Apart from object localization, this involves the question of combining top-down (model-based) with bottom-up (image-based) information. We use a hierarchical object model that recursively decomposes an object into simple structures. Our first contribution is the formulation of composition rules to build the object structures, while addressing problems such as contour fragmentation and missing parts. Our second contribution is an efficient inference method for object parsing that addresses the combinatorial complexity of the problem. For this we exploit our hierarchical object representation to efficiently compute a coarse solution to the problem, which we then use to guide search at a finer level. This rules out a large portion of futile compositions and allows us to parse complex objects in heavily cluttered scenes.

1. Introduction

The problem we address in this work is to parse an object in a scene, as shown in Fig. 1. By parsing we mean detecting an object by composing all of its structures using a sparse representation of the image. This is a challenging task: the object structures can deform and some may be missing, which, combined with a cluttered background that provides numerous tokens, leads to a combinatorial explosion. However, by accurately parsing an object we can not only localize it, but also track it or segment it, without solving each problem from scratch.

For this we use a hierarchical object representation, described in Sec. 3 and shown in Fig. 1(a), which gradually decomposes an elaborate object model into simpler image structures.
Specifically, objects are decomposed into parts, which in turn break into contours, which in the end produce straight edge segments (‘tokens’). The latter are extracted by using the Pb edge detector [19] and line segmentation.

We phrase detection as finding an optimal sequence of compositions that start from these edge tokens and lead in the end to the whole object. This amounts to building a parse tree, as shown in Fig. 1(b). The leaves of this tree (‘terminals’) are edge tokens, and the color-coded nodes correspond to intermediate object structures; closer to the root, more complex structures are formed, involving more parts, while the root of the tree is the whole object structure, shown in Fig. 1(d). In Sec. 4 we explain how we can go from the hierarchical representation in Fig. 1(a) to the parse tree in Fig. 1(b).

Figure 1: Object Parsing Task. Panels: (a) Hierarchical Representation, (b) Parse Tree, (c) Image Tokens, (d) Parsed Object. Labels in the figure: Object, Parts, Contours, Tokens. Our goal is to compose objects using image tokens based on a hierarchical object representation. This amounts to building a parse tree that indicates how image tokens are related to object structures.

∗ Work supported by NSF grants 0413214, 0613563, and 0736015.

There is a huge number of candidate parse trees, most of which will have low likelihood under our probabilistic model. In Sec. 6 we describe how we efficiently find those few parse trees which have high likelihood.

Our first contribution is building efficient and flexible composition rules that deal with problems such as contour fragmentation and missing parts. Our second contribution lies in controlling the combinatorial complexity of the problem, by using a parsing algorithm that exploits our object representation. We propose a ‘structure coarsening’ operation that simplifies the composition of structures and reduces the problem to a simpler one, allowing us to perform coarse-to-fine detection.
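
As an illustration only (this is not the authors' implementation), the four-level decomposition and the resulting parse tree described above can be sketched in Python. All names here (Node, compose, terminals, and the labels "horse", "torso", "limb", "back", "leg") are hypothetical, and the sketch assumes that every structure is composed from children exactly one level below it. It omits the probabilistic model, missing parts, and contour fragmentation.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Levels of the hierarchy, coarse to fine, as named in the text.
LEVELS = ("object", "part", "contour", "token")


@dataclass
class Node:
    """A parse-tree node: an object structure or a terminal edge token."""
    level: str                                   # one of LEVELS
    name: str                                    # hypothetical label
    children: List["Node"] = field(default_factory=list)
    # Only tokens carry geometry: a straight segment (x0, y0, x1, y1).
    segment: Optional[Tuple[float, float, float, float]] = None


def compose(level: str, name: str, children: List[Node]) -> Node:
    """Build a non-token structure from children one level below it."""
    if level not in LEVELS or level == "token":
        raise ValueError(f"cannot compose a structure at level {level!r}")
    expected = LEVELS[LEVELS.index(level) + 1]
    for child in children:
        if child.level != expected:
            raise ValueError(
                f"{level} expects {expected} children, got {child.level}"
            )
    return Node(level, name, list(children))


def terminals(node: Node) -> List[Node]:
    """Return the edge tokens (leaves) a structure was composed from."""
    if node.level == "token":
        return [node]
    return [t for child in node.children for t in terminals(child)]


# Toy example with made-up tokens and labels (assumptions, not paper data).
t1 = Node("token", "t1", segment=(0.0, 0.0, 1.0, 0.0))
t2 = Node("token", "t2", segment=(1.0, 0.0, 2.0, 1.0))
t3 = Node("token", "t3", segment=(5.0, 5.0, 6.0, 5.0))

back = compose("contour", "back", [t1, t2])
leg = compose("contour", "leg", [t3])
torso = compose("part", "torso", [back])
limb = compose("part", "limb", [leg])
horse = compose("object", "horse", [torso, limb])   # root of the parse tree

print([t.name for t in terminals(horse)])
```

Reading the tree from the root down recovers which image tokens support each object structure, which is the relation a parse tree is meant to record.
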
This coarse-to-fine strategy rules out a vast portion of the solution space in a top-down manner and makes feasible the computation of optimal parses for real images with heavy clutter, where plain bottom-up detection can get trapped in futile compositions.

2. Previous Work

Our work builds on the hierarchical compositional approach, e.g. [10, 14, 32, 2], which models complex structures by putting together simpler parts. Although this approach is conceptually appealing, how best to perform inference on hierarchical models is not a solved problem. The authors of [14] perform depth-first search with a restricted representation involving discrete variables, and correct its results using a variant of reranking. Regarding using continuous attributes, as we do in this work, they mention it is ‘potentially unmanageable in a Markov system’. In the work of S. C. Zhu and coworkers [12, 28] proposals are generated from discriminative models such as AdaBoost/RANSAC in a bottom-up manner and are then validated by object/scene models. Instead, we use a single generative model both to suggest object locations during coarse-level search and to validate the detection results at a finer level, in the framework of an integrated optimization algorithm. In more recent work [3] a pruned version of dynamic programming is used to efficiently detect objects at an initial stage and then refine them in a top-down manner. The behavior of this method is hard to predict, while we rely on A*, which has guaranteed optimality properties; further, we provide results on datasets that are more challenging for localization, where the object occupies a small portion (5-10%) of the image domain.

Coming to works that are more closely related to our method at the technical level, in [5] inference with a hierarchical model was presented, but limiting the object representation to a single contour. In [23] the detection of geometric structures, such as salient open and convex curves, was formulated as a parsing problem.
In our work we extend this approach to deal with high-level structures, i.e. objects with many parts and of potentially different types, while applying it in a coarse-to-fine manner instead of using best-first search. The combinatorics of matching have been studied for rigid objects [11], and [20] also used A* for detecting object instances, while in [16] branch-and-bound is used for efficient detection with a bag-of-words model.

Finally, as in recent works for object detection [26, 22, 8, 31] and earlier ones on grouping [18, 11, 13], we use a contour-based object representation. We argue that as contours cover a larger portion of the object than interest


UCLA STATS 238 - A238_ikokkinosCVPR2009
