Inferring Depth from Single Images in Natural Scenes

Byron Boots
Department of Computer Science
Carnegie Mellon University
Pittsburgh, PA
[email protected]

Abstract

The inverse optics problem is one of the oldest and best-known problems in visual perception. Inferring the underlying sources of visual images has no analytic solution. Recent work on brightness, color, and form suggests that visual percepts represent the probable sources of visual stimuli rather than the stimuli as such, pointing toward an empirical theory of visual perception. Here I explore this idea by framing the perception of depth as a machine learning problem. I apply two algorithms with varying levels of model complexity and compare their ability to infer depth, both with each other and with the best previous solutions.

1 Introduction

It has long been recognized that the sources of visual stimuli cannot be uniquely specified by the energy that reaches sensory receptors; the same pattern of light projected onto the retina may arise from different combinations of illumination, reflectance, and transmittance, and from objects of different sizes, at different distances, and in different orientations (Figure 1). Nevertheless, visual agents must respond to real-world events. The inevitably uncertain sources of visual stimuli thus present a quandary: although the physical properties of a stimulus cannot uniquely specify its provenance, success depends on behavioral responses that are appropriate to the stimulus source. This dilemma is referred to as the inverse optics problem.

For more than a century, investigators have surmised that the basis of successful biological vision in the face of the inverse optics problem is the inclusion of prior experience in visual processing, presumably derived from both evolution and individual development. This empirical influence on visual perception, first suggested by George Berkeley in 1709 [1], has been variously considered in terms of Helmholtz's "unconscious inferences" [2], the "organizational principles" advocated by Gestalt psychology [3], and the framework of "ecological optics" developed by Gibson [4]. More recently, these broad interpretations have been bolstered by a wealth of evidence suggesting that many visual percepts can be predicted from the real-world sources to which an animal has always been exposed [5, 6, 7]. In fact, many of the anomalous percepts that humans see in response to simple visual stimuli may be rationalized in this way [6, 7].

In the present work, I explore this empirical approach to visual perception by framing the inverse optics problem as a machine learning problem. Specifically, I ask how the depth to surfaces in natural scenes may be inferred from monocular images. I review previous approaches to the problem and suggest novel alternatives. My results demonstrate the feasibility of solving the inverse problem from two perspectives: a naive linear regression perspective and a more complex graphical modeling perspective.

Figure 1: The inverse optics problem with respect to geometry. Objects of different sizes, at different distances, and in different orientations may project the same image onto the retina.

2 Related Work

2.1 Traditional approaches to computer vision

Geometrical aspects of the inverse optics problem are frequently encountered in computer vision in the form of recovering three-dimensional structure from two-dimensional images. Most work in this area has focused on stereopsis [8], structure from motion [9], or depth from defocus [10], all of which rely on a differential comparison between multiple images. Animals, however, are able to judge spatial geometry from a monocular image, and it is thought that this ability lies at the heart of the perception of geometrical space [11].
Despite this fact, the only well-known method of inferring depth from a single image is the "shape from shading" algorithm [12], a technique that devises models of image formation based on the physics of light interaction and then inverts those models to solve for depth. These inverted models are highly underconstrained, requiring many simplifying assumptions (e.g., Lambertian surface reflectance) that seldom hold in images of natural scenes [14]. Recently, researchers have begun to recover geometrical structure from two-dimensional images empirically, using learning-based techniques.

2.2 Learning-based methods in computer vision

Despite the large body of evidence suggesting the importance of empirical data in vision, there have been surprisingly few attempts to leverage machine learning techniques to infer scene geometry from monocular images. I am aware of only two methods.

Andrew Ng's group at Stanford University uses discriminatively trained Markov random fields to infer depth from monocular images collected from a mobile platform [13]. This approach is quite successful and has the advantage of directly learning depth maps from the statistics of images and their underlying sources. Potentially, such an approach could be tied to vision studies that have similarly used images and depth maps to explain perceptual phenomena [5, 14].

Alyosha Efros' group at Carnegie Mellon University uses a completely different technique, in which subjects hand-label the possible orientations of surfaces in images [15]. Their algorithm learns geometric classes defined by simple orientations, such as sky, ground, and vertical surfaces in a scene. The labels are then used to "cut and fold" the image, producing a simple "pop-up" model of a visual scene.
This method performs surprisingly well for a wide range of images and is visually appealing, but it is highly inaccurate and not directly related to the true statistics of the underlying depths in visual images.

2.3 Filtering in visual inference

Previous attempts at monocular inference have suggested that a variety of cues are essential for judging depth. In particular, convolutional filters such as Laws' masks for texture energy and oriented edge detectors have been used to build complex feature vectors describing local information in image patches [13, 15]. Additionally, the local patches themselves are augmented with information from multi-scale decompositions of the image in order to provide additional scene context [13]. This latter point is extremely important, as local information alone is insufficient to determine depth [6, 7].

Despite their extensive use, linear filters are problematic. Features derived in this
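The Laws'-mask texture-energy features described above can be sketched as follows. This is a minimal illustration: the 1-D Level/Edge/Spot vectors are the standard 5-tap Laws kernels, but the particular mask combinations, the synthetic patch, and the sum-of-absolute-responses energy measure are my own illustrative assumptions, not the exact feature pipeline of [13] or [15].

```python
import numpy as np

# Standard 1-D Laws vectors (5-tap): Level, Edge, Spot.
L5 = np.array([1, 4, 6, 4, 1], dtype=float)
E5 = np.array([-1, -2, 0, 2, 1], dtype=float)
S5 = np.array([-1, 0, 2, 0, -1], dtype=float)

# Outer products of the 1-D vectors give 2-D masks; e.g. the L5/E5
# combination responds to oriented edge structure, S5/S5 to spots.
masks = {
    "L5E5": np.outer(L5, E5),
    "E5L5": np.outer(E5, L5),
    "E5E5": np.outer(E5, E5),
    "S5S5": np.outer(S5, S5),
}

def filter2d(img, kernel):
    """Plain 2-D cross-correlation over the 'valid' region of the image."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def texture_energy(patch, mask):
    """Texture energy: sum of absolute filter responses over the patch."""
    return np.abs(filter2d(patch, mask)).sum()

# Illustrative feature vector for one synthetic image patch; a real
# pipeline would compute this per patch across the whole image.
rng = np.random.default_rng(0)
patch = rng.random((15, 15))
features = np.array([texture_energy(patch, m) for m in masks.values()])
```

Because the Edge and Spot vectors sum to zero, the corresponding masks give (near-)zero energy on uniform regions, which is what makes them useful as local texture descriptors rather than brightness detectors.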