Vision as Bayesian Inference. Supplement to Lecture 1
Alan Yuille
Department of Statistics, UCLA
Los Angeles, CA
[email protected]
NOT FOR DISTRIBUTION.

1. Introduction

What is vision? An image I is generated by the state of the world W. You need to estimate W from I.

Vision is very, very hard. Humans are "vision machines". The large size of our cortex distinguishes us from other animals (except close relatives like monkeys). Roughly half the cortex is involved in vision (maybe sixty percent for monkeys). It is possible that the size of the cortex developed because of the evolutionary need to see and interpret the world, so intelligence might be a parasite of vision. Most animals have poor vision (e.g., cats and cobras can only see movement) and often rely on other senses (e.g., smell). Animals may also use vision, and other senses, only for basic tasks such as detecting predators/prey and for navigation. The ability to interpret the entire visual scene may be unique to humans and seems to develop comparatively late (there are claims that 18-year-old adults are still learning scene perception).

Why is vision difficult? The world W is complex and ambiguous, see figure (1): the world consists of many objects (about 20,000, based on estimates made by counting the names of objects in dictionaries) and plenty of "stuff" (texture, vegetation, etc.). The mapping from the world W to the image I is very complex: the image is generated by light rays bouncing off objects and reaching the eye/camera. Images are like an encoding of the world, but it is an encoding that is not designed for communication (unlike speech, Morse code, or telephones), see figure (2). Decoding an image I to determine the state of the world W that caused it is extremely difficult. The difficulty was first appreciated when AI workers started trying to design computer vision systems (originally thinking it would only take a summer).

Another way to understand the complexity of images is by looking at how many images there can be. A typical image is 1,024 × 1,024 pixels. Each pixel can take one of 256 values (0-255). This gives 256^{1,024 × 1,024} possible images, which is enormous (much, much bigger than the number of atoms in the Universe). Even if you only consider 10 × 10 images, you find there are more of them than can have been seen by all humans over evolutionary history (allowing 40 years of life, 30 images a second, and other plausible assumptions). So the space of images is really enormous.
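As a sanity check on these counts, here is a minimal back-of-the-envelope sketch in Python (not part of the original notes). The 40 years of life and 30 images a second are the figures quoted above; the total number of humans over evolutionary history is an assumed round figure, since the text does not give one.

```python
import math

# Number of distinct images = (values per pixel) ** (number of pixels).
# Work in log10, since the integer itself is astronomically large.
values_per_pixel = 256            # 8-bit grey levels, as in the text
pixels_full = 1024 * 1024         # a "typical" 1,024 x 1,024 image
log10_full = pixels_full * math.log10(values_per_pixel)
print(f"1024x1024 images: roughly 10^{log10_full:,.0f}")

# For comparison: about 10^80 atoms in the observable Universe.
print("atoms in the Universe: roughly 10^80")

# Even 10x10 images outnumber every image humans have ever seen.
pixels_small = 10 * 10
log10_small = pixels_small * math.log10(values_per_pixel)

seconds_of_life = 40 * 365 * 24 * 3600   # 40 years of life, as in the text
images_per_second = 30                   # 30 images a second, as in the text
humans_ever = 1e11                       # assumed round figure, NOT from the text
images_seen = humans_ever * seconds_of_life * images_per_second
print(f"10x10 images: roughly 10^{log10_small:.0f}")
print(f"images seen by all humans ever: roughly 10^{math.log10(images_seen):.0f}")
```

Even with these generous assumptions, the roughly 10^22 images humans could ever have seen is negligible next to the roughly 10^241 possible 10 × 10 images, which is the point of the comparison.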
If vision is so hard, then how is it possible to see? Several people (Gibson, Marr) have proposed "ecological constraints" and "natural constraints": the nature of the world provides constraints which reduce the ambiguities of images (e.g., most surfaces are smooth, most objects move rigidly). More recently, due to the growing availability of datasets (some with ground truth), it has become possible to determine statistical regularities and statistical constraints (as will be described in this course). In short, there must be a lot of structure and regularity in images I, the world W, and their relationship which can be exploited in order to make vision possible.

But how can we learn these structures/regularities? How can infants learn them and develop the sophisticated human visual system? How can researchers learn them and build working computer vision systems? It seems incredibly difficult, if not impossible, to learn a full vision system in one go. Instead, it seems that the only hope is to proceed incrementally, learning the simpler parts of vision first and then proceeding to the more complex parts. The study of infants (e.g., Kellman) suggests that visual abilities develop in an orchestrated manner, with certain abilities developing within particular time periods; in particular, infants are able to perform motion tasks before they can deal with static images. Hence vision may be developed as a series of modules, where easily learnable modules may be used as prerequisites to train more complex modules.

Figure 1. The Complexity and Ambiguity of Images. (A) The two images are of the same object (Dan Kersten's canoe), but the intensity profiles below (plots of the image intensity as a function of position) are very different. It would be very hard to look at these images (represented as plots) and determine that there is a canoe in each. (B) The face appears identical in the two bottom images, but the top two images show that one face is normal and the other is very distorted (more on this in the Lighting chapter). (C) Images of certain objects (particularly specular ones, like polished metal) depend very strongly on the illumination conditions.

Figure 2. Information Decoding. In standard decoding (e.g., telephone, Morse code) the information is encoded and decoded by a system that knows the encoder. The encoding is designed to make transmitting and decoding signals efficient. In vision, the encoding is performed by the physics of the world and hence the decoding is considerably harder.

But why is this incremental/modular strategy possible? Why is it possible to learn vision incrementally? (You cannot learn to ride a bicycle, or to parachute, incrementally.) Stuart Geman suggests this is because the structure of the world, and of images, is compositional (i.e., is built out of elementary parts). He provocatively suggests that if the world is not compositional then God exists! What he means is that if the world/images are compositional (i.e., can be built up from elementary parts) then it is possible to imagine that an infant could learn these models. But if the world is non-compositional, then it seems impossible to understand how an infant could learn at all (the idea of compositionality will be clarified as the course develops). Less religiously, Chomsky in the 1950s proposed that language grammars are innate (i.e., specified in the genes) because of the apparent difficulty of learning them (it is known that children develop grammars even if they are not formally taught, e.g., children of slaves). But this raises the question of how genes could "learn the grammars" in the first place. Others (e.g., Hinton) have questioned whether genes can

Figure 3. Sinha figure. The likelihood term P(I|W) constrains the interpretation of I to scenes/objects which are consistent with it. But there remain many possibilities. The prior P(W) is needed to give a unique interpretation by biasing towards those W which are most likely.
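To make the likelihood/prior combination in the Figure 3 caption concrete, here is a minimal sketch assuming a toy discrete world (the candidate states and probability values are invented for illustration, not taken from the notes): the posterior P(W|I) ∝ P(I|W)P(W) combines what the image supports with what is a priori plausible.

```python
# Toy illustration of vision as Bayesian inference: estimate the world state W
# from an observed image I via P(W | I) proportional to P(I | W) * P(W).
# The candidate states and all probability values are invented for illustration.

candidate_worlds = ["canoe", "log", "shadow"]

# Likelihood P(I | W): how well each candidate world state explains the image.
# Two states explain it about equally well -- the ambiguity of images.
likelihood = {"canoe": 0.45, "log": 0.45, "shadow": 0.10}

# Prior P(W): which world states are most likely before seeing the image.
prior = {"canoe": 0.60, "log": 0.10, "shadow": 0.30}

# Posterior P(W | I) = P(I | W) P(W) / P(I), where P(I) normalizes.
unnormalized = {w: likelihood[w] * prior[w] for w in candidate_worlds}
evidence = sum(unnormalized.values())
posterior = {w: unnormalized[w] / evidence for w in candidate_worlds}

for w in candidate_worlds:
    print(f"P({w} | I) = {posterior[w]:.3f}")

# The MAP (maximum a posteriori) estimate is the chosen interpretation.
map_estimate = max(posterior, key=posterior.get)
print("MAP interpretation:", map_estimate)
```

Here the likelihood alone cannot decide between "canoe" and "log"; the prior supplies the extra constraint and yields a unique (MAP) interpretation, which is the role the Figure 3 caption assigns to P(W).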

