UCLA STATS 238 - Lecture 1

Course: Vision as Bayesian Inference. Lecture 1
Alan Yuille
Department of Statistics, UCLA
Los Angeles, CA
[email protected]

Topics: Vision is Difficult. Probabilistic Approach. Edge Detection Example. Factorized Models.

NOTE: NOT FOR DISTRIBUTION!!

1 Introduction

What is vision? An image I is generated by the state of the world W, and you need to estimate W from I (a toy numerical sketch of this inference appears at the end of this section).

Vision is very, very hard. Humans are "vision machines". The large size of our cortex distinguishes us from other animals (except close relatives like monkeys), and roughly half the cortex is involved in vision (maybe sixty percent for monkeys). It is possible that the size of the cortex developed because of the evolutionary need to see and interpret the world, so intelligence might be a parasite of vision. Most animals have poor vision (e.g., cats and cobras can only see movement) and often rely on other senses (e.g., smell). Animals may also use vision, and other senses, only for basic tasks such as detecting predators/prey and for navigation. The ability to interpret the entire visual scene may be unique to humans, and it seems to develop comparatively late (there are claims that 18-year-olds are still learning scene perception).

Why is vision difficult? The world W is complex and ambiguous, see figure (1): the world consists of many objects (20,000, based on estimates made by counting the names of objects in dictionaries) and plenty of "stuff" (texture, vegetation, etc.). The mapping from the world W to the image I is very complex, since the image is generated by light rays bouncing off objects and reaching the eye/camera. Images are like an encoding of the world, but an encoding that is not designed for communication (unlike speech, Morse code, or telephones), see figure (2). Decoding an image I to determine the state of the world W that caused it is extremely difficult. The difficulty was first appreciated when AI workers started trying to design computer vision systems (originally thinking it would only take a summer).

Figure 1: The Complexity and Ambiguity of Images. (A) The two images are of the same object (Dan Kersten's canoe), but the intensity profiles below them (plots of the image intensity as a function of position) are very different. It would be very hard to look at these images (represented as plots) and determine that there is a canoe in each. (B) The face appears identical in the two bottom images, but the top two images show that one face is normal and the other is very distorted (more on this in the Lighting chapter). (C) Images of certain objects (particularly specular ones, like polished metal) depend very strongly on the illumination conditions.

Figure 2: Information Decoding. In standard decoding (e.g., telephone, Morse code) the information is encoded and decoded by a system that knows the encoder, and the encoding is designed to make transmitting and decoding signals efficient. In vision, the encoding is performed by the physics of the world, and hence the decoding is considerably harder.

Another way to understand the complexity of images is to look at how many images there can be. A typical image is 1,024 × 1,024 pixels, and each pixel can take values 1-255. This gives 255^(1,024 × 1,024) possible images, which is enormous (much, much bigger than the number of atoms in the Universe). Even if you only consider 10 × 10 images, you find there are more of them than could have been seen by all humans over evolutionary history (allowing 40 years for a lifetime, 30 images a second, and other plausible assumptions). So the space of images is really enormous.
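As a rough check on this counting argument, here is a back-of-envelope sketch in Python. The 40-year lifespan and 30 images per second come from the text above; the figure of roughly 10^11 humans having ever lived is an assumption added purely for illustration.

```python
import math

# Back-of-envelope version of the counting argument above.
# The 40-year lifespan and 30 images/second are from the lecture text;
# the ~1e11 humans-ever figure is an assumption added for illustration.

pixels = 10 * 10               # a tiny 10 x 10 image
levels = 255                   # grey-level values 1-255 per pixel
num_images = levels ** pixels  # number of distinct 10 x 10 images

humans_ever = 1e11                          # assumed humans over evolutionary history
seconds_per_life = 40 * 365.25 * 24 * 3600  # 40 years, in seconds
images_seen = humans_ever * seconds_per_life * 30  # at 30 images a second

print(f"distinct 10x10 images: ~10^{math.log10(num_images):.0f}")   # ~10^241
print(f"images ever seen:      ~10^{math.log10(images_seen):.0f}")  # ~10^22
```

Even under these generous assumptions, the number of distinct 10 × 10 images exceeds the number of images all of humanity could ever have seen by more than 200 orders of magnitude.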
If vision is so hard, then how is it possible to see? Several people (Gibson, Marr) have proposed "ecological constraints" and "natural constraints": the nature of the world provides constraints which reduce the ambiguities of images (e.g., most surfaces are smooth, most objects move rigidly). More recently, due to the growing availability of datasets (some with ground truth), it has become possible to determine statistical regularities and statistical constraints (as will be described in this course). In short, there must be a lot of structure and regularity in the images I, the world W, and their relationship, which can be exploited in order to make vision possible.

But how can we learn these structures/regularities? How can infants learn them and develop the sophisticated human visual system? How can researchers learn them and build working computer vision systems? It seems incredibly difficult, if not impossible, to learn a full vision system in one go. Instead, it seems that the only hope is to proceed incrementally, learning the simpler parts of vision first and then moving on to the more complex parts. The study of infants (e.g., Kellman) suggests that visual abilities develop in an orchestrated manner, with certain abilities developing within particular time periods; in particular, infants are able to perform motion tasks before they can deal with static images. Hence vision may develop as a series of modules, where easily learnable modules serve as prerequisites for training more complex modules.

But why is this incremental/modular strategy possible? Why is it possible to learn vision incrementally? (You cannot learn to ride a bicycle, or to parachute, incrementally.) Stuart Geman suggests this is because the structure of the world, and of images, is compositional (i.e., built out of elementary parts). He provocatively suggests that if the world is not compositional, then God exists! What he means is that if the world and images are compositional (i.e., can be built up from elementary parts), then it is possible to imagine that an infant could learn these models; but if the world is non-compositional, then it seems impossible to understand how an infant could learn at all (the idea of compositionality will be clarified as the course develops). Less religiously, Chomsky in the 1950s proposed that language grammars were innate (i.e., specified in the genes) because of the apparent difficulty of learning them (it is known that children develop grammars even if they are not formally taught, e.g., children of slaves). But this raises the question of how the genes could "learn the grammars" in the first place. Others (e.g., Hinton) have questioned whether genes can convey enough information to specify grammars (at least for vision). Recent computational work (Klein and Manning), however, has shown that grammars can in fact be learnt in an unsupervised manner.
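Finally, to make the opening framing concrete (the world state W generates the image I, and vision estimates W from I), here is a minimal toy sketch of the Bayesian formulation the course title refers to: the posterior P(W | I) ∝ P(I | W) P(W). The states, priors, and likelihoods below are made up purely for illustration; the lecture itself develops the probabilistic approach later.

```python
# Toy sketch of vision as Bayesian inference (illustrative numbers only):
# a world state W generates an image I, and we invert this with Bayes' rule,
#   P(W | I)  proportional to  P(I | W) * P(W).

prior = {"canoe": 0.2, "face": 0.5, "polished metal": 0.3}  # made-up prior P(W)

# Made-up likelihoods P(I | W): how probable the observed image patch
# (say, a bright high-contrast glint) is under each world state.
likelihood = {"canoe": 0.1, "face": 0.3, "polished metal": 0.8}

unnormalized = {w: likelihood[w] * prior[w] for w in prior}
Z = sum(unnormalized.values())                  # normalizer P(I)
posterior = {w: p / Z for w, p in unnormalized.items()}

w_map = max(posterior, key=posterior.get)       # MAP estimate of the world state
print(posterior)  # approx. {'canoe': 0.049, 'face': 0.366, 'polished metal': 0.585}
print(w_map)      # polished metal
```

The point is only the shape of the computation: a prior over world states and a generative model of how states produce images combine, via Bayes' rule, into beliefs about the world given an image.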

