MIT 16.412J - Visual Interpretation using Probabilistic Grammars


Slide 1: Visual Interpretation using Probabilistic Grammars
Paul Robertson

Slide 2: Model-Based Vision
• What do the models look like?
• Where do the models come from?
• How are the models utilized?

Slide 3: The Problem
[Slides 4-6: image-only slides illustrating the problem.]

Slide 7: Optimization/Search Problem
Find the most likely interpretation of the image contents, one that:
1. Identifies the component parts of the image correctly.
2. Identifies the scene type.
3. Identifies the structural relationships between the parts of the image.
This involves segmenting the image into parts, naming the parts, and relating the parts.

Slide 8: Outline
• Overview of statistical methods used in speech recognition and NLP
• Image segmentation and interpretation
– image grammars
– image grammar learning
– algorithms for parsing patchwork images

Slide 9: Not any description – the best
[Figure: two parse trees for "swat flies like ants", a bad parse and a good parse.]

Slide 10: What's similar/different between image analysis and speech recognition/NLP?
• Similar
– An input signal must be processed.
– Segmentation.
– Identification of components.
– Structural understanding.
• Dissimilar
– Text is a valid intermediate goal that separates speech recognition from NLP; line drawings are less obviously useful.
– Structure in images is much richer.

Slide 11: Speech Recognition and NLP
• Little backward flow.
• Stages are done separately.
• Similar techniques work well in each of these phases.
• A parallel view can also be applied to image analysis.
[Figure: pipeline in which speech recognition (segmentation into words) feeds NLP (part-of-speech tagging, then sentence parsing).]

Slide 12: Speech Understanding
• Goal: translate the input signal into a sequence of words.
– Segment the signal into a sequence of samples A = a_1, a_2, ..., a_m.
– Find the best words W = w_1, w_2, ..., w_m that correspond to the samples, based on:
• an acoustic model: signal processing, prototype storage, and a comparator (identification);
• a language model.
• W_opt = arg max_W P(W|A) = arg max_W P(A|W) P(W)
– (since P(W|A) = P(A|W) P(W) / P(A) by Bayes' rule, and P(A) does not depend on W)
• P(A|W) is the acoustic model.
• P(W) is the language model.

Slide 13: Language modeling for speech
• Using the above, P(W) can be represented as an HMM and solved efficiently using the Viterbi algorithm.
• Factor P(W) with the chain rule, then approximate each term with a trigram model:
P(W) = P(w_1, ..., w_n) = ∏_{i=1}^{n} P(w_i | w_1, ..., w_{i-1}) ≈ ∏_{i=1}^{n} P(w_i | w_{i-2}, w_{i-1})
• Smooth sparse trigram counts by interpolating trigram, bigram, and unigram relative frequencies f from the corpus:
P(w_i | w_{i-2}, w_{i-1}) = λ1 f(w_i | w_{i-2}, w_{i-1}) + λ2 f(w_i | w_{i-1}) + λ3 f(w_i), where λ1 + λ2 + λ3 = 1
• Good weights λ1, λ2, and λ3 can be computed using the Baum-Welch algorithm. (A code sketch of this estimate follows slide 19.)

Slide 14: Natural Language Processing
• Part of correctly understanding a sentence comes from correctly parsing it.
• Starting with a word list, parsing involves two separable activities:
– Part-of-speech tagging: find the most probable assignment of parts of speech.
– Parsing the words into a tree: find the most probable parse tree.
[Figure: the bad and good parse trees of "swat flies like ants" again.]

Slides 15-16: Part-of-speech tagging
• Goal: assign a part-of-speech tag to each word in the word sequence.
– Start with the word sequence W = w_1, w_2, ..., w_m.
– Find the best tags T = t_1, t_2, ..., t_m.
• By Bayes' rule, P(t_{1,n} | w_{1,n}) = P(w_{1,n} | t_{1,n}) P(t_{1,n}) / P(w_{1,n}), and P(w_{1,n}) does not depend on the tags.
• With the usual Markov assumptions, P(w_{1,n} | t_{1,n}) = ∏_{i=1}^{n} P(w_i | t_i) and P(t_{1,n}) = ∏_{i=1}^{n} P(t_i | t_{i-1}), so
T_opt = arg max_{t_{1,n}} P(t_{1,n} | w_{1,n}) = arg max_{t_{1,n}} ∏_{i=1}^{n} P(w_i | t_i) P(t_i | t_{i-1})
• T_opt is the path the HMM traverses in producing the output (since the states of the HMM are the tags).
• Use the Viterbi algorithm to find the path. (Sketched in code after slide 19.)

Slide 17: PCFGs
• Better language models lead to better results.
• Considering the grammar, instead of a simple sequence of words, makes the relationships more meaningful.
• A PCFG is a quadruple <W, N, N1, R>:
– W is a set of terminal symbols.
– N is a set of non-terminal symbols.
– N1 is the start symbol.
– R is a set of rules.
• Each rule Ni → RHS has an associated probability P(Ni → RHS), the probability of using this rule to expand Ni.
• The probability of a parse is the product of the probabilities of all the productions used.
• The probability of a sentence is the sum of the probabilities of all its parses.
• Smoothing is necessary for missing rules.

Slide 18: Example PCFG
s → np vp 0.8
s → vp 0.2
np → noun 0.4
np → noun pp 0.4
np → noun np 0.2
vp → verb 0.3
vp → verb np 0.3
vp → verb pp 0.2
vp → verb np pp 0.2
pp → prep np 1.0
prep → like 1.0
verb → swat 0.2
verb → flies 0.4
verb → like 0.4
noun → swat 0.1
noun → flies 0.4
noun → ants 0.5
• Good parse = 0.2 × 0.2 × 0.2 × 0.4 × 0.4 × 1.0 × 1.0 × 0.4 × 0.5 = 0.000256
• Bad parse = 0.8 × 0.2 × 0.4 × 0.1 × 0.4 × 0.3 × 0.4 × 0.4 × 0.5 = 0.00006144
(The scoring sketch after slide 19 reproduces both numbers.)

Slide 19: Why these techniques dominate language research
• Statistical methods work well.
– The best POS taggers perform close to 97% accuracy, compared with human accuracy of 98%.
– The best statistical parsers reach around 88%, versus an estimated 95% for humans.
• Learning from the corpus: the grammar can be learned from a representative corpus.
• Basis for comparison: corpora with ground truth let researchers compare their performance against other published algorithms and models.
• Performance: most of the algorithms are fast at runtime.
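Before the deck turns to images, three short Python sketches make the language-side machinery concrete. First, the interpolated trigram estimate from slide 13. This is a minimal sketch: the λ weights are fixed by hand rather than trained with Baum-Welch as the slide suggests, and the toy corpus and function names are illustrative assumptions.

```python
from collections import Counter

def ngram_counts(words):
    """Unigram, bigram, and trigram counts from a token list."""
    uni = Counter(words)
    bi = Counter(zip(words, words[1:]))
    tri = Counter(zip(words, words[1:], words[2:]))
    return uni, bi, tri

def p_interp(w2, w1, w, uni, bi, tri, lam=(0.6, 0.3, 0.1)):
    """P(w | w2, w1) = l1*f(w|w2,w1) + l2*f(w|w1) + l3*f(w).
    lam is fixed here; slide 13 would train it with Baum-Welch."""
    l1, l2, l3 = lam
    f3 = tri[(w2, w1, w)] / bi[(w2, w1)] if bi[(w2, w1)] else 0.0
    f2 = bi[(w1, w)] / uni[w1] if uni[w1] else 0.0
    f1 = uni[w] / sum(uni.values())
    return l1 * f3 + l2 * f2 + l3 * f1

corpus = "swat flies like ants and flies like ants like swat".split()
uni, bi, tri = ngram_counts(corpus)
print(p_interp("flies", "like", "ants", uni, bi, tri))
```

The interpolation is what keeps an unseen trigram from zeroing out the whole product: even when f3 is 0, the bigram and unigram terms keep the estimate positive.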
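Second, Viterbi decoding of the tag HMM from slides 15-16: the states are tags, trans holds P(t_i | t_{i-1}), and emit holds P(w_i | t_i). The probability tables below are invented toy numbers for the "swat flies like ants" example, not values from the slides.

```python
import math

def viterbi(words, tags, start, trans, emit, floor=1e-12):
    """argmax over tag paths of start(t1)*emit(t1,w1) * prod of
    trans(t_{i-1},t_i)*emit(t_i,w_i), computed in log space."""
    lg = lambda p: math.log(max(p, floor))   # floor avoids log(0) for unseen events
    score = {t: lg(start.get(t, 0)) + lg(emit.get((t, words[0]), 0)) for t in tags}
    back = []
    for w in words[1:]:
        prev, score, ptr = score, {}, {}
        for t in tags:
            best = max(tags, key=lambda p: prev[p] + lg(trans.get((p, t), 0)))
            score[t] = prev[best] + lg(trans.get((best, t), 0)) + lg(emit.get((t, w), 0))
            ptr[t] = best
        back.append(ptr)
    t = max(score, key=score.get)            # best final tag
    path = [t]
    for ptr in reversed(back):               # follow the back-pointers
        t = ptr[t]
        path.append(t)
    return path[::-1]

tags = ["noun", "verb", "prep"]
start = {"noun": 0.5, "verb": 0.4, "prep": 0.1}   # toy numbers, not from the slides
trans = {("verb", "noun"): 0.5, ("noun", "prep"): 0.3, ("prep", "noun"): 0.6,
         ("noun", "noun"): 0.2, ("noun", "verb"): 0.3}
emit = {("verb", "swat"): 0.2, ("verb", "flies"): 0.4, ("verb", "like"): 0.4,
        ("noun", "swat"): 0.1, ("noun", "flies"): 0.4, ("noun", "ants"): 0.5,
        ("prep", "like"): 1.0}
print(viterbi("swat flies like ants".split(), tags, start, trans, emit))
```

On these toy tables the decoder returns ['verb', 'noun', 'prep', 'noun'], the tag path of the good parse.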
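Third, a scorer for the example PCFG of slide 18. The rule table is the slide's grammar; the nested-tuple tree encoding and the helper name are assumed conveniences. Run as-is, it reproduces the slide's two figures, printing (up to float rounding) 0.000256 for the good parse and 0.00006144 for the bad one.

```python
# Rule probabilities from slide 18, keyed by (lhs, rhs-tuple).
RULES = {
    ("s", ("np", "vp")): 0.8, ("s", ("vp",)): 0.2,
    ("np", ("noun",)): 0.4, ("np", ("noun", "pp")): 0.4, ("np", ("noun", "np")): 0.2,
    ("vp", ("verb",)): 0.3, ("vp", ("verb", "np")): 0.3,
    ("vp", ("verb", "pp")): 0.2, ("vp", ("verb", "np", "pp")): 0.2,
    ("pp", ("prep", "np")): 1.0, ("prep", ("like",)): 1.0,
    ("verb", ("swat",)): 0.2, ("verb", ("flies",)): 0.4, ("verb", ("like",)): 0.4,
    ("noun", ("swat",)): 0.1, ("noun", ("flies",)): 0.4, ("noun", ("ants",)): 0.5,
}

def parse_prob(tree):
    """Probability of a parse = product of the probabilities of its productions.
    A tree is (label, child, ...); a child is a sub-tree or a terminal string."""
    label, children = tree[0], tree[1:]
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    p = RULES[(label, rhs)]
    for c in children:
        if isinstance(c, tuple):
            p *= parse_prob(c)
    return p

# Good parse: imperative "swat [flies] [like ants]" -> 0.000256
good = ("s", ("vp", ("verb", "swat"), ("np", ("noun", "flies")),
              ("pp", ("prep", "like"), ("np", ("noun", "ants")))))
# Bad parse: "[swat flies] [like ants]" as subject plus verb phrase -> 0.00006144
bad = ("s", ("np", ("noun", "swat"), ("np", ("noun", "flies"))),
            ("vp", ("verb", "like"), ("np", ("noun", "ants"))))
print(parse_prob(good), parse_prob(bad))
```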
Slide 20: Build Image Descriptions

Slide 21: Patchwork Parsing
• Use semantic segmentation to produce a set of homogeneous regions.
• Based on the contents of the regions and their shapes, hypothesize region contents.
• Region contents are ambiguous in isolation.
– Use contextual information to reduce ambiguity.
• The image must make sense.
– We must be able to produce a parse for it.
• Our interpretation of the image approximates the most probable parse.
– The success of the picture language model determines whether most-probable-parse works.
• Do it (nearly) as well as human experts.

Slide 22:
[Figure: a segmented scene of nine regions r_1 through r_9, labeled Lake, Field, Field, Field, Swamp, Swamp, Town, Road, and River.]

Slides 23-24: Segmented image labeling
• The image contains n regions r_{1,n}.
• Each region r_i has a set of neighbors n_i.
• P(r_{1,n}) is the sum over the disjoint labelings:
P(r_{1,n}) = Σ_{l_{1,n}} P(l_{1,n}, r_{1,n})
• We wish to find the most probable labeling L_{1,n}:
L_{1,n} = arg max_{l_{1,n}} P(l_{1,n} | r_{1,n})
= arg max_{l_{1,n}} ∏_{i=1}^{n} P(l_i, n_i | r_i)
= arg max_{l_{1,n}} ∏_{i=1}^{n} P(l_i | r_i) P(n_i | l_i, r_i)
= arg max_{l_{1,n}} ∏_{i=1}^{n} P(l_i | r_i) P(n_i | l_i)  (dropping the dependence of the neighborhood on r_i)
• P(l_i | r_i) is the optical model.
• P(n_i | l_i) is the picture language model. (Sketched in code after slide 26.)

Slide 25: Segmentation

Slide 26: The …
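To close out the preview, a brute-force rendering of the labeling objective from slide 24: choose the joint labeling that maximizes ∏ P(n_i | l_i) P(l_i | r_i). The slides leave P(n_i | l_i) abstract; here it is instantiated, as an assumption, as a product of pairwise neighbor compatibilities, every table below is an invented toy, and exhaustive search stands in for a real parser.

```python
from itertools import product

# Toy inputs (invented): three regions, their adjacency, candidate labels,
# an "optical model" P(label | region), and a pairwise stand-in for the
# picture language model P(neighborhood | label).
REGIONS = ["r1", "r2", "r3"]
NEIGHBORS = {"r1": ["r2"], "r2": ["r1", "r3"], "r3": ["r2"]}
LABELS = ["lake", "field", "road"]
P_OPTICAL = {  # P(l | r): how well each label explains each region's pixels
    "r1": {"lake": 0.6, "field": 0.3, "road": 0.1},
    "r2": {"lake": 0.2, "field": 0.5, "road": 0.3},
    "r3": {"lake": 0.1, "field": 0.4, "road": 0.5},
}
P_ADJ = {  # P(neighbor label | label): e.g. roads rarely border lakes
    ("lake", "field"): 0.5, ("lake", "lake"): 0.3, ("lake", "road"): 0.2,
    ("field", "field"): 0.4, ("field", "road"): 0.3, ("field", "lake"): 0.3,
    ("road", "field"): 0.5, ("road", "road"): 0.3, ("road", "lake"): 0.2,
}

def labeling_score(assign):
    """prod over regions of P(l_i | r_i) * P(n_i | l_i), with P(n_i | l_i)
    taken as a product of pairwise neighbor compatibilities."""
    p = 1.0
    for r in REGIONS:
        p *= P_OPTICAL[r][assign[r]]
        for nb in NEIGHBORS[r]:
            p *= P_ADJ[(assign[r], assign[nb])]
    return p

# Exhaustive search over joint labelings; real systems would parse instead.
best = max((dict(zip(REGIONS, ls)) for ls in product(LABELS, repeat=len(REGIONS))),
           key=labeling_score)
print(best, labeling_score(best))
```

The exhaustive maximization over joint labelings is exponential in the number of regions, which is exactly why the talk's patchwork-parsing machinery, like Viterbi for the word HMM, matters.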

