Lecture 20: From language acquisition to language change
Professor Robert C. Berwick

Menu
• Administrivia: project endgame
• Why Bayesian CFG learning goes astray
• How we might repair it: the 'classical' linguistic methodology
• From the logical problem of language acquisition to the illogical problem of language change

"Walking on ice"
[The slide shows candidate parse trees (A)–(H) for "walking on ice".]
(A) is the right structure. Why? Can a stochastic CFG learning algorithm find (A), rather than the other structures? In fact, this turns out to be hard: the SCFG picks (E)! Why? The entropy of (A) turns out to be higher (worse) than that of (E)–(H), so a learner that uses this criterion will go wrong.

Some terminology
• Entropy of a discrete random variable A, denoted H(A): H(A) = –Σ_a pA(a) log2 pA(a)
• Cross-entropy between two distributions pA and pB, a measure of how well distribution pB predicts A; it is at a minimum when the distributions of A and B are identical: H(pA, pB) = –Σ_a pA(a) log2 pB(a)
• So if B is a PCFG and A a corpus, this is a measure of how well the grammar models the corpus

More terminology
• Joint entropy and conditional entropy of two random variables A, B: H(A,B) = –Σ_{a,b} p(a,b) log2 p(a,b), and H(A|B) = H(A,B) – H(B)
• These measure the uncertainty in the joint distribution, and then the uncertainty of A given knowledge of B
• H(A,B) = H(A) + H(B|A) = H(B) + H(A|B)

Terminology
• Mutual information between A and B, a measure of the dependence between the two; it is 0 iff A and B are independent, and otherwise lies between 0 and min(H(A), H(B)): I(A,B) = H(A) + H(B) – H(A,B)
• Note then that H(A,B) = H(A) + H(B|A) = H(A) + H(B) – I(A,B)

Why is the entropy higher for the wrong structure?
Consider a similar example from word strings:
  Theguysawthedog
We can find the 'break' between "the" and "guy". BUT what about:
  Shewaspassedbytherunner
This will yield the incorrect 'word' "edby", because "ed" often precedes "by" in English. Similarly, verbs are often followed by prepositions, because PPs are adjoined to VPs…

Generating this word sequence with a grammar
Now imagine a grammar generating the two-word sequence AB via some rule X → A B. We first generate A, chosen from distribution pA, and then word B, chosen from a distribution conditioned on A. Q: can we get the structure shown on the slide? But in English, verbs and prepositions are more closely coupled, semantically, than prepositions and nouns. So we expect the mutual information between the verb and the preposition to be greater than that between the preposition and the noun, which in turn is greater than that between the verb and the noun: I(V,P) > I(P,N) > I(V,N).

Recall: (E) is preferred to (A)

Why structure (E) is preferred to (A)
The entropy for structure (A) is higher than that for (E).
For structure (A), the entropy is:
  (1) H(V) + H(P) + H(N|P) = H(V) + H(P) + H(N) – I(P,N)
For structure (E), the entropy is:
  (2) H(V) + H(N) + H(P|V) = H(V) + H(N) + H(P) – I(V,P)
Since we assumed I(V,P) > I(P,N), (2) beats (1). (A toy numeric check of this comparison appears at the end of these notes.)

In general, this happens even with simple rule sets and grammars: convergence to a suboptimal grammar
Example ('head' grammar), with 3 'parts of speech' A, B, C, meant to model V, P, N:
  S → AP      S → CP
  AP → A BP   AP → A CP   AP → A
  BP → B CP   BP → AP B   BP → B
  CP → AP C   CP → BP C   CP → C
Assume I(A,B) = 1, I(B,C) = 0.188, I(A,C) = 0, and uniform rule probabilities initially. Sentence: ABC.

Here's what EM picks
EM is 'attracted' to the wrong structure, with a suboptimal entropy (likelihood) value… let's see why.

4 possible parses of "ABC"
Initially, each of these parses has the same posterior probability, because we start with the same rule probabilities, each parse uses the same number of rule expansions, and the bigrams are uniform at the start.
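Here is a minimal Python sketch of that first re-estimation pass, using the standard EM update for a PCFG (expected rule count divided by the expected count of the rule's left-hand side). The grammar, the uniform starting probabilities, and the equal initial parse posteriors are from the slides; the explicit rule lists for the parses labeled I, J, K, L are reconstructed from the rule-occurrence counts quoted in the next slide, so treat that labeling as an assumption.

```python
# A sketch of one EM pass over the four parses of "ABC" (parse rule lists
# reconstructed from the slides; treat the I/J/K/L labels as an assumption).
from collections import defaultdict

rules = ["S -> AP", "S -> CP",
         "AP -> A BP", "AP -> A CP", "AP -> A",
         "BP -> B CP", "BP -> AP B", "BP -> B",
         "CP -> AP C", "CP -> BP C", "CP -> C"]

# Uniform initial probabilities within each left-hand side.
lhs_of = {r: r.split(" -> ")[0] for r in rules}
lhs_count = defaultdict(int)
for r in rules:
    lhs_count[lhs_of[r]] += 1
q = {r: 1 / lhs_count[lhs_of[r]] for r in rules}

# The four parses of "ABC", each given as the list of rules it uses.
parses = {
    "I": ["S -> AP", "AP -> A BP", "BP -> B CP", "CP -> C"],
    "J": ["S -> AP", "AP -> A CP", "CP -> BP C", "BP -> B"],
    "K": ["S -> CP", "CP -> AP C", "AP -> A BP", "BP -> B"],
    "L": ["S -> CP", "CP -> BP C", "BP -> AP B", "AP -> A"],
}

# E-step: the posterior of each parse is proportional to the product of its rule probabilities.
def tree_prob(tree):
    p = 1.0
    for r in tree:
        p *= q[r]
    return p

weights = {name: tree_prob(tree) for name, tree in parses.items()}
z = sum(weights.values())
posterior = {name: w / z for name, w in weights.items()}   # all equal to 1/4 on the first pass

# M-step: expected rule count divided by the expected count of its left-hand side.
exp_count = defaultdict(float)
for name, tree in parses.items():
    for r in tree:
        exp_count[r] += posterior[name]
exp_lhs = defaultdict(float)
for r, c in exp_count.items():
    exp_lhs[lhs_of[r]] += c
q_new = {r: exp_count[r] / exp_lhs[lhs_of[r]] for r in rules}

for r in sorted(q_new, key=q_new.get, reverse=True):
    print(f"{r:12s} {q_new[r]:.2f}")
```

Running this prints AP → A BP, BP → B, and CP → BP C at 0.50 and the remaining AP, BP, and CP rules at 0.25: a rule's re-estimated probability tracks how many of the equally weighted parse trees it occurs in, which is exactly the point made next.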
The estimated probability of a rule after the first pass is directly proportional to how many of these parse trees the rule occurs in. The rules that occur more than once are:
  AP → A BP (parses I and K)
  CP → BP C (parses J and L)
  BP → B (parses J and K)
Parses J and K each contain two of these rules.

Why doesn't EM move to the global optimum (from J to I or L)?
If the system starts at J, why can't it move to K, I, or L? Answer: look at what has to change: three rules must have their nonterminals switched.
        (J)            (L)
  qA:   AP → A BP      AP → A
  qB:   BP → B         BP → AP B
  qC:   CP → AP C      CP → BP C

Update equation for the q rule probabilities
Each q is re-estimated as the expected count of its rule divided by the expected count of expansions of its left-hand side, with the expectation taken over the posterior distribution on parses. Compare parse K, which uses AP → A BP and BP → B, with parse L, which uses AP → A and BP → AP B.

But there's another way to look at search, by Bayesian methods…

We want the best (highest-probability) Gi given the data sequence Sj
  p(Gi | Sj) = p(Gi) × p(Sj | Gi) / p(Sj)
  arg max_Gi p(Gi | Sj) = arg max_Gi p(Gi) × p(Sj | Gi) / p(Sj) = arg max_Gi p(Gi) × p(Sj | Gi)   (since p(Sj) is constant)

And we can compute this!
We just need to 'search' through all the grammars and find the one that maximizes this… can this be done? Horning has a general result for unambiguous CFGs; for a more recent (2011) approach that works with two simple grammar types and child language, see Perfors et al. Note: again, the G's only 'approach' the best G with increasing likelihood.

Another view of this 'maximize the posterior probability' idea
  arg max_Gi p(Gi) × p(Sj | Gi)
This is usually called minimum description length (MDL): we want to find the shortest (smallest) grammar plus the shortest encoding of the data using that grammar. Now assume (1) that p(Gi) ∝ 2^(–|Gi|), so that smaller grammars are more probable, and (2) that, by Shannon's source coding theorem, an optimal encoding of the data Sj with respect to grammar Gi approaches –log2 p(Sj | Gi) bits. Then, after taking –log2, maximizing this posterior probability is equivalent to minimizing |Gi| + |Sj coded by Gi|, which by Shannon's coding theorem is equivalent to minimizing |Gi| – log2 p(Sj | Gi). (A toy computation of this tradeoff follows at the end of these notes.)

• The most restrictive grammar just lists the observed utterances → only the observed data is grammatical, so the data gets a high probability
• A simple grammar could be made that allowed any sentence → the grammar would have a high probability, but the data a very low one
MDL finds a middle ground between always generalizing and never generalizing.

Complexity and Probability
• A more complex grammar → a longer coding length, so a lower probability
• A more restrictive grammar → fewer choices for the data, so each possibility has a higher probability

Minimum description length as a criterion has a long pedigree…
Chomsky, 1949, Morphophonemics of Modern Hebrew. So this MDL criterion was there from the start.
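As promised above, here is a toy computation of the MDL tradeoff, as a Python sketch. The mini-corpus, the three candidate grammars, and the crude measure of grammar length (8 bits per character of the grammar's textual description) are illustrative assumptions, not material from the lecture; the point is only to make the |Gi| – log2 p(Sj | Gi) tradeoff concrete.

```python
# A toy MDL comparison (hypothetical mini-corpus and grammars, not from the lecture).
# Score each grammar by |G| + (-log2 p(S | G)), with |G| measured crudely as
# 8 bits per character of the grammar's textual description.
from math import log2

NOUNS = ["dog", "cat", "bird", "fish"]
VERBS = ["runs", "sleeps", "eats", "swims"]
WORDS = ["the"] + NOUNS + VERBS

# Observed corpus: 15 of the 16 possible "the N V" sentences, each seen 3 times.
corpus = [f"the {n} {v}" for n in NOUNS for v in VERBS if (n, v) != ("fish", "swims")] * 3
observed_types = sorted(set(corpus))

# Three candidate grammars: (textual description, per-sentence probability).
grammars = {
    "memorize: list every observed sentence": (
        "S -> " + " | ".join(observed_types),
        lambda s: 1 / len(observed_types),          # uniform over the listed sentences
    ),
    "anything goes: any three words": (
        "S -> W W W\nW -> " + " | ".join(WORDS),
        lambda s: (1 / len(WORDS)) ** 3,
    ),
    "middle ground: the N V": (
        "S -> the N V\nN -> " + " | ".join(NOUNS) + "\nV -> " + " | ".join(VERBS),
        lambda s: (1 / len(NOUNS)) * (1 / len(VERBS)),
    ),
}

for name, (text, prob) in grammars.items():
    grammar_bits = 8 * len(text)                       # |G|
    data_bits = sum(-log2(prob(s)) for s in corpus)    # -log2 p(S | G)
    print(f"{name:40s} |G| = {grammar_bits:5.0f}  data = {data_bits:6.1f}  total = {grammar_bits + data_bits:7.1f}")

# The memorizing grammar fits the data tightly but pays heavily for its long rule list;
# the anything-goes grammar has short rules but assigns the data low probability;
# the middle-ground grammar gets the lowest total score.
```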


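Finally, the promised numeric check of the earlier comparison between structures (A) and (E), following formulas (1) and (2) above. The joint distribution over (V, P, N) below is invented for illustration; the only property that matters is that the verb and preposition are strongly coupled while the preposition and noun are only weakly coupled, so that I(V,P) > I(P,N).

```python
# Toy check of formulas (1) and (2): which bracketing gives lower entropy?
# The joint distribution below is invented for illustration only.
from collections import defaultdict
from math import log2

# p(v, p, n): V and P strongly coupled, P and N weakly coupled, N independent of V given P.
p_v = {"walk": 0.5, "see": 0.5}
p_p_given_v = {"walk": {"on": 0.9, "at": 0.1}, "see": {"on": 0.1, "at": 0.9}}
p_n_given_p = {"on": {"ice": 0.6, "park": 0.4}, "at": {"ice": 0.4, "park": 0.6}}

joint = {}
for v in p_v:
    for p in p_p_given_v[v]:
        for n in p_n_given_p[p]:
            joint[(v, p, n)] = p_v[v] * p_p_given_v[v][p] * p_n_given_p[p][n]

def H(axes):
    """Entropy (bits) of the marginal over the given positions (0 = V, 1 = P, 2 = N)."""
    marg = defaultdict(float)
    for key, pr in joint.items():
        marg[tuple(key[i] for i in axes)] += pr
    return -sum(pr * log2(pr) for pr in marg.values() if pr > 0)

H_V, H_P, H_N = H([0]), H([1]), H([2])
I_VP = H_V + H_P - H([0, 1])       # mutual information I(V,P)
I_PN = H_P + H_N - H([1, 2])       # mutual information I(P,N)

# (1) bracketing (A), [V [P N]]: generate V, then P, then N given P
cost_A = H_V + H_P + (H([1, 2]) - H_P)        # H(N|P) = H(P,N) - H(P)
# (2) bracketing (E), [[V P] N]: generate V, then P given V, then N
cost_E = H_V + (H([0, 1]) - H_V) + H_N        # H(P|V) = H(V,P) - H(V)

print(f"I(V,P) = {I_VP:.3f}   I(P,N) = {I_PN:.3f}")
print(f"entropy of (A) = {cost_A:.3f}   entropy of (E) = {cost_E:.3f}")
# Because I(V,P) > I(P,N), bracketing (E) has the lower entropy, so an
# entropy-minimizing learner prefers the linguistically wrong structure (E).
```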