Lecture 8: Parsing tricks; beyond context-free grammars
Professor Robert C. Berwick
[email protected]
6.863J/9.611J SP11

The Menu Bar
• Administrivia: Lab 3…
• The Earley algorithm
• Some speedup tricks
• How do the actual 'large-scale systems' do?
• Pluses and minuses of context-free grammars

Earley parsing
• Top-down instead of bottom-up
• Uses left context (and optionally right context) to constrain the search, so we waste less time building impossible things
• No restrictions on the form of the grammar: e.g., NP → Det N with John PP is an OK rule, thanks to a clever 'on the fly' transformation to binary-branching rules ("dotted rules")
• Again O(n³) time (but watch out for grammar size!)

Overview of Earley's Algorithm
• Finds constituents and partial constituents in the input
• A → B C • D E is partial: only the first half of the A has been found
[Diagram: completing a D and attaching it advances the item A → B C • D E to A → B C D • E]

Overview of Earley's Algorithm
• Proceeds incrementally, left-to-right
• Before it reads word 5, it has already built all hypotheses that are consistent with the first 4 words
• Reads word 5 & attaches it to immediately preceding hypotheses. Might yield new constituents that are then attached to hypotheses immediately preceding them …
• E.g., attaching D to A → B C • D E gives A → B C D • E
• Attaching E to that gives A → B C D E •
• Now we have a complete A that we can attach to hypotheses immediately preceding the A, etc.

Motivation
• We will carry out a deterministic simulation of a nondeterministic machine that can keep track of 'all the possible next states' Si it can be in after reading each word
• So we imagine a sequence of state sets, S0, S1, …
• Representation as a 'chart' of 'columns'; column i, or chart(i), will contain all of the possible machine states after reading the ith word (NB.
Indexing: we start at 0, before the first word of the sentence)
• Compare this to how we can solve 'fruit flies like a banana' in a nondeterministic finite-state automaton…

FSA analog
'fruit flies like a banana'
[FSA diagram: states 0–6, with arcs labeled fruit (Adj), fruit (Noun), flies (Noun), flies (Verb), like (Verb), like (Adv), a, banana]

Parsing with State Sets
'fruit flies like a banana'
'Scan' each word in turn:
S0 -fruit-> S1 = {1,2} -flies-> S2 = {2,3} -like-> S3 = {3,4} -a-> S4 = {5} -banana-> S5 = {6}

State set construction as sequence of machine states, extending paths…
• Initial: state set S0 = where the machine could be after reading zero words
• Apply all edge operations until closure
[Diagram: S0 → S1 → … → Sn]

State Set machine
• Need to specify the initial state
• Need to specify how to construct State Set Si, given State Set Si-1
• Need to specify the final state (accept or reject)
• For getting from one state to the next, need only calculate how to scan under closure, word by word
• For the linear, or finite-state, case this is all easy
• For the hierarchical, or context-free, case we need a more complex representation of 'state', and of how to get from one state set to the next

Table, or 'chart', as a sequence of State Sets
• Column j holds a State Set
• What is a "state"?
Now we need a bit more info…
[Example chart for "I shot eleph(ant)…":
Column 0: S → • NP VP; NP → • I
Column 1: NP → I •; S → NP • VP; VP → • V NP; V → • shot
Column 2: V → shot •; VP → V • NP; NP → • Det N; NP → • eleph
Column 3: NP → eleph •; VP → V NP •; S → NP VP •]

But not much more info… Earley's algorithm
• Use a table as before, but 1 column per word, as a 'bag' (a 'chart')
• The column entries are called states or items and are represented with dotted rules, plus information about how much of the rule has been found. A dotted rule is of 3 types:
  S → • VP          A VP is predicted
  NP → Det • Noun   An NP is in progress
  VP → V NP •       A VP has been found
• To the dotted rule we add 2 indices that denote where the phrase on the left of the arrow starts (its left edge), and how far to the right the phrase has been constructed so far
• These two elements constitute a 'state'
• So, e.g., [NP → Det • Noun, (1, 2)] means that we started building an NP at word 1 of the input, and we've found the first word, a Det
• A collection of states in a given column (e.g., col. 2) is called a State Set

Dotted rules: run-time conversion to binary branching
• S → • VP, [0,0]          A VP is predicted at the start of the sentence
• NP → Det • Noun, [1,2]   An NP is in progress; the Det goes from 1 to 2
• VP → V NP •, [0,3]       A VP has been found starting at 0 and ending at 3

What are…
• The initial state? S0 = {Start → • S, [0,0]}
• The final state? Sn = {Start → S •, [0,n]}
• How do we get from one State Set to the next? We apply three operations on the items in a State Set until closure
• Not just 'scan', but 2 more to start and stop the building of trees

Earley's algorithm uses 3 basic operations, corresponding to how we construct 'trees'
• Predictor (push)
• Scanner
• Completer (attach, pop)

Earley's algorithm (from textbook)
[Figure: the textbook's pseudocode for the Earley algorithm]

The 3 operations: Predictor ('wishor'), Scanner, Completer
[Figure: definitions of the three operations]

Algorithm is just a loop around this:
1. Add Start → • S to column 0
2. For each j from 0 to n:
   i.
For each dotted rule in column j (including those we add as we go), look at what's after the dot:
      a. If a word, Scan
      b. If a nonterminal, Predict (until closure)
      c. If nothing after the dot, Complete (until closure)
   ii. Output 'yes' if Start → S • is in column n
Note: don't add duplicates! (AWP; but this is tricky to do in practice!)
Question: where is the stack in all of this? (Since we need a stack…)

An Example…
ROOT → S         NP → Papa
S → NP VP        N → caviar
NP → Det N       N → spoon
NP → NP PP       V → ate
VP → VP PP       P → with
VP → V NP        Det → the
PP → P NP        Det → a

Papa ate the caviar with a spoon

Time complexity
• The algorithm iterates for every word of the input (n iterations)
• How many items can be created and processed in Si?
  - Each item in Si has the form [A → X1 … • C … Xm, j], 0 ≤ j ≤ i
  - Thus O(n) items (for a fixed grammar)
• The Scanner and Predictor operations on an item take at worst constant time (why?)
• The Completer operation on an item adds items of the form [B → X1 … A • … Xm, l] to Si, with 0 ≤ l ≤ i, so it may require up to O(n) time for each processed item
• Time required for each iteration Si is thus O(n) items × O(n) time = O(n²), and over n iterations the total is O(n³), as claimed earlier
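The loop-until-closure structure above can be sketched as a small Python recognizer. This is a minimal illustration, not the textbook's code: the function and variable names are my own, and the grammar dictionary encodes the "Papa ate the caviar with a spoon" example from the slides. Items are (lhs, rhs, dot, left-edge) tuples, exactly the "dotted rule plus two indices" notion of state.

```python
# Minimal sketch of an Earley recognizer built around the lecture's three
# operations: Predict (push), Scan, and Complete (attach, pop).
# Grammar: the "Papa ate the caviar with a spoon" example slide.

GRAMMAR = {
    "ROOT": [["S"]],
    "S":    [["NP", "VP"]],
    "NP":   [["Papa"], ["Det", "N"], ["NP", "PP"]],
    "VP":   [["VP", "PP"], ["V", "NP"]],
    "PP":   [["P", "NP"]],
    "N":    [["caviar"], ["spoon"]],
    "V":    [["ate"]],
    "P":    [["with"]],
    "Det":  [["the"], ["a"]],
}

def earley_recognize(words, grammar=GRAMMAR, start="ROOT"):
    """Return True iff the word list is generated by the grammar."""
    n = len(words)
    # chart[j] is State Set S_j; an item (lhs, rhs, dot, left) is the
    # dotted rule  lhs -> rhs[:dot] . rhs[dot:]  started at column `left`.
    chart = [[] for _ in range(n + 1)]

    def add(j, item):
        if item not in chart[j]:          # don't add duplicates!
            chart[j].append(item)

    add(0, ("START", (start,), 0, 0))     # Start -> . S, [0,0]
    for j in range(n + 1):
        k = 0
        while k < len(chart[j]):          # process items added as we go (closure)
            lhs, rhs, dot, left = chart[j][k]
            k += 1
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in grammar:        # nonterminal after the dot: Predict
                    for prod in grammar[sym]:
                        add(j, (sym, tuple(prod), 0, j))
                elif j < n and words[j] == sym:   # word after the dot: Scan
                    add(j + 1, (lhs, rhs, dot + 1, left))
            else:                         # nothing after the dot: Complete
                for lhs2, rhs2, dot2, left2 in chart[left]:
                    if dot2 < len(rhs2) and rhs2[dot2] == lhs:
                        add(j, (lhs2, rhs2, dot2 + 1, left2))
    return ("START", (start,), 1, 0) in chart[n]

print(earley_recognize("Papa ate the caviar with a spoon".split()))  # True
print(earley_recognize("Papa ate".split()))                          # False
```

Note how the duplicate check answers the slide's AWP warning: it is what keeps the left-recursive rules NP → NP PP and VP → VP PP from looping, and the chart's left-edge indices answer the "where is the stack?" question, since the Completer uses them to pop back to the column where each phrase was predicted.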