Intro to Statistical Parsing
Presenter: Benyah Shaparenko
CS 674, 3/14/2005

Statistical Parsing
- Parsing + statistics
- Choose the "best" parse

First Paper
- Eugene Charniak, "Statistical Parsing with a Context-free Grammar and Word Statistics," AAAI-1997

Charniak 1997 Parser
- Penn Treebank grammar, with probabilities
- Uses probabilities to pick a good parse
- Exhaustive search impractical
- Smoothing

Probabilistic Model
- s: sentence, π: parse
- Probabilities calculated bottom-up
- System not guaranteed to find the best (highest-probability) parse

An Example Parse

Probabilities of Constituents
- Constituent: e.g. np, vp, s
- 3 steps (top-down):
  1. Calculate the probability of the head
  2. Calculate the probability of the constituent given the head
  3. Recurse down the parse tree

Step 1: Head Probability
- h(s): head of constituent s
- t: type of s
- h: head of the parent of s
- l: type of the parent of s

Step 1: Example
- p(profits | rose, np, s) is based on…
- p(profits | rose, np, s) = 0
- p(profits | class of rose, np, s) = .00352223
- p(profits | np, s) = .0006274
- p(profits | np) = .000556527

Step 2: Pr(Constituent | Head)
- r: rule used to rewrite the constituent
- h, t, l as before: head, type, parent type

Step 2: Example
- r is (np → adj plural-n), as in "corporate profits"
- p(r | profits, np, s) is based on…
- p(r | profits, np, s) = .1707
- p(r | profits, np) = .1875
- p(r | class of profits, np) = .1192
- p(r | np, s) = .0176
- p(r | np) = .0255

Algorithm
- Get grammar and statistics from the treebank
- Use only constituents likely to appear in probable parses
  - Based on the p(r | t) distribution
- Find the best parse among the probable parses

Experimental Setup
- Training: Sections 2-21 of the Penn Treebank corpus (1 million words)
- Testing: Section 23 (50K words)
- Preliminary testing/tuning: Section 24

Systems Tested
- PCFG: only use the probability p(r | t)
- Minimal: use observed
  p̂(r | h, t, l)
- No classes: all equations except the p̂'s conditioned on c_h (classes of head words)
- Basic: all equations as described earlier
- Full: Basic + 30M words used for unsupervised learning

Metrics
- Labeled recall (LR): #right / #possible
- Labeled precision (LP): #right / #marked
- LR2/LP2: LR/LP ignoring punctuation and collapsing ADVP and PRT
- Crossing brackets (CB): constituents violating correct boundaries
- CB0: no crossing brackets
- CB2: no more than 2 crossing brackets

Results

Comparison w/ Previous Work

Comparison Explanations
- Non-factors:
  - p(s, π) vs. p(π | s)
  - POS tags
  - Formal/explicit grammar?
  - Non-occurrence of grammar rules in data
- Factors: statistics and smoothing
  - Decision tree vs. smoothing equations
  - Word counts vs. classes
  - Unsupervised data: ~.5% LR2

Second Paper
- Michael Collins, "Three Generative, Lexicalised Models for Statistical Parsing," ACL/EACL-1997

Three Generative Models
- Model 1: Collins 1996, but generative
  - Generative: the same T maximizes both p(S, T) and p(T | S)
  - Note: S is Charniak's s, T is Charniak's π
- Model 2: adds a probabilistic complement/adjunct distinction
- Model 3: adds probabilistic wh-movement

Notation
- PCFG: rewrite rules with probabilities
- Lexicalized PCFG: P(h) → L_n(l_n)…L_1(l_1) H(h) R_1(r_1)…R_m(r_m)
- X(x): nonterminal X with <word, POS tag> pair x
- P: parent nonterminal
- H: head child of P, with h being the head word
- L_i and R_i are modifiers of H

An Example Parse

Model 1
- Step 1: generate the head with probability p_H(H | P, h)
- Steps 2, 3: generate left/right modifiers with probabilities
  - p_L(L_i(l_i) | P, h, H)
  - p_R(R_i(r_i) | P, h, H)
- Uses a 0th-order Markov assumption
- Can use distance, e.g.
  p_L(L_i(l_i) | P, h, H, distance_l(i-1))

Model 1 Example
- Top rewrite rule in the example parse
- Head prob.: p_H(VP | S, bought)
- Left prob.: p_L(NP(marks) | S, VP, bought) * p_L(…
- Right prob.: p_R(STOP | S, VP, bought)

Model 2: Subcategorization
- Model 1 + a complement/adjunct distinction
- Example…

Model 2 Motivation
- Parsing info useful for marking complements
- May help accuracy
- Some rules for treebank data: a complement must be NP, SBAR, S, etc. under an S; cannot be ADV, VOC, etc.

Model 2 Probabilities
- Head probability the same: p_H(H | P, h)
- Left/right subcat frames LC and RC, with probabilities p_lc(LC | P, h, H) and p_rc(RC | P, h, H)
- Frames specify the needed complements…
- Left/right probabilities depend on LC/RC, e.g. p_L(L_i(l_i) | P, h, H, distance_l(i-1), LC)

Model 2 Example
- p_L(NP(marks) | S, VP, bought) * p_L(NP(week) | S, VP, bought) * p_L(STOP | S, VP, bought) becomes…
- p_lc({NP-C} | S, VP, bought) * p_L(NP-C(marks) | S, VP, bought, {NP-C}) * p_L(NP(week) | S, VP, bought, {}) * p_L(STOP | S, VP, bought, {})

Model 3: Wh-Movement
- Refers to the effects of a wh-word (e.g. which, where) on a clause
- Parsing solution: a +gap feature which must be matched with a TRACE
  1. +gap is passed to the head of the phrase
  2.
  +gap passed to a left/right modifier, or output as a TRACE

Model 3 Probabilities
- p_G(G | P, h, H), with G one of Head, Left, or Right
- If G = Head, propagate +gap to the head
- If G = Left/Right, add +gap to the left/right subcat variable

Implementation Details
- Smoothing used: backoff… pretty standard
- Unknown words: words occurring < 5 times replaced by "UNKNOWN"
- POS tags: only use tags that appear in the training data

Experimental Results
- Setup: same as Charniak
- Metrics: same, except only "LR2" and "LP2," which are now called LR and LP

The Numbers…

Still Not the End of the Story…
- Collins (1998) applied these techniques to "semantic tagging"
  - Management succession: outgoing manager, new manager, the position
- Charniak (2000) built a maximum-entropy parser
  - Just over 90% LP/LR
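The backoff chains on the two Charniak example slides (Step 1 and Step 2) can be sketched as a linear interpolation of progressively less specific estimates. A minimal sketch: the component estimates below are the numbers from the "Step 1: Example" slide, but the interpolation weights (lambdas) are purely illustrative assumptions; the paper derives its smoothing weights from training data.

```python
def smoothed_head_prob(estimates, lambdas):
    """Linearly interpolate a chain of backed-off estimates.

    estimates: most-specific first, e.g.
        [p(h | parent head, t, l), p(h | class of parent head, t, l),
         p(h | t, l), p(h | t)]
    lambdas: one weight per estimate, summing to 1.
    """
    assert abs(sum(lambdas) - 1.0) < 1e-9
    return sum(lam * p for lam, p in zip(lambdas, estimates))

# Estimates from the slide, for p(profits | rose, np, s):
# the fully conditioned count is 0, so smoothing is what keeps
# the parse from being assigned zero probability.
estimates = [0.0, 0.00352223, 0.0006274, 0.000556527]
lambdas = [0.4, 0.3, 0.2, 0.1]  # hypothetical weights, for illustration

p = smoothed_head_prob(estimates, lambdas)
```

With these (made-up) weights the smoothed probability is small but nonzero, which is exactly the point: an unseen head/parent-head pair does not zero out the whole parse.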