Berkeley COMPSCI 294 - Lecture Notes

CS 294-5: Statistical Natural Language Processing
Parsing: PCFGs
Dan Klein

Learning vs. Inference
- There are two aspects to parsing:
  - Learning: designing a good grammar (coverage, ambiguity resolution, smoothing).
  - Inference: parsing with a given grammar (runtime, memory load, exact vs. approximate / pruning).
- Today we're only concerned with learning.

Treebank Parsing in 20 Seconds
- Need a PCFG for broad-coverage parsing.
- Can take a grammar right off the trees (doesn't work well):
      ROOT → S          1
      S → NP VP .       1
      NP → PRP          1
      VP → VBD ADJP     1
      ...
- Better results come from enriching the grammar (e.g., lexicalization).
- We'll show that lexicalization isn't necessary for high-performance parsing.

PCFGs and Independence
- The symbols in a PCFG define independence assumptions: at any node, the material inside that node is independent of the material outside that node, given the label of that node.
- Any information that statistically connects behavior inside and outside a node must flow through that node.
  [Figure: tree fragment illustrating the rules S → NP VP and NP → DT NN]

Non-Independence I
- Independence assumptions are often too strong.
- Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects):

      Expansion   All NPs   NPs under S   NPs under VP
      NP PP       11%       9%            23%
      DT NN       9%        9%            7%
      PRP         6%        21%           4%

Non-Independence II
- Who cares? Naive Bayes, HMMs, and the rest all make false assumptions too!
- For generation, the consequences would be obvious; for parsing, does it impact accuracy?
- Symptoms of overly strong assumptions:
  - Rewrites get used where they don't belong. (In the PTB, this construction is for possessives.)
  - Rewrites get used too often or too rarely.

Breaking Up the Symbols
- We can relax independence assumptions by encoding dependencies into the PCFG symbols:
  - Parent annotation [Johnson 98]
  - Marking possessive NPs
- What are the most useful features to encode?

Annotations
- Annotations split the grammar categories into sub-categories.
- Conditioning on history vs. annotating: P(NP^S → PRP) is a lot like P(NP → PRP | S), while P(NP-POS → NNP POS) isn't history conditioning.
- Feature grammars vs. annotation: we can think of a symbol like NP^NP-POS as NP [parent:NP, +POS].
- After parsing with an annotated grammar, the annotations are stripped for evaluation.

The Lexicalization Hammer
- Lexical heads are important for certain classes of ambiguities (e.g., PP attachment).
- Lexicalizing the grammar creates a much larger grammar:
  - Sophisticated smoothing is needed.
  - Smarter parsing algorithms are needed.
  - More data is needed.
- How necessary is lexicalization?
  - Bilexical vs. monolexical selection
  - Closed- vs. open-class lexicalization

Unlexicalized PCFGs
- What do we mean by an "unlexicalized" PCFG?
  - Grammar rules are not systematically specified down to the level of lexical items: NP-stocks is not allowed, but NP^S-CC is fine.
- Closed- vs. open-class words (e.g., NP^S-the):
  - Long tradition in linguistics of using function words as features or markers for selection.
  - Contrary to the bilexical idea of semantic heads.
  - Open-class selection is really a proxy for semantics.
- Honesty checks:
  - Number of symbols: keep the grammar very small.
  - No smoothing: over-annotating is a real danger.

Experimental Setup
- Corpus: Penn Treebank, WSJ.
      Training:      sections 02-21
      Development:   section 22 (first 20 files)
      Test:          section 23
- Accuracy: F1, the harmonic mean of per-node labeled precision and recall.
- Size: the number of symbols in the grammar.
  - Passive / complete symbols: NP, NP^S
  - Active / incomplete symbols: NP → NP CC •

Experimental Process
- We'll take a highly conservative approach:
  - Annotate as sparingly as possible.
  - Seek the highest accuracy with the fewest symbols.
  - Error-driven, manual hill-climb, adding one annotation type at a time.

Horizontal Markovization
- Horizontal Markovization merges states.
  [Figure: accuracy (70-74%) and symbol counts (0-12,000) vs. horizontal Markov order 0, 1, 2v, 2, ∞]

Vertical Markovization
- Vertical Markov order: rewrites depend on the past k ancestor nodes (cf. parent annotation; order 1 vs. order 2).
  [Figure: F1 (72-79%) and symbol counts (0-25,000) vs. vertical Markov order 1, 2v, 2, 3v, 3]

Vertical and Horizontal
- Examples:
  - Raw treebank: v=1, h=∞
  - Johnson 98: v=2, h=∞
  - Collins 99: v=2, h=2
  - Best F1: v=3, h=2
  [Figure: F1 and symbol counts over the grid of vertical and horizontal orders]

      Model           Size   F1
      Base: v=h=2v    7.5K   77.8

Unary Splits
- Problem: unary rewrites are used to transmute categories so a high-probability rule can be used.
- Solution: mark unary rewrite sites with -U.

      Annotation   Size   F1
      Base         7.5K   77.8
      UNARY        8.0K   78.3

Tag Splits
- Problem: treebank tags are too coarse. Example: sentential, PP, and other prepositions are all marked IN.
- Partial solution: subdivide the IN tag.

      Annotation   Size   F1
      Previous     8.0K   78.3
      SPLIT-IN     8.1K   80.3

Other Tag Splits
- UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")
- UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")
- TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)
- SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]
- SPLIT-CC: separate "but" and "&" from other conjunctions
- SPLIT-%: "%" gets its own tag

      Annotation   Size   F1
      UNARY-DT     8.1K   80.4
      UNARY-RB     8.1K   80.5
      TAG-PA       8.5K   81.2
      SPLIT-AUX    9.0K   81.6
      SPLIT-CC     9.1K   81.7
      SPLIT-%      9.3K   81.8

Treebank Splits
- The treebank comes with annotations (e.g., -LOC, -SUBJ, etc.).
- The whole set together hurt the baseline.
- Some (-SUBJ) were less effective than our equivalents.
- One in particular was very useful (NP-TMP) when pushed down to the head tag.
- We marked gapped S nodes as well.

      Annotation   Size   F1
      Previous     9.3K   81.8
      NP-TMP       9.6K   82.2
      GAPPED-S     9.7K   82.3

Yield Splits
- Problem: sometimes the behavior of a category depends on something inside its future yield.
- Examples: possessive NPs, finite vs. infinite VPs, lexical heads!
- Solution: annotate future elements into nodes.

      Annotation   Size   F1
      Previous     9.7K   82.3
      POSS-NP      9.8K   83.1
      SPLIT-VP     10.5K  85.7

Distance / Recursion Splits
- Problem: vanilla PCFGs cannot distinguish attachment heights.
- Solution: mark a property of higher or lower sites:
  - Contains a verb.
  - Is (non-)recursive.
  - Base NPs [cf. Collins 99]
  - Right-recursive

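Parent annotation [Johnson 98], the simplest annotation above, can be sketched as a tree-rewriting pass before reading off rules: rename each phrasal label X to X^PARENT, so that NP-under-S and NP-under-VP become distinct symbols. The tuple encoding matches the toy trees used informally here; whether to annotate preterminal tags too is a design choice (this version leaves tags alone):

```python
def parent_annotate(tree, parent=None):
    """Rename each phrasal label X to X^PARENT (vertical Markov order 2)."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return tree  # preterminal tag over a word: leave unannotated
    new_label = f"{label}^{parent}" if parent else label
    return (new_label,) + tuple(parent_annotate(c, label) for c in children)

t = ("ROOT",
     ("S",
      ("NP", ("PRP", "She")),
      ("VP", ("VBD", "slept"))))
a = parent_annotate(t)
# a == ("ROOT", ("S^ROOT", ("NP^S", ("PRP", "She")), ("VP^S", ("VBD", "slept"))))
```

Reading a grammar off annotated trees then gives P(NP^S → PRP) directly, which is the "annotation is a lot like history conditioning" point from the slides; after parsing, the ^PARENT suffixes are stripped for evaluation.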

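Horizontal Markovization acts on the intermediate symbols introduced when long rules are binarized: with order h, an intermediate state remembers only the last h sibling labels already generated, so states from different rules merge (fewer symbols, more parameter sharing). A sketch under my own state-naming scheme (e.g. "@NP[JJ]"; the real systems differ in bookkeeping details):

```python
def binarize(lhs, rhs, h):
    """Left-to-right binarization of lhs -> rhs with order-h sibling history."""
    rhs = tuple(rhs)
    if len(rhs) <= 2:
        return [(lhs, rhs)]
    out, prev, hist = [], lhs, []
    for sym in rhs[:-2]:
        hist.append(sym)
        context = ",".join(hist[-h:]) if h > 0 else ""
        state = f"@{lhs}[{context}]"  # keeps only the last h siblings
        out.append((prev, (sym, state)))
        prev = state
    out.append((prev, (rhs[-2], rhs[-1])))
    return out

r1 = binarize("NP", ("DT", "JJ", "JJ", "NN"), h=1)
r2 = binarize("NP", ("PRP$", "JJ", "JJ", "NN"), h=1)
# With h=1 both derivations pass through the same state "@NP[JJ]", so their
# tails share statistics; with h=2 the states "@NP[DT,JJ]" and
# "@NP[PRP$,JJ]" remain distinct.
```

This is the "merges states" trade-off plotted in the Horizontal Markovization figure: lower h shrinks the symbol count, while higher h keeps more context at the cost of a larger grammar.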