CSCI 5832
Natural Language Processing
Lecture 12
Jim Martin

2/27/07    CSCI 5832 Spring 2006

Today 2/27
  - Review
  - Treebanks
  - Parsing
  - Break
  - More on projects

Avoiding Repeated Work
  - Parsing is hard, and slow. It's wasteful to redo stuff over and over and over.
  - Grammars are ambiguous, both locally and globally, exacerbating the parsing problems.

[Figures: the "flight" example]

Ambiguity
  - For that example, the problem was local ambiguity: at the point a decision was being made, the information needed to make the right decision wasn't there.
  - What about global ambiguity?

Ambiguity
  [figure]

Ambiguity
  - Local ambiguity means that we have to deal with multiple plausible choices during the parsing process.
  - Global ambiguity means that the grammar can't tell us which of several (many) possible parses is the correct one.
  - To deal with these problems we're going to
      - Pursue all possible choices in parallel
      - Store, but not necessarily return, all globally consistent parse trees

Grammars
  - Before you can parse you need a grammar.
  - So where do grammars come from?
      - Grammar Engineering
          - Lovingly hand-crafted, decades-long efforts by humans to write grammars (typically in some particular grammar formalism of interest to the linguists developing the grammar).
      - TreeBanks
          - Semi-automatically generated sets of parse trees for the sentences in some corpus. Typically in a generic lowest-common-denominator formalism (of no particular interest to any modern linguist).

TreeBank Grammars
  - Reading off the grammar...
  - The grammar is the set of rules (local subtrees) that occur in the annotated corpus.
  - They tend to avoid recursion (and elegance and parsimony).
      - I.e., they tend to the flat and redundant.
  - Penn TreeBank III has about 17,500 grammar rules
    under this definition.

TreeBanks
  [figures]

Sample Rules
  [figure]

Example
  [figure]

TreeBanks
  - TreeBanks provide a grammar (of a sort).
  - As we'll see, they also provide the training data for various ML approaches to parsing.
  - But they can also provide useful data for more purely linguistic pursuits.
      - You might have a theory about whether or not something can happen in a particular language.
      - Or a theory about the contexts in which something can happen.
  - TreeBanks can give you the means to explore those theories, if you can formulate the questions in the right way and get the data you need.

Tgrep
  - You might, for example, like to grep through a file filled with trees.

TreeBanks
  - Finally, you should have noted a bit of a circular argument here.
  - Treebanks provide a grammar because we can read the rules of the grammar out of the treebank.
  - But how did the trees get in there in the first place? There must have been a grammar theory in there someplace.

TreeBanks
  - Typically, not all of the sentences are hand-annotated by humans.
  - They're automatically parsed and then hand-corrected.

Break
  - Plan is to have everybody in a group, and all the groups with projects, by Friday.
  - We have a pretty good start on that already.
  - Google "SemEval 2007" and "CoNLL" to get ideas on some interesting tasks.

Parsing
  - We're going to cover, from Chapter 12:
      - CKY (today)
      - Earley (Thursday)
  - Both are dynamic programming solutions that run in O(n^3) time.
      - CKY is bottom-up
      - Earley is top-down

Sample Grammar
  [figure]

Dynamic Programming
  - DP methods fill tables with partial results and
      - Do not do (too much) avoidable repeated work
      - Solve exponential problems in polynomial time (sort of)
      - Efficiently store ambiguous structures with shared sub-parts

CKY Parsing
  - First we'll limit our grammar to epsilon-free, binary rules (more later).
  - Consider the rule A -> B C
      - If there is an A in the input, then there must be a B followed by a C in the input.
      - If the A spans from i to j in the input, then there must be some k s.t. i < k < j.
      - I.e., the B splits from the C someplace.

CKY
  - So let's build a table so that an A spanning from i to j in the input is placed in cell [i, j] in the table.
  - So a non-terminal spanning an entire string will sit in cell [0, n].
  - If we build the table bottom-up, we'll know that the parts of the A must go from i to k and from k to j.

CKY
  - Meaning that for a rule like A -> B C we should look for a B in [i, k] and a C in [k, j].
  - In other words, if we think there might be an A spanning i, j in the input, AND A -> B C is a rule in the grammar, THEN
  - There must be a B in [i, k] and a C in [k, j] for some i < k < j.

CKY
  - So to fill the table, loop over the cell [i, j] values in some systematic way.
      - What constraint should we put on that?
  - For each cell, loop over the appropriate k values to search for things to add.

CKY Table
  [figure]

CKY Algorithm
  [figure]

CKY Parsing
  - Is that really a parser?

Note
  - We arranged the loops to fill the table a column at a time, from left to right, bottom to top.
  - This assures us that whenever we're filling a cell, the parts needed to fill it are already in the table (to the left and below).

Example
  [figure]

Other Ways to Do It
  - Are there any other sensible ways to fill the table that still guarantee that the cells we need are already filled?

Other Ways to Do It
  [figure]

Sample Grammar
  [figure]

…
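
The CKY table-filling loop described in the slides above can be sketched roughly as follows. This is a minimal illustration, not the algorithm figure from the lecture: the function name `cky_recognize`, the dictionaries `LEXICAL` and `BINARY`, and the toy grammar and sentences are all assumptions made up for the example. It fills the table a column at a time, left to right and bottom to top, and for each cell [i, j] tries every split point k with i < k < j. As written it only answers yes or no, so it is a recognizer; recovering actual trees would additionally need backpointers stored in each cell.

```python
from collections import defaultdict

# Hypothetical toy grammar in binary (epsilon-free) form.
# BINARY maps a right-hand side (B, C) to the set of A with A -> B C.
BINARY = {
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
# LEXICAL maps a word to the set of non-terminals that can rewrite to it.
LEXICAL = {
    "the": {"Det"},
    "a": {"Det"},
    "pilot": {"N"},
    "flight": {"N"},
    "booked": {"V"},
}


def cky_recognize(words, lexical, binary, start="S"):
    """Return (accepted, table); table[(i, j)] holds non-terminals spanning i..j."""
    n = len(words)
    table = defaultdict(set)
    for j in range(1, n + 1):                 # columns, left to right
        # Cell [j-1, j] covers the single word words[j-1].
        table[(j - 1, j)] |= lexical.get(words[j - 1], set())
        for i in range(j - 2, -1, -1):        # rows, bottom to top
            for k in range(i + 1, j):         # split points with i < k < j
                for (b, c), parents in binary.items():
                    # A -> B C applies if B is in [i, k] and C is in [k, j].
                    if b in table[(i, k)] and c in table[(k, j)]:
                        table[(i, j)] |= parents
    # A start symbol in cell [0, n] spans the entire input.
    return start in table[(0, n)], table


accepted, table = cky_recognize("the pilot booked a flight".split(), LEXICAL, BINARY)
print(accepted)
```

With this toy grammar the full sentence is accepted because an S ends up in cell [0, n], while a fragment such as "booked a flight" yields only a VP there and is rejected. The three nested loops over j, i, and k are where the O(n^3) behavior comes from.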