CORNELL CS 630 - From Sampling to Model Counting

Unformatted text preview:

CS630 Lecture 19:Modeling syntactic structure: context-free grammarsDate: April 11th, 2006Lecturer: Lillian LeeScribes: Ari Rabkin and Victoria KrafftToday we’ll be looking at syntactic structure, specifically as modeled by Context FreeGrammars (CFGs).Contents1 CFGs 12 Problems with English 43 Questions 51 CFGsUp until now, we’ve only talked about structure within the corpus, and ignored structurewithin the document. It didn’t matter what order the words in the document were in; thedocument was regarded as a bag of words. Today, we’ll begin moving away from that.We’ll begin by looking at the structure of a sentence. We can assign a hierarchical (andordered) constituent analysis. The constituents will be represented as subtrees. For an ex-ample of this, see the figure below, which was on the handout in class. The entire sequenceof words can be assigned the label S, meaning that it forms a constituent of type ”sentence”.This is then split up into a noun phrase NP followed by verb phrase VP. ( We will be usingthese syntactic categories rather than the more “functional” subject/predicate categories.)This structure c an also be represented using brackets, where each sequence of wordsmaking up a constituent is surrounded by a pair of brackets with a label. Brackets may beembedded, but may not overlap.The smallest unit that a word will be part of is a preterminal, also known as a part ofspeech. For example, a noun, verb, or preposition would be a preterminal.We observe “X-bar-like” patterns, where a phrase XP always has a subconstituent ofcategory X, which we call the “head” of the phrase. The head’s features “propagate” to theXP. So, a noun phrase NP will always contain a noun N, and a verb phase VP will alwayscontain some verb V, and if the head V is a past-tense verb, then the VP is also past-tense.1Figure 1: A sentence divided up into its constituentsIn general, two constituents with the same label can be substituted for each other. Forexample, in bracke t notation, we have:[[police]NP[[put]V[barricades]N[[around]Pr[upson]N]PrP]VP]SWith substitutability, this can become:[[police]NP[[put]V[daffodils]N[[around]Pr[upson]N]PrP]VP]Sor[[police]NP[[put]V[barricades]N[[near]Pr[upson]N]PrP]VP]SWe do not require these trees to be binary1. If we try to create a binary tree using place-holder categories, then problems arise. Consider a “movement-based argument”. We willprovisionally assume that if parts of a sentence can be rearranged, those parts are meaningfulconstituents. Examples I(c), I(d) and I(e) below demonstrate the results of this. In I(c), wemoved ‘barricades’ and changed the sentence to the interrogative form, which produces areasonable sentence in English. In I(d), we did the same thing, moving and changing a prepo-sitional phrase. Example I(e) shows that when binary trees are used, so that “barricadesaround upson” is presumed to form a constituent, the result does not produce reasonablesentence s in English.Examples:I (c) : [What] will police [[put] [around upson]]1Contrast the use in formal language theory of the Chomsky normal form.2I (d) : [Where] will police [[put] [barricades]]I (e) : [What] [where] will police [[put]] / [What] will police [[put]]Can we use context-free grammars (CFGs) to describe exactly the set of legal syntacticanalyses, and in particular, capture important regularities?A CFG consists of the following elements:• A finite set of terminals (lexical items such as words)• A finite set of nonterminals (constituent labels, including a distinguished set of preter-minals); this set is disjoint from the set of terminals.• A distinguished start symbol S• A finite set of rewrite rules or productions for the subconstituent expansions that areallowed. For example, we might allow S → NP VP.Typically preterminal expansions, such as N → police, are stored separately in a lexicon(dictionary). This lexicon specifies, at a minimum, what part of speech each word in thevocabulary is. (Later we will see that much more information can be stored there as well.)Separating the lexicon from the “true grammar” makes it easier to add words to the vocab-ulary without modifying all the rules.To generate a tree from a CFG, one does the following:1. Start with the “degenerate” tree with a single node, labeled with the start symbol S.2. Expand a leaf nonterminal according to the productions of the CFG.3. Repeat above step until only terminals remain..A CFG that generates Ib must contain, at a minimum, the following productions:S → NP VPVP → V NP PrPPrP → Pr NNP → NN → policeN → upsonN → barricadesV → putPr → aroundHowever, these productions can always be applied regardless of the surrounding contextof the nonterminal to be decomposed (hence the name “context-free”, so we could generate“upson put upson around upson” as well. Any CFG which generates a tree where a given3nonterminal appears more than once with a different sequence of terminals at the leaves canproduce multiple trees, since subtrees with identical root nodes c an be replaced with oneanother.2 Problems with EnglishSo far, there is no CFG that generates all the legal syntactic structures of English whileonly generating legal syntactic structures of English (even assuming that such a thing iswell- defined). Indeed, there have been arguments made that CFGs are not powerful enoughto model all natural-language phenomena. Regardless of the theoretical issues, there aredefinitely a variety of practical issues when it comes to creating a CFG for English.1. Category proliferation: Example II on the handout shows several instances of thisproblem.IIa: police [ [informed]V[the president]NP[that students had hired lawyers]S0]VPThis is a legitimate sentence. Therefore, we might imagine that VP → V NP S0is an accurate general rule for English (here we are using S’ to mean “clause”).IIb: *2police [informs]Vthe president that students had hired lawyersIn this sentence, we have replaced the verb “informed” with the verb “informs”.This demonstrates that the posited rule allows agreement mismatch: “informs”is singular, but “police” is plural.IIc: * police informed [she]NPthat students had hired lawyersThe NP “the president” has been replaced with the NP “she”. This demonstratesthat the posited rule allows a case mismatch, which is one instance of a subcate-gorization error. The NP “she” is a direct object, so it must be


View Full Document

CORNELL CS 630 - From Sampling to Model Counting

Download From Sampling to Model Counting
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view From Sampling to Model Counting and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view From Sampling to Model Counting 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?