Unformatted text preview:

Lecture 2: Context Free GrammarsIntroduction to Natural Language ProcessingCS 585Fall 2004Andrew McCallumAlso includes material from Chris Manning.September 14, 2004Today’s Main Points• Demo of “hands on” with text, using Unix toolsZiph’s Law• A brief introduction to syntax in NLP.• Define context free grammars.Give some examples.• Chomsky normal form. Converting to it.• Parsing as searchTop-down, bottom up, and the problems with each.• Hand out homework #1.• Read M&S Ch 3, if you haven’t already.Administration• I will be giving research presentations in D.C. next Tuesday & Thursday.• Tuesday: Wei Li with review of probability• Thursday: Aron Culotta: information theory and naive Bayes.• Be sure to subscribe to cs585 mailing list, if you haven’t already.I sent a test message Sunday night.“Hands on” with Text• Get a large body of electronic text (called a corpus).• Browse it.• Use perl and other Unix tools to look at counts of words.What does this distribution look like?Messy, but the following works:cat * | perl -pe ’s/[^a-zA-Z]/\n/g’ | egrep ’.’ \| awk ’{print tolower($0)}’ \| sort | uniq -c | sort -nr | lesscat sawyr10.txt | perl -pe ’s/[^a-zA-Z]/\n/g’ | egrep ’.’ \| awk ’{print tolower($0)}’ \| sort | uniq -c | sort -nr | \perl -pe ’s/[^0-9\n]//g’ | uniq -c | lessWord frequencies in Tom SawyerWord Freq Usethe 3332 determiner (article)and 2972 conjunctiona 1775 determinerto 1725 preposition, verbal infinitive markerof 1440 prepositionwas 1161 auxiliary ve rbit 1027 (personal/expletive) pronounin 906 prepositionthat 877 complementizer, demonstrativehe 877 (personal) pronounI 783 (personal) pronounhis 772 (possessive) pronounyou 686 (personal) pronounTom 679 proper nounwith 642 prepositionFrequencies of frequencies in Tom SawyerWord Frequency Frequency of Frequency1 39932 12923 6644 4105 2436 1997 1728 1319 8210 9111-50 54051-100 99>100 10271,730 word tokens8,018 word typesZiph’s Law in Tom SawyerWord Freq. (f) Rank (r) f * rthe 3332 1 3332and 2972 2 5944a 1775 3 5235he 877 10 8770but 710 20 8400be 294 30 8820there 222 40 8880one 172 50 8600about 158 60 9480more 138 70 9480never 124 80 9920Oh 116 90 10440two 104 100 10400Ziph’s Law in Tom SawyerWord Freq. (f) Rank (r) f * rturned 51 200 10200youll 30 300 9000name 21 400 8400comes 16 500 8000group 13 600 7800lead 11 700 7700friends 10 800 8000begin 9 900 8100family 8 1000 8000brushed 4 2000 8000sins 2 3000 6000Could 2 4000 8000Applausive 1 8000 8000Ziph’s Lawfrequency ∝1rankIn other words, there is a constant, k, such that f · r = k.Language structure and meaningWe want to know how me aning is mapped onto what language structures.Commonly in English in ways like this:[Thing The dog] is [Place in the garden][Thing The dog] is [Property fierce][Action [Thing The dog] is chasing [Thing the cat]][State [Thing The dog] was sitting [Place in the garden] [Timeyesterday]][Action [Thing We] ran [Path out into the water]][Action [Thing The dog] barked [Property/Manner loudly]][Action [Thing The dog] barked [Property/Amount nonstop for fivehours]]Word categories: Traditional parts of speechNoun Names of things boy, cat, truthVerb Action or state become, hitPronoun Used for noun I, you, weAdverb Modifies V, Adj, Adv sadly, veryAdjective Modifies noun happy, cleverConjunction Joins things and, but, whilePreposition Relation of N to, from, intoInterjection An outcry ouch, oh, alas, psstPart of speech “Substitution Test”The {sad, intelligent, green, fat, ...} one is in the corner.ConstituencyThe idea: Groups of words may behave as a single unit or phrase, called aconsituent.E.g. Noun PhraseKermit the frogtheyDecember twenty-sixththe reason he is running for presidentConstituencySentences have parts, some of which appear to have subparts. Thesegroupings of words that go together we will call constituents.(How do we know they go together? Coming in a few slides...)I hit the man with a cleaverI hit [the man with a cleaver]I hit [the man] with a cleaverYou could not go to her partyYou [could not] go to her partyYou could [not go] to her partyConstituent PhrasesFor constituents, we usually name them as phrases based on the word thatheads the constituent:the man from Amherst is a Noun Phrase (NP) because the head man is a nounextremely clever is an Adjective Phrase (AP) because the head clever is an adjectivedown the river is a Prepositional Phrase (PP) because the head down is a prepositionkilled the rabbit is a Verb Phrase (VP) because the head killed is a verbNote that a word is a constituent (a little one). Sometimes words also actas phrases. In:Joe grew potatoes.Joe and potatoes are both nouns and noun phrases.Compare with:The man from Amherst grew beautiful russet potatoes.We say Joe counts as a noun phrase because it appears in a place that alarger noun phrase could have been.Evidence constituency exists1. They appear in similar environments (before a verb)Kermit the frog comes on stageThey come to Massachusetts every summerDecember twenty-sixth comes after ChristmasThe reason he is running for president comes out only now.But not each individual word in the consituent*The comes our... *is comes out... *for comes out...2. The constituent can be placed in a number of different locationsConsituent = Prepositional phrase: On December twenty-sixthOn December twenty-sixth I’d like to fly to Florida.I’d like to fly on December twenty-sixth to Florida.I’d like to fly to Florida on December twenty-sixth.But not split apart*On December I’d like to fly twenty-sixth to Florida.*On I’d like to fly December twenty-sixth to Florida.Context-free grammarThe most common way of modeling constituency.CFG = Context-Free Grammar = Phrase Structure Grammar= BNF = Backus-Naur FormThe idea of basing a grammar on constituent structure dates back toWilhem Wundt (1890), but not formalized until Chomsky (1956), and,independently, by Backus (1959).Context-free grammarG = hT, N, S, Ri• T is set of terminals (lexicon)• N is set of non-terminals For NLP, we usually distinguish out a setP ⊂ N of preterminals which always rewrite as terminals.• S is start symbol (one of the nonterminals)• R is rules/productions of the form X → γ, where X is a nonterminaland γ is a sequence of terminals and nonterminals (may be empty).• A grammar G generates a language L.An example context-free grammarG = hT, N, S, RiT = {that, this, a, the, man, book, flight, meal, include, read,


View Full Document

UMass Amherst CS 585 - Context Free Grammars

Download Context Free Grammars
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Context Free Grammars and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Context Free Grammars 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?