CORNELL CS 674 - Study Notes

CS674 Natural Language Processing

Last class
  – Noisy channel model
  – Bayesian method for the pronunciation subproblem in speech recognition
 Today
  – Introduction to generative models of language
    » What are they?
    » Why they're important
    » Issues for counting words
    » Statistics of natural language
    » Unsmoothed n-gram models

Motivation for generative models
 Word prediction
  – Once upon a…
  – I'd like to make a collect…
  – Let's go outside and take a…
 The need for models of word prediction in NLP has not been uncontroversial
  – "But it must be recognized that the notion 'probability of a sentence' is an entirely useless one, under any known interpretation of this term." (Noam Chomsky, 1969)
  – "Every time I fire a linguist the recognition rate improves." (Fred Jelinek, IBM speech group, 1988)

Why are word prediction models important?
 Augmentative communication systems
  – For the disabled, to predict the next words the user wants to "speak"
 Computer-aided education
  – Systems that help kids learn to read (e.g. the Mostow et al. system)
 Speech recognition
  – Use preceding context to improve solutions to the subproblem of pronunciation variation
 Context-sensitive spelling
  – Provide a better model of the prior, P(w)
 Lexical tagging tasks
 …

Why are word prediction models important?
 Closely related to the problem of computing the probability of a sequence of words
  – Can be used to assign a probability to the next word in an incomplete sentence
  – Useful for part-of-speech tagging and probabilistic parsing

N-gram model
 Uses the previous N-1 words to predict the next one
  – 2-gram: bigram
  – 3-gram: trigram
 In speech recognition, these statistical models of word sequences are referred to as a language model

Counting words in corpora
 OK, so how many words are in this sentence?
 Depends on whether or not we treat punctuation marks as words
  – Important for many NLP tasks
    » Grammar checking, spelling error detection, author identification, part-of-speech tagging
 Spoken language corpora
  – Utterances don't usually have punctuation, but they do have other phenomena that we might or might not want to treat as words
    » "I do uh main- mainly business data processing"
  – Fragments
  – Filled pauses
    » um and uh behave more like words, so most speech recognition systems treat them as such

Counting words in corpora
 Capitalization
  – Should They and they be treated as the same word?
    » For most statistical NLP applications, they are
    » Sometimes capitalization information is maintained as a feature, e.g. for spelling error correction and part-of-speech tagging
 Inflected forms
  – Should walks and walk be treated as the same word?
    » No… for most n-gram based systems
    » Such systems are based on the wordform (i.e. the inflected form as it appears in the corpus) rather than the lemma (i.e. the set of lexical forms that have the same stem)

Counting words in corpora
 Need to distinguish
  – word types: the number of distinct words
  – word tokens: the number of running words
 Example (see the code sketch below)
  – All for one and one for all.
  – 8 tokens (counting punctuation)
  – 6 types (assuming capitalized and uncapitalized versions of the same token are treated separately)
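
To make the token/type distinction concrete, here is a minimal Python sketch (not part of the original slides); the regex tokenizer and the case-folded variant are illustrative assumptions, not the course's prescribed tokenization.

import re
from collections import Counter

sentence = "All for one and one for all."

# Treat punctuation marks as separate tokens, as the slide's count does.
tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(tokens)                            # ['All', 'for', 'one', 'and', 'one', 'for', 'all', '.']
print(len(tokens))                       # 8 tokens (punctuation counted)
print(len(set(tokens)))                  # 6 types ('All' and 'all' kept distinct)
print(len({t.lower() for t in tokens}))  # 5 types if capitalization is ignored

# Wordform frequencies after case folding
print(Counter(t.lower() for t in tokens))  # Counter({'all': 2, 'for': 2, 'one': 2, 'and': 1, '.': 1})

Changing the tokenization decisions (dropping punctuation, folding case) changes both counts, which is exactly the point of the slides above.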

Topics for today
 Today
  – Introduction to generative models of language
    » What are they?
    » Why they're important
    » Issues for counting words
    » Statistics of natural language
    » Unsmoothed n-gram models

How many words are there in English?
 Option 1: count the word entries in a dictionary
  – OED: 600,000
  – American Heritage (3rd edition): 200,000
  – These actually count lemmas, not wordforms
 Option 2: estimate from a corpus
  – Switchboard (2.4 million wordform tokens): 20,000 wordform types
  – Shakespeare's complete works: 884,647 wordform tokens; 29,066 wordform types
  – Brown corpus (1 million tokens): 61,805 wordform types → 37,851 lemma types
  – Brown et al. 1992: 583 million wordform tokens; 293,181 wordform types

How are they distributed?
 [Figure: word frequency vs. rank in the frequency list, showing a few very frequent function words, a middle band of content words, and a long tail of rare words]

Statistical properties of text
 The most frequent words in one corpus may be rare words in another corpus
  – Example: "computer" in CACM vs. National Geographic
 Each corpus has a different, fairly small "working vocabulary"
 These properties hold in a wide range of languages
 Zipf's Law relates a term's frequency to its rank
  – frequency ∝ 1/rank
  – There is a constant k such that frequency × rank = k

Zipf's Law (Tom Sawyer)
 [Table of word frequencies and ranks from Tom Sawyer; Manning and Schütze, Statistical NLP]

Zipf's Law
 Useful as a rough description of the frequency distribution of words in human languages
 Similar behavior occurs in a surprising variety of situations
  – English verb polysemy
  – References to scientific papers
  – Web page in-degrees and out-degrees
  – Royalties to pop-music composers

Topics for today
 Today
  – Introduction to generative models of language
    » What are they?
    » Why they're important
    » Issues for counting words
    » Statistics of natural language
    » Unsmoothed n-gram models

Models of word sequences
 Simplest model
  – Let any word follow any other word
    » P(word1 follows word2) = 1 / (# words in English)
 Slightly better: a probability distribution that at least obeys actual relative word frequencies
    » P(word1 follows word2) = (# occurrences of word1) / (# words in English)
 Pay attention to the preceding words
  – "Let's go outside and take a [ ]"
    » walk: very reasonable
    » break: quite reasonable
    » lion: less reasonable
  – Compute the conditional probability P(walk | Let's go…)

Probability of a word sequence
 P(w1, w2, …, wn-1, wn)
  – By the chain rule: P(w1 … wn) = P(w1) P(w2|w1) P(w3|w1 w2) … P(wn|w1 … wn-1)
 Problem? The conditioning histories grow arbitrarily long, so we can never see enough data to estimate them directly.
 Solution: approximate the probability of a word given all the previous words…

N-gram approximations
 Bigram model
  – P(wn | w1 … wn-1) ≈ P(wn | wn-1)
 Trigram model
  – P(wn | w1 … wn-1) ≈ P(wn | wn-2 wn-1)
  – Conditions on the two preceding words
 N-gram approximation
  – P(wn | w1 … wn-1) ≈ P(wn | wn-N+1 … wn-1)
 Markov assumption: the probability of some future event (the next word) depends only on a limited history of preceding events (the previous words)

Bigram grammar fragment (Berkeley Restaurant Project)
 Can compute the probability of a complete string
  – P(I want to eat British food) = P(I|<s>) P(want|I) P(to|want) P(eat|to) P(British|eat) P(food|British)

Training N-gram models
 N-gram models can be trained by counting and normalizing
  – Bigrams: P(wn | wn-1) = C(wn-1 wn) / C(wn-1)
  – General case: P(wn | wn-N+1 … wn-1) = C(wn-N+1 … wn-1 wn) / C(wn-N+1 … wn-1)
  – An example of Maximum Likelihood Estimation (MLE)
    » The resulting parameter set is one in which the likelihood of the training set T given the model M (i.e. P(T|M)) is maximized
 (A small code sketch at the end of these notes walks through counting and normalizing.)

Bigram counts
 [Table of bigram counts from the Berkeley Restaurant Project]
 Note the number of 0's…
 Will look soon at
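
To make the counting-and-normalizing step concrete, here is a minimal Python sketch of an unsmoothed bigram model; the tiny three-sentence corpus, the <s>/</s> boundary markers, and the function names are invented for illustration and are not the Berkeley Restaurant Project data.

from collections import Counter, defaultdict

# Tiny invented training corpus; <s> and </s> mark sentence boundaries.
corpus = [
    "<s> I want to eat British food </s>",
    "<s> I want to eat Chinese food </s>",
    "<s> I want to go </s>",
]

context_counts = Counter()            # C(w_{n-1}): how often each word occurs as a context
bigram_counts = defaultdict(Counter)  # C(w_{n-1} w_n): bigram counts

for sent in corpus:
    words = sent.split()
    for prev, curr in zip(words, words[1:]):
        context_counts[prev] += 1
        bigram_counts[prev][curr] += 1

def p_bigram(curr, prev):
    """MLE estimate: P(curr | prev) = C(prev curr) / C(prev)."""
    return bigram_counts[prev][curr] / context_counts[prev]

def sentence_prob(sentence):
    """Approximate P(w1 ... wn) as a product of bigram probabilities."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, curr in zip(words, words[1:]):
        prob *= p_bigram(curr, prev)
    return prob

print(p_bigram("want", "I"))   # 1.0       (C(I want) = 3, C(I) = 3)
print(p_bigram("eat", "to"))   # 0.666...  (C(to eat) = 2, C(to) = 3)
print(sentence_prob("I want to eat British food"))   # 1/3 with this toy corpus

Because the model is unsmoothed, any bigram unseen in training gets probability 0 (and an unseen context word even causes a division by zero), which is the problem the zero entries in the bigram count table point to.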

