CS 188: Artificial Intelligence
Spring 2006
Lecture 27: NLP
4/27/2006
Dan Klein – UC Berkeley

What is NLP?
- Fundamental goal: deep understanding of broad language, not just string processing or keyword matching!
- End systems that we want to build:
  - Ambitious: speech recognition, machine translation, information extraction, dialog interfaces, question answering…
  - Modest: spelling correction, text categorization…

Why is Language Hard?
- Ambiguity:
  - EYE DROPS OFF SHELF
  - MINERS REFUSE TO WORK AFTER DEATH
  - KILLER SENTENCED TO DIE FOR SECOND TIME IN 10 YEARS
  - LACK OF BRAINS HINDERS RESEARCH

The Big Open Problems
- Machine translation
- Information extraction
- Solid speech recognition
- Deep content understanding

Machine Translation
- Translation systems encode:
  - Something about fluent language
  - Something about how two languages correspond
- SOTA: for easy language pairs, better than nothing, but more an understanding aid than a replacement for human translators

Information Extraction
- Information Extraction (IE): unstructured text to database entries
- SOTA: perhaps 70% accuracy for multi-sentence templates, 90%+ for single easy fields
- Example text: "New York Times Co. named Russell T. Lewis, 45, president and general manager of its flagship New York Times newspaper, responsible for all business-side activities. He was executive vice president and deputy general manager. He succeeds Lance R. Primis, who in September was named president and chief operating officer of the parent."
- Extracted records:
    Person           | Company                  | Post                           | State
    Russell T. Lewis | New York Times newspaper | president and general manager  | start
    Russell T. Lewis | New York Times newspaper | executive vice president       | end
    Lance R. Primis  | New York Times Co.       | president and CEO              | start

Question Answering
- Question Answering: more than search
- Ask general comprehension questions of a document collection:
  - Can be really easy: "What's the capital of Wyoming?"
  - Can be harder: "How many US states' capitals are also their largest cities?"
  - Can be open ended: "What are the main issues in the global warming debate?"
- SOTA: can do factoids, even when the text isn't a perfect match

Models of Language
- Two main ways of modeling language:
  - Language modeling: putting a distribution P(s) over sentences s
    - Useful for modeling fluency in a noisy channel setting, like machine translation or ASR
    - Typically simple models, trained on lots of data
  - Language analysis: determining the structure and/or meaning behind a sentence
    - Useful for deeper processing like information extraction or question answering
    - Starting to be used for MT

The Speech Recognition Problem
- We want to predict a sentence given an acoustic sequence:
    s* = argmax_s P(s | A)
- The noisy channel approach: build a generative model of production (encoding):
    P(A, s) = P(s) P(A | s)
- To decode, we use Bayes' rule to write:
    s* = argmax_s P(s | A) = argmax_s P(s) P(A | s) / P(A) = argmax_s P(s) P(A | s)
- Now, we have to find a sentence maximizing this product.

N-Gram Language Models
- No loss of generality to break sentence probability down with the chain rule:
    P(w_1 w_2 ... w_n) = ∏_i P(w_i | w_1 ... w_{i-1})
- Too many histories!
- N-gram solution: assume each word depends only on a short linear history:
    P(w_1 w_2 ... w_n) = ∏_i P(w_i | w_{i-k} ... w_{i-1})
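The chain-rule and n-gram equations above translate directly into counting. Below is a minimal sketch (not from the lecture) of maximum-likelihood n-gram estimation with START/STOP padding; the helper names and the tiny two-sentence corpus are invented for illustration, and a real system would train on far more data and smooth the counts (see the smoothing discussion below).

```python
from collections import Counter

START, STOP = "<s>", "</s>"

def ngram_counts(corpus, n):
    """Count n-grams over tokenized sentences, padding each sentence with
    n-1 START symbols and one STOP symbol so the model is a proper
    distribution over finite sentences."""
    history_counts, gram_counts = Counter(), Counter()
    for sentence in corpus:
        tokens = [START] * (n - 1) + sentence + [STOP]
        for i in range(n - 1, len(tokens)):
            history = tuple(tokens[i - n + 1:i])
            gram_counts[history + (tokens[i],)] += 1
            history_counts[history] += 1
    return gram_counts, history_counts

def ngram_prob(gram_counts, history_counts, history, word):
    """Maximum-likelihood estimate P(word | history) = c(history, word) / c(history)."""
    h = tuple(history)
    if history_counts[h] == 0:
        return 0.0
    return gram_counts[h + (word,)] / history_counts[h]

# Tiny invented corpus, just to show the mechanics.
corpus = [["i", "like", "ice", "cream"], ["i", "like", "nlp"]]
grams, hists = ngram_counts(corpus, n=2)          # bigram model: one word of history
print(ngram_prob(grams, hists, ["i"], "like"))    # 1.0 -- "like" always follows "i" here
print(ngram_prob(grams, hists, ["like"], "nlp"))  # 0.5
```

Setting n=1 gives the unigram model described next; larger n gives longer histories at the cost of much sparser counts.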
Unigram Models
- Simplest case: unigrams:
    P(w_1 w_2 ... w_n) = ∏_i P(w_i)
- Generative process: pick a word, pick another word, …
- As a graphical model: w_1 → w_2 → ... → w_{n-1} → STOP
- To make this a proper distribution over sentences, we have to generate a special STOP symbol last. (Why?)
- Examples (sentences sampled from a unigram model):
  - [fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass.]
  - [thrift, did, eighty, said, hard, 'm, july, bullish]
  - [that, or, limited, the]
  - []
  - [after, any, on, consistently, hospital, lake, of, of, other, and, factors, raised, analyst, too, allowed, mexico, never, consider, fall, bungled, davison, that, obtain, price, lines, the, to, sass, the, the, further, board, a, details, machinists, the, companies, which, rivals, an, because, longer, oakes, percent, a, they, three, edward, it, currier, an, within, in, three, wrote, is, you, s., longer, institute, dentistry, pay, however, said, possible, to, rooms, hiding, eggs, approximate, financial, canada, the, so, workers, advancers, half, between, nasdaq]

Bigram Models
- Big problem with unigrams: P(the the the the) >> P(I like ice cream)
- Condition on the last word:
    P(w_1 w_2 ... w_n) = ∏_i P(w_i | w_{i-1})
- As a graphical model: START → w_1 → w_2 → ... → w_{n-1} → STOP
- Any better? Examples (sentences sampled from a bigram model):
  - [texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen]
  - [outside, new, car, parking, lot, of, the, agreement, reached]
  - [although, common, shares, rose, forty, six, point, four, hundred, dollars, from, thirty, seconds, at, the, greatest, play, disingenuous, to, be, reset, annually, the, buy, out, of, american, brands, vying, for, mr., womack, currently, sharedata, incorporated, believe, chemical, prices, undoubtedly, will, be, as, much, is, scheduled, to, conscientious, teaching]
  - [this, would, be, a, record, november]

Sparsity
- Problems with n-gram models:
  - New words appear all the time: Synaptitute, 132,701.03, fuzzificational
  - New bigrams: even more often
  - Trigrams or more – still worse!
- [Figure: fraction seen (0 to 1) vs. number of training words (up to 1,000,000), for unigrams, bigrams, and rules]

Zipf's Law
- Types (words) vs. tokens (word occurrences)
- Broadly: most word types are rare
- Specifically: rank word types by token frequency; frequency is inversely proportional to rank
- Not special to language: randomly generated character strings have this property

Smoothing
- We often want to make estimates from sparse statistics, e.g. counts of words following "denied the":
    P(w | denied the): 3 allegations, 2 reports, 1 claims, 1 request (7 total)
- Smoothing flattens spiky distributions so they generalize better:
    P(w | denied the), smoothed: 2.5 allegations, 1.5 reports, 0.5 claims, 0.5 request, 2 other (7 total)
- Very important all over NLP, but easy to do badly!

Phrase Structure Parsing
- Phrase structure parsing organizes syntax into constituents or brackets
- In general, this involves nested trees
- Linguists can, and do, argue about details
- Lots of ambiguity
- Not the only kind of syntax…
- Example: new art critics write reviews with …
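As a concrete companion to the smoothing discussion above, here is a sketch of add-one (Laplace) smoothing, one simple way to flatten a spiky count distribution. It is only illustrative: the function name and the small vocabulary are invented, and the slide's 2.5 / 1.5 / 0.5 numbers come from a different, unspecified discounting scheme.

```python
def add_one_smoothed_probs(counts, vocab):
    """Add-one (Laplace) smoothing: every vocabulary word gets one extra
    pseudo-count, so unseen words receive nonzero probability.
    (A generic illustration, not the lecture's specific scheme.)"""
    total = sum(counts.values()) + len(vocab)
    return {w: (counts.get(w, 0) + 1) / total for w in vocab}

# Observed counts for P(w | denied the), from the slide (7 total).
counts = {"allegations": 3, "reports": 2, "claims": 1, "request": 1}
# A small hypothetical vocabulary including words never seen after "denied the".
vocab = ["allegations", "reports", "claims", "request", "attack", "man", "outcome"]

for w, p in sorted(add_one_smoothed_probs(counts, vocab).items(), key=lambda kv: -kv[1]):
    print(f"{w:12s} {p:.3f}")
# Seen words keep most of the mass; unseen words like "attack" get 1/14 instead of 0.
```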

