CS 188: Artificial Intelligence
Spring 2006
Lecture 27: NLP
4/27/2006
Dan Klein – UC Berkeley

What is NLP?
- Fundamental goal: deep understanding of broad language, not just string processing or keyword matching!
- End systems that we want to build:
  - Ambitious: speech recognition, machine translation, information extraction, dialog interfaces, question answering…
  - Modest: spelling correction, text categorization…

Why is Language Hard?
- Ambiguity:
  - EYE DROPS OFF SHELF
  - MINERS REFUSE TO WORK AFTER DEATH
  - KILLER SENTENCED TO DIE FOR SECOND TIME IN 10 YEARS
  - LACK OF BRAINS HINDERS RESEARCH

The Big Open Problems
- Machine translation
- Information extraction
- Solid speech recognition
- Deep content understanding

Machine Translation
- Translation systems encode:
  - Something about fluent language
  - Something about how two languages correspond
- SOTA: for easy language pairs, better than nothing, but more an understanding aid than a replacement for human translators

Information Extraction
- Information Extraction (IE): unstructured text to database entries
- SOTA: perhaps 70% accuracy for multi-sentence templates, 90%+ for single easy fields
- Example text: "New York Times Co. named Russell T. Lewis, 45, president and general manager of its flagship New York Times newspaper, responsible for all business-side activities. He was executive vice president and deputy general manager. He succeeds Lance R. Primis, who in September was named president and chief operating officer of the parent."
- Extracted records:
    Person           | Company                  | Post                           | State
    Russell T. Lewis | New York Times newspaper | president and general manager  | start
    Russell T. Lewis | New York Times newspaper | executive vice president       | end
    Lance R. Primis  | New York Times Co.       | president and CEO              | start

Question Answering
- Question Answering: more than search
- Ask general comprehension questions of a document collection:
  - Can be really easy: "What's the capital of Wyoming?"
  - Can be harder: "How many US states' capitals are also their largest cities?"
  - Can be open ended: "What are the main issues in the global warming debate?"
- SOTA: can do factoids, even when the text isn't a perfect match

Models of Language
- Two main ways of modeling language:
  - Language modeling: putting a distribution P(s) over sentences s
    - Useful for modeling fluency in a noisy channel setting, like machine translation or ASR
    - Typically simple models, trained on lots of data
  - Language analysis: determining the structure and/or meaning behind a sentence
    - Useful for deeper processing like information extraction or question answering
    - Starting to be used for MT

The Speech Recognition Problem
- We want to predict a sentence given an acoustic sequence:
    s* = argmax_s P(s | A)
- The noisy channel approach: build a generative model of production (encoding):
    P(A, s) = P(s) P(A | s)
- To decode, we use Bayes' rule to write:
    s* = argmax_s P(s | A) = argmax_s P(s) P(A | s) / P(A) = argmax_s P(s) P(A | s)
- Now, we have to find a sentence maximizing this product.

N-Gram Language Models
- No loss of generality to break sentence probability down with the chain rule:
    P(w_1 w_2 ... w_n) = ∏_i P(w_i | w_1 ... w_{i-1})
- Too many histories!
- N-gram solution: assume each word depends only on a short linear history:
    P(w_1 w_2 ... w_n) = ∏_i P(w_i | w_{i-k} ... w_{i-1})
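The chain-rule and n-gram equations above translate directly into counting. Below is a minimal sketch (not from the lecture) of maximum-likelihood n-gram estimation with START/STOP padding; the helper names and the tiny two-sentence corpus are invented for illustration, and a real system would train on far more data and smooth the counts (see the smoothing discussion below).

```python
from collections import Counter

START, STOP = "<s>", "</s>"

def ngram_counts(corpus, n):
    """Count n-grams over tokenized sentences, padding each sentence with
    n-1 START symbols and one STOP symbol so the model is a proper
    distribution over finite sentences."""
    history_counts, gram_counts = Counter(), Counter()
    for sentence in corpus:
        tokens = [START] * (n - 1) + sentence + [STOP]
        for i in range(n - 1, len(tokens)):
            history = tuple(tokens[i - n + 1:i])
            gram_counts[history + (tokens[i],)] += 1
            history_counts[history] += 1
    return gram_counts, history_counts

def ngram_prob(gram_counts, history_counts, history, word):
    """Maximum-likelihood estimate P(word | history) = c(history, word) / c(history)."""
    h = tuple(history)
    if history_counts[h] == 0:
        return 0.0
    return gram_counts[h + (word,)] / history_counts[h]

# Tiny invented corpus, just to show the mechanics.
corpus = [["i", "like", "ice", "cream"], ["i", "like", "nlp"]]
grams, hists = ngram_counts(corpus, n=2)          # bigram model: one word of history
print(ngram_prob(grams, hists, ["i"], "like"))    # 1.0 -- "like" always follows "i" here
print(ngram_prob(grams, hists, ["like"], "nlp"))  # 0.5
```

Setting n=1 gives the unigram model described next; larger n gives longer histories at the cost of much sparser counts.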
Unigram Models
- Simplest case: unigrams:
    P(w_1 w_2 ... w_n) = ∏_i P(w_i)
- Generative process: pick a word, pick another word, …
- As a graphical model: w_1 → w_2 → ... → w_{n-1} → STOP
- To make this a proper distribution over sentences, we have to generate a special STOP symbol last. (Why?)
- Examples (sentences sampled from a unigram model):
  - [fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass.]
  - [thrift, did, eighty, said, hard, 'm, july, bullish]
  - [that, or, limited, the]
  - []
  - [after, any, on, consistently, hospital, lake, of, of, other, and, factors, raised, analyst, too, allowed, mexico, never, consider, fall, bungled, davison, that, obtain, price, lines, the, to, sass, the, the, further, board, a, details, machinists, the, companies, which, rivals, an, because, longer, oakes, percent, a, they, three, edward, it, currier, an, within, in, three, wrote, is, you, s., longer, institute, dentistry, pay, however, said, possible, to, rooms, hiding, eggs, approximate, financial, canada, the, so, workers, advancers, half, between, nasdaq]

Bigram Models
- Big problem with unigrams: P(the the the the) >> P(I like ice cream)
- Condition on the last word:
    P(w_1 w_2 ... w_n) = ∏_i P(w_i | w_{i-1})
- As a graphical model: START → w_1 → w_2 → ... → w_{n-1} → STOP
- Any better? Examples (sentences sampled from a bigram model):
  - [texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen]
  - [outside, new, car, parking, lot, of, the, agreement, reached]
  - [although, common, shares, rose, forty, six, point, four, hundred, dollars, from, thirty, seconds, at, the, greatest, play, disingenuous, to, be, reset, annually, the, buy, out, of, american, brands, vying, for, mr., womack, currently, sharedata, incorporated, believe, chemical, prices, undoubtedly, will, be, as, much, is, scheduled, to, conscientious, teaching]
  - [this, would, be, a, record, november]

Sparsity
- Problems with n-gram models:
  - New words appear all the time: Synaptitute, 132,701.03, fuzzificational
  - New bigrams: even more often
  - Trigrams or more – still worse!
- [Figure: fraction seen (0 to 1) vs. number of training words (up to 1,000,000), for unigrams, bigrams, and rules]

Zipf's Law
- Types (words) vs. tokens (word occurrences)
- Broadly: most word types are rare
- Specifically: rank word types by token frequency; frequency is inversely proportional to rank
- Not special to language: randomly generated character strings have this property

Smoothing
- We often want to make estimates from sparse statistics, e.g. counts of words following "denied the":
    P(w | denied the): 3 allegations, 2 reports, 1 claims, 1 request (7 total)
- Smoothing flattens spiky distributions so they generalize better:
    P(w | denied the), smoothed: 2.5 allegations, 1.5 reports, 0.5 claims, 0.5 request, 2 other (7 total)
- Very important all over NLP, but easy to do badly!

Phrase Structure Parsing
- Phrase structure parsing organizes syntax into constituents or brackets
- In general, this involves nested trees
- Linguists can, and do, argue about details
- Lots of ambiguity
- Not the only kind of syntax…
- Example: new art critics write reviews with …
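As a concrete companion to the smoothing discussion above, here is a sketch of add-one (Laplace) smoothing, one simple way to flatten a spiky count distribution. It is only illustrative: the function name and the small vocabulary are invented, and the slide's 2.5 / 1.5 / 0.5 numbers come from a different, unspecified discounting scheme.

```python
def add_one_smoothed_probs(counts, vocab):
    """Add-one (Laplace) smoothing: every vocabulary word gets one extra
    pseudo-count, so unseen words receive nonzero probability.
    (A generic illustration, not the lecture's specific scheme.)"""
    total = sum(counts.values()) + len(vocab)
    return {w: (counts.get(w, 0) + 1) / total for w in vocab}

# Observed counts for P(w | denied the), from the slide (7 total).
counts = {"allegations": 3, "reports": 2, "claims": 1, "request": 1}
# A small hypothetical vocabulary including words never seen after "denied the".
vocab = ["allegations", "reports", "claims", "request", "attack", "man", "outcome"]

for w, p in sorted(add_one_smoothed_probs(counts, vocab).items(), key=lambda kv: -kv[1]):
    print(f"{w:12s} {p:.3f}")
# Seen words keep most of the mass; unseen words like "attack" get 1/14 instead of 0.
```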

