Berkeley COMPSCI 294 - N-Grams and Smoothing

CS 294-5: Statistical Natural Language Processing
N-Grams and Smoothing
Lecture 2: 8/31/05

Speech in a Slide (or Three)
- Speech input is an acoustic wave form. [Figure: waveform of "s p ee ch l a b"; graphs from Simon Arnfield's web tutorial on speech, Sheffield: http://www.psyc.leeds.ac.uk/research/cogn/speech/tutorial/; some later bits from Joshua Goodman's LM tutorial]
- "l" to "a" transition (close-up in figure): frequency gives pitch; amplitude gives volume.
- Sampling at ~8 kHz for phone, ~16 kHz for microphone (kHz = 1000 cycles/sec).
- Fourier transform of the wave displayed as a spectrogram; darkness indicates energy at each frequency.

Spectral Analysis
[Figure: waveform (amplitude) and spectrogram (frequency) of "s p ee ch l a b"]

Acoustic Feature Sequence
- Time slices are translated into acoustic feature vectors (~15 real numbers per slice). [Figure: spectrogram sliced into frames, each mapped to a feature vector a12, a13, a12, a14, a14, ...]
- Now we have to figure out a mapping from sequences of acoustic observations to words.

The Speech Recognition Problem
- We want to predict a sentence given an acoustic sequence:
      s* = argmax_s P(s | A)
- The noisy channel approach: build a generative model of production (encoding):
      P(A, s) = P(s) P(A | s)
- To decode, we use Bayes' rule to write:
      s* = argmax_s P(s | A) = argmax_s P(s) P(A | s) / P(A) = argmax_s P(s) P(A | s)
- Now we have to find a sentence maximizing this product. Why is this progress?

Other Noisy-Channel Processes
- Handwriting recognition: P(text | strokes) ∝ P(text) P(strokes | text)
- OCR: P(text | pixels) ∝ P(text) P(pixels | text)
- Spelling correction: P(text | typos) ∝ P(text) P(typos | text)
- Translation? P(english | french) ∝ P(english) P(french | english)
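The final rewrite above is why the noisy-channel view is progress: the decoder scores each candidate sentence with two separately trained models, a source (language) model P(s) and a channel (acoustic) model P(A | s), and P(A) can be dropped because it is the same for every candidate. A minimal sketch of that argmax in Python, assuming log-probability functions and a finite candidate list are already available (the function and argument names are illustrative, not from the lecture):

import math

def noisy_channel_decode(candidates, log_p_source, log_p_channel, observation):
    # Return argmax_s P(s) * P(observation | s), computed in log space so that
    # P(observation) never needs to be evaluated: it is constant across
    # candidates and cannot change the argmax.
    best, best_score = None, -math.inf
    for s in candidates:
        score = log_p_source(s) + log_p_channel(observation, s)
        if score > best_score:
            best, best_score = s, score
    return best

The same skeleton fits the other noisy-channel processes on the slide: swap in a channel model for strokes, pixels, typos, or French given English.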
Just a Code?
- "Also knowing nothing official about, but having guessed and inferred considerable about, the powerful new mechanized methods in cryptography—methods which I believe succeed even when one does not know what language has been coded—one naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.'"
  Warren Weaver (1955:18, quoting a letter he wrote in 1947)

MT System Components
[Diagram: a source model P(e) (the language model) generates English e; a channel model P(f|e) (the translation model) produces the observed French f; the decoder outputs the best e via argmax_e P(e|f) = argmax_e P(f|e) P(e)]

Levels of Transfer (Vauquois triangle)
[Diagram: source text → morphological analysis → syntactic analysis → semantic analysis → interlingua, then back down through semantic, syntactic, and morphological generation to the target text; shortcuts across the triangle at the word level (direct), the syntactic-structure level (syntactic transfer), and the semantic-structure level (semantic transfer)]

Probabilistic Language Models
- Want to build models which assign scores to sentences:
      P(I saw a van) >> P(eyes awe of an)
- Not really grammaticality: P(artichokes intimidate zippers) ≈ 0
- One option: an empirical distribution over sentences? Problem: it doesn't generalize (at all).
- Two ways of generalizing:
  - Decomposition: sentences are generated in small steps which can be recombined in other ways.
  - Smoothing: allow for the possibility of unseen events.

N-Gram Language Models
- No loss of generality to break sentence probability down with the chain rule:
      P(w_1 w_2 ... w_n) = ∏_i P(w_i | w_1 ... w_{i-1})
- Too many histories!
- N-gram solution: assume each word depends only on a short linear history:
      P(w_1 w_2 ... w_n) = ∏_i P(w_i | w_{i-k} ... w_{i-1})

Regular Languages?
- N-gram models are (weighted) regular processes.
- Why can't we model language like this? Linguists have many arguments why language can't be merely regular. Long-distance effects: "The computer which I had just put into the machine room on the fifth floor crashed."
- Why CAN we often get away with n-gram models?
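To make the Markov factorization concrete, here is a small sketch, not from the lecture, of the bigram case: relative-frequency (maximum-likelihood) estimates P(w | v) = count(v, w) / count(v), with <START> and <STOP> markers so the product defines a proper distribution over whole sentences. The tiny corpus, marker tokens, and function names below are invented for illustration:

from collections import defaultdict

def train_bigram_mle(sentences):
    # Relative-frequency bigram estimates: P(w | v) = count(v, w) / count(v).
    # `sentences` is an iterable of token lists.
    context_counts = defaultdict(int)
    bigram_counts = defaultdict(int)
    for sentence in sentences:
        tokens = ["<START>"] + sentence + ["<STOP>"]
        for v, w in zip(tokens, tokens[1:]):
            context_counts[v] += 1
            bigram_counts[(v, w)] += 1
    def p(w, v):
        return bigram_counts[(v, w)] / context_counts[v] if context_counts[v] else 0.0
    return p

def sentence_probability(p_bigram, sentence):
    # Chain rule with the bigram Markov assumption:
    # P(w_1 ... w_n STOP) = prod_i P(w_i | w_{i-1})
    prob = 1.0
    tokens = ["<START>"] + sentence + ["<STOP>"]
    for v, w in zip(tokens, tokens[1:]):
        prob *= p_bigram(w, v)  # zero if the bigram (v, w) was never observed
    return prob

p = train_bigram_mle([["I", "saw", "a", "van"], ["I", "saw", "a", "movie"]])
print(sentence_probability(p, ["I", "saw", "a", "van"]))     # 0.5
print(sentence_probability(p, ["eyes", "awe", "of", "an"]))  # 0.0

Any sentence containing a bigram that never occurred in training gets probability exactly zero; that failure to generalize is what the smoothing part of this lecture addresses.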
Unigram Models
- Simplest case: unigrams:
      P(w_1 w_2 ... w_n) = ∏_i P(w_i)
- Generative process: pick a word, pick a word, ...
- As a graphical model: w_1, w_2, ..., w_{n-1}, STOP, each word generated independently.
- To make this a proper distribution over sentences, we have to generate a special STOP symbol last. (Why?)
- Examples:
  [fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass.]
  [thrift, did, eighty, said, hard, 'm, july, bullish]
  [that, or, limited, the]
  []
  [after, any, on, consistently, hospital, lake, of, of, other, and, factors, raised, analyst, too, allowed, mexico, never, consider, fall, bungled, davison, that, obtain, price, lines, the, to, sass, the, the, further, board, a, details, machinists, the, companies, which, rivals, an, because, longer, oakes, percent, a, they, three, edward, it, currier, an, within, in, three, wrote, is, you, s., longer, institute, dentistry, pay, however, said, possible, to, rooms, hiding, eggs, approximate, financial, canada, the, so, workers, advancers, half, between, nasdaq]

Bigram Models
- Big problem with unigrams: P(the the the the) >> P(I like ice cream)!
- Condition on the last word:
      P(w_1 w_2 ... w_n) = ∏_i P(w_i | w_{i-1})
- As a graphical model: START, w_1, w_2, ..., w_{n-1}, STOP, each word conditioned on the previous one.
- Any better? Examples:
  [texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen]
  [outside, new, car, parking, lot, of, the, agreement, reached]
  [although, common, shares, rose, forty, six, point, four, hundred, dollars, from, thirty, seconds, at, the, greatest, play, disingenuous, to, be, reset, annually, the, buy, out, of, american, brands, vying, for, mr., womack, currently, sharedata, incorporated, believe, chemical, prices, undoubtedly, will, be, as, much, is, scheduled, to, conscientious, teaching]
  [this, would, be, a, record, november]

Is This Working?
- The game isn't to pound out fake sentences!
- What we really want to know is: will our model prefer good sentences to bad ones?
- Bad ≠ ungrammatical! Bad ≈ unlikely.
- Bad = sentences that our acoustic model really likes but aren't the correct answer.

Measuring Model Quality
- Word Error Rate (WER): the "right" measure:
  - Task-error driven
  - For speech recognition
  - For a specific recognizer!
      Correct answer:    Andy saw a part of the movie
      Recognizer output: And he saw apart of the movie
      WER = (insertions + deletions + substitutions) / true sentence size
      WER: 4/7 = 57%
- For general evaluation, we want a measure which references only good text, not mistake text.
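The WER example above can be reproduced with a standard word-level edit-distance (Levenshtein) dynamic program. A small sketch; the function name is mine, the example strings are the slide's:

def word_error_rate(reference, hypothesis):
    # WER = (insertions + deletions + substitutions) / reference length,
    # where the edit counts come from word-level Levenshtein distance.
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("Andy saw a part of the movie",
                      "And he saw apart of the movie"))  # 4 edits / 7 words ≈ 0.57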


Measuring Model Quality
- The Shannon Game: How well ...
