Berkeley COMPSCI 294 - N-Grams and Smoothing

CS 294-5: Statistical Natural Language Processing
N-Grams and Smoothing
Lecture 2: 8/31/05

Speech in a Slide (or Three)
- Speech input is an acoustic wave form. [Figure: waveform of "s p ee ch l a b"; graphs from Simon Arnfield's web tutorial on speech, Sheffield: http://www.psyc.leeds.ac.uk/research/cogn/speech/tutorial/; some later bits from Joshua Goodman's LM tutorial]
- "l" to "a" transition (close-up in figure): frequency gives pitch; amplitude gives volume.
- Sampling at ~8 kHz for phone, ~16 kHz for microphone (kHz = 1000 cycles/sec).
- Fourier transform of the wave displayed as a spectrogram; darkness indicates energy at each frequency.

Spectral Analysis
[Figure: waveform (amplitude) and spectrogram (frequency) of "s p ee ch l a b"]

Acoustic Feature Sequence
- Time slices are translated into acoustic feature vectors (~15 real numbers per slice). [Figure: spectrogram sliced into frames, each mapped to a feature vector a12, a13, a12, a14, a14, ...]
- Now we have to figure out a mapping from sequences of acoustic observations to words.

The Speech Recognition Problem
- We want to predict a sentence given an acoustic sequence:
      s* = argmax_s P(s | A)
- The noisy channel approach: build a generative model of production (encoding):
      P(A, s) = P(s) P(A | s)
- To decode, we use Bayes' rule to write:
      s* = argmax_s P(s | A) = argmax_s P(s) P(A | s) / P(A) = argmax_s P(s) P(A | s)
- Now we have to find a sentence maximizing this product. Why is this progress?

Other Noisy-Channel Processes
- Handwriting recognition: P(text | strokes) ∝ P(text) P(strokes | text)
- OCR: P(text | pixels) ∝ P(text) P(pixels | text)
- Spelling correction: P(text | typos) ∝ P(text) P(typos | text)
- Translation? P(english | french) ∝ P(english) P(french | english)
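The final rewrite above is why the noisy-channel view is progress: the decoder scores each candidate sentence with two separately trained models, a source (language) model P(s) and a channel (acoustic) model P(A | s), and P(A) can be dropped because it is the same for every candidate. A minimal sketch of that argmax in Python, assuming log-probability functions and a finite candidate list are already available (the function and argument names are illustrative, not from the lecture):

import math

def noisy_channel_decode(candidates, log_p_source, log_p_channel, observation):
    # Return argmax_s P(s) * P(observation | s), computed in log space so that
    # P(observation) never needs to be evaluated: it is constant across
    # candidates and cannot change the argmax.
    best, best_score = None, -math.inf
    for s in candidates:
        score = log_p_source(s) + log_p_channel(observation, s)
        if score > best_score:
            best, best_score = s, score
    return best

The same skeleton fits the other noisy-channel processes on the slide: swap in a channel model for strokes, pixels, typos, or French given English.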
Just a Code?
- "Also knowing nothing official about, but having guessed and inferred considerable about, the powerful new mechanized methods in cryptography—methods which I believe succeed even when one does not know what language has been coded—one naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.'"
  Warren Weaver (1955:18, quoting a letter he wrote in 1947)

MT System Components
[Diagram: a source model P(e) (the language model) generates English e; a channel model P(f|e) (the translation model) produces the observed French f; the decoder outputs the best e via argmax_e P(e|f) = argmax_e P(f|e) P(e)]

Levels of Transfer (Vauquois triangle)
[Diagram: source text → morphological analysis → syntactic analysis → semantic analysis → interlingua, then back down through semantic, syntactic, and morphological generation to the target text; shortcuts across the triangle at the word level (direct), the syntactic-structure level (syntactic transfer), and the semantic-structure level (semantic transfer)]

Probabilistic Language Models
- Want to build models which assign scores to sentences:
      P(I saw a van) >> P(eyes awe of an)
- Not really grammaticality: P(artichokes intimidate zippers) ≈ 0
- One option: an empirical distribution over sentences? Problem: it doesn't generalize (at all).
- Two ways of generalizing:
  - Decomposition: sentences are generated in small steps which can be recombined in other ways.
  - Smoothing: allow for the possibility of unseen events.

N-Gram Language Models
- No loss of generality to break sentence probability down with the chain rule:
      P(w_1 w_2 ... w_n) = ∏_i P(w_i | w_1 ... w_{i-1})
- Too many histories!
- N-gram solution: assume each word depends only on a short linear history:
      P(w_1 w_2 ... w_n) = ∏_i P(w_i | w_{i-k} ... w_{i-1})

Regular Languages?
- N-gram models are (weighted) regular processes.
- Why can't we model language like this? Linguists have many arguments why language can't be merely regular. Long-distance effects: "The computer which I had just put into the machine room on the fifth floor crashed."
- Why CAN we often get away with n-gram models?
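To make the Markov factorization concrete, here is a small sketch, not from the lecture, of the bigram case: relative-frequency (maximum-likelihood) estimates P(w | v) = count(v, w) / count(v), with <START> and <STOP> markers so the product defines a proper distribution over whole sentences. The tiny corpus, marker tokens, and function names below are invented for illustration:

from collections import defaultdict

def train_bigram_mle(sentences):
    # Relative-frequency bigram estimates: P(w | v) = count(v, w) / count(v).
    # `sentences` is an iterable of token lists.
    context_counts = defaultdict(int)
    bigram_counts = defaultdict(int)
    for sentence in sentences:
        tokens = ["<START>"] + sentence + ["<STOP>"]
        for v, w in zip(tokens, tokens[1:]):
            context_counts[v] += 1
            bigram_counts[(v, w)] += 1
    def p(w, v):
        return bigram_counts[(v, w)] / context_counts[v] if context_counts[v] else 0.0
    return p

def sentence_probability(p_bigram, sentence):
    # Chain rule with the bigram Markov assumption:
    # P(w_1 ... w_n STOP) = prod_i P(w_i | w_{i-1})
    prob = 1.0
    tokens = ["<START>"] + sentence + ["<STOP>"]
    for v, w in zip(tokens, tokens[1:]):
        prob *= p_bigram(w, v)  # zero if the bigram (v, w) was never observed
    return prob

p = train_bigram_mle([["I", "saw", "a", "van"], ["I", "saw", "a", "movie"]])
print(sentence_probability(p, ["I", "saw", "a", "van"]))     # 0.5
print(sentence_probability(p, ["eyes", "awe", "of", "an"]))  # 0.0

Any sentence containing a bigram that never occurred in training gets probability exactly zero; that failure to generalize is what the smoothing part of this lecture addresses.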
Unigram Models
- Simplest case: unigrams:
      P(w_1 w_2 ... w_n) = ∏_i P(w_i)
- Generative process: pick a word, pick a word, ...
- As a graphical model: w_1, w_2, ..., w_{n-1}, STOP, each word generated independently.
- To make this a proper distribution over sentences, we have to generate a special STOP symbol last. (Why?)
- Examples:
  [fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass.]
  [thrift, did, eighty, said, hard, 'm, july, bullish]
  [that, or, limited, the]
  []
  [after, any, on, consistently, hospital, lake, of, of, other, and, factors, raised, analyst, too, allowed, mexico, never, consider, fall, bungled, davison, that, obtain, price, lines, the, to, sass, the, the, further, board, a, details, machinists, the, companies, which, rivals, an, because, longer, oakes, percent, a, they, three, edward, it, currier, an, within, in, three, wrote, is, you, s., longer, institute, dentistry, pay, however, said, possible, to, rooms, hiding, eggs, approximate, financial, canada, the, so, workers, advancers, half, between, nasdaq]

Bigram Models
- Big problem with unigrams: P(the the the the) >> P(I like ice cream)!
- Condition on the last word:
      P(w_1 w_2 ... w_n) = ∏_i P(w_i | w_{i-1})
- As a graphical model: START, w_1, w_2, ..., w_{n-1}, STOP, each word conditioned on the previous one.
- Any better? Examples:
  [texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen]
  [outside, new, car, parking, lot, of, the, agreement, reached]
  [although, common, shares, rose, forty, six, point, four, hundred, dollars, from, thirty, seconds, at, the, greatest, play, disingenuous, to, be, reset, annually, the, buy, out, of, american, brands, vying, for, mr., womack, currently, sharedata, incorporated, believe, chemical, prices, undoubtedly, will, be, as, much, is, scheduled, to, conscientious, teaching]
  [this, would, be, a, record, november]

Is This Working?
- The game isn't to pound out fake sentences!
- What we really want to know is: will our model prefer good sentences to bad ones?
- Bad ≠ ungrammatical! Bad ≈ unlikely.
- Bad = sentences that our acoustic model really likes but aren't the correct answer.

Measuring Model Quality
- Word Error Rate (WER): the "right" measure:
  - Task-error driven
  - For speech recognition
  - For a specific recognizer!
      Correct answer:    Andy saw a part of the movie
      Recognizer output: And he saw apart of the movie
      WER = (insertions + deletions + substitutions) / true sentence size
      WER: 4/7 = 57%
- For general evaluation, we want a measure which references only good text, not mistake text.
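The WER example above can be reproduced with a standard word-level edit-distance (Levenshtein) dynamic program. A small sketch; the function name is mine, the example strings are the slide's:

def word_error_rate(reference, hypothesis):
    # WER = (insertions + deletions + substitutions) / reference length,
    # where the edit counts come from word-level Levenshtein distance.
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("Andy saw a part of the movie",
                      "And he saw apart of the movie"))  # 4 edits / 7 words ≈ 0.57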


Measuring Model Quality
- The Shannon Game: How well ...
