Berkeley COMPSCI 294 - Statistical Natural Language Processing

This preview shows page 1-2-3-4 out of 11 pages.



CS 294-5: Statistical Natural Language Processing
Speech Recognition
Lecture 20: 11/22/05
Slides directly from Dan Jurafsky, indirectly many others

Speech Recognition Overview
• Demo
• Phonetics
  • Articulatory
  • Acoustic
• Acoustic Models
  • HMM
  • Lexicons
  • Gaussian Mixtures
• Speech Synthesis
• Proposal:
  • Nov 23, 28: Recognition
  • Nov 30, Dec 7: Project Presentations
  • Dec 5: Synthesis

ASR for Dialog Systems
• Standard ASR maps sound to words
• But specific needs for dialogue systems
  • Language models (what can be said) could depend on where we are in the dialogue
  • Could make use of the fact that we are talking to the same human over time
  • Barge-in (human will talk over the computer)
  • Confidence values: want to know if we misunderstood the human!

State-of-the-Art: Recognition
• Accuracy measured by word error rate (WER)
• Speaker independent:
  • Continuous digit strings, over the telephone: <0.3%
  • Continuous dictation: 3-5%
  • Continuous broadcast news: 5-7%
  • Continuous multispeaker conversations over the telephone: 50%+
  • Commercials: 80%+
• Speaker dependent:
  • 30 min training, good microphone, dictation: 2-3%

Databases
• Read speech (wideband, head-mounted mike)
  • Resource Management (RM)
    • 1000 word vocabulary, used in the 80s
  • WSJ (Wall Street Journal)
    • Reporters read the paper out loud
    • “Verbalized punctuation” or “non-verbalized punctuation”
• Broadcast Speech (wideband)
  • Broadcast News (“Hub 4”)
    • English, Mandarin, Arabic
• Conversational Speech (telephone)
  • Switchboard
  • CallHome
  • Fisher

Figure: vocal tract diagram with labels Nasal Cavity, Pharynx, Vocal Folds (within the Larynx), Trachea, Lungs
Text copyright J. J.
Ohala, Sept 2001, from Sharon Rose slide

Figure: Sagittal section of the vocal tract (Techmer 1880)

Places of articulation
• labial
• dental
• alveolar
• post-alveolar/palatal
• velar
• uvular
• pharyngeal
• laryngeal/glottal
Figure thanks to Jennifer Venditti

Labial place
• Bilabial: p, b, m
• Labiodental: f, v
Figure thanks to Jennifer Venditti

Coronal place
• Dental: th/dh
• Alveolar: t/d/s/z/l
• Post-alveolar/palatal: sh/zh/y
Figure thanks to Jennifer Venditti

Dorsal place
• Velar: k/g/ng
• Uvular
• Pharyngeal
Figure thanks to Jennifer Venditti

Manner of Articulation
• Stop: complete closure of articulators, so no air escapes through mouth
  • Oral stop: palate is raised, no air escapes through nose. Air pressure builds up behind closure, explodes when released
    • p, t, k, b, d, g
  • Nasal stop: oral closure, but palate is lowered, air escapes through nose
    • m, n, ng

Oral vs. Nasal Sounds
Thanks to Jong-bok Kim for this figure!

Vowels
Figure: IY, AA, UW (Fig. from Eric Keller)

Simple Periodic Waves (sine waves)
Figure: sine wave with one cycle marked; x-axis Time (s) from 0 to 0.02, y-axis from -0.99 to 0.99
• Characterized by:
  • period: T
  • amplitude: A
  • phase: φ
• Fundamental frequency in cycles per second, or Hz
  • F0 = 1/T

Simple periodic waves of sound
Figure: sine wave; x-axis Time (s) from 0 to 0.02, y-axis from -0.99 to 0.99
• Y axis: Amplitude = amount of air pressure at that point in time
  • Zero is normal air pressure, negative is rarefaction
• X axis: time.
• Frequency = number of cycles per second
• Frequency = 1/Period
• 20 cycles in .02 seconds = 1000 cycles/second = 1000 Hz

Complex waves: Adding a 100 Hz and 1000 Hz wave together
Figure: summed waveform; x-axis Time (s) from 0 to 0.05, y-axis from -0.9654 to 0.99

Spectrum
Figure: amplitude against frequency in Hz, with the frequency components (100 and 1000 Hz) on the x-axis

Spectrum of one instant in an actual soundwave: many components across frequency range
Figure: x-axis Frequency (Hz) from 0 to 5000, y-axis ticks at 0, 20, 40

Waveforms for speech
• Waveform of the vowel [iy]
• Frequency: repetitions/second of a wave
  • Above vowel has 28 reps in .11 secs
  • So freq is 28/.11 ≈ 255 Hz
  • This is the speed at which the vocal folds move, hence voicing
• Amplitude: y axis: amount of air pressure at that point in time
  • Zero is normal air pressure, negative is rarefaction

She just had a baby
• What can we learn from a wavefile?
  • Vowels are voiced, long, loud
  • Length in time = length in space in waveform picture
  • Voicing: regular peaks in amplitude
  • When stops are closed: no peaks: silence
  • Peaks = voicing: .46 to .58 (vowel [iy]), from second .65 to .74 (vowel [ax]), and so on
  • Silence of stop closure (1.06 to 1.08 for first [b], or 1.26 to 1.28 for second [b])
  • Fricatives like [sh]: intense irregular pattern; see .33 to .46

Examples from Ladefoged
Figure: waveforms for bad, pad, spat

Part of [ae] waveform from “had”
• Note complex wave repeating nine times in figure
• Plus smaller waves which repeat 4 times for every large pattern
• Large wave has frequency of 250 Hz (9 times in .036 seconds)
• Small wave roughly 4 times this, or roughly 1000 Hz
• Two little tiny waves on top of peak of 1000 Hz waves

Back to Spectra
• Spectrum represents these freq components
• Computed by Fourier transform, an algorithm which separates out each frequency component of a wave
• x-axis shows frequency, y-axis shows magnitude (in decibels, a log measure of amplitude)
• Peaks at 930 Hz, 1860 Hz, and 3020 Hz

Why these Peaks?
• Articulatory facts:
  • The vocal cord vibrations create harmonics
  • The mouth is an amplifier
  • Depending on shape of mouth, some harmonics are amplified more than others

Deriving schwa: how shape of mouth (filter function) creates peaks!
• Reminder of basic facts about sound waves
  • f = c/λ
  • c = speed of sound (approx 35,000 cm/sec)
  • A sound with λ = 10 meters: f = 35 Hz (35,000/1000)
  • A sound with λ = 2 centimeters: f = 17,500 Hz (35,000/2)

Resonances of the vocal tract
• The human vocal tract as an open tube
• Air in a tube of a given length will tend to vibrate at resonance frequency of tube
• Constraint: Pressure differential should be maximal at (closed) glottal end and minimal at (open) lip end
Figure: tube with a closed end and an open end, length 17.5 cm
Figure from W. Barry Speech Science slides
From Sundberg

Computing the 3 Formants of Schwa
• Let the length of the tube be L
  • F1 = c/λ1 = c/(4L) = 35,000/(4 × 17.5) = 500 Hz
  • F2 = c/λ2 = c/(4L/3) = 3c/(4L) = 3 × 35,000/(4 × 17.5) = 1500 Hz
  • F3 = c/λ3 = c/(4L/5) = 5c/(4L) = 5 × 35,000/(4 × 17.5) = 2500 Hz
• So we expect a neutral vowel to have 3 resonances at 500, 1500, and 2500 Hz
• These vowel resonances are called formants
From Mark Liberman’s Web site

Seeing formants: the spectrogram

American English Vowel Space
Figure: vowel chart with axes FRONT to BACK and HIGH to LOW, showing ey, ow, aw, oy, ay, iy, ih, eh, ae, aa, ao, uw, uh, ah, ax, ix, ux
Figure from Jennifer Venditti

Dialect Issues
• Speech varies from dialect to dialect (examples are American vs. British English)
  • Syntactic (“I could” vs. “I could do”)
  • Lexical (“elevator” vs. “lift”)
  • Phonological (butter: [I©5] vs. [I©(])
  • Phonetic
• Mismatch between
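
The schwa formant arithmetic from the "Computing the 3 Formants of Schwa" slide can be checked with a short sketch. It assumes the slide's simplification of the vocal tract as a uniform tube closed at the glottis and open at the lips, whose resonances sit at odd multiples of c/(4L). The function and constant names are illustrative, not from the lecture.

```python
# Quarter-wavelength resonances of a tube closed at one end (glottis)
# and open at the other (lips): F_n = (2n - 1) * c / (4 * L).
# All names here are hypothetical helpers for illustration.

SPEED_OF_SOUND_CM_PER_S = 35_000.0  # approximate value used on the slides
TRACT_LENGTH_CM = 17.5              # tube length from the slide's figure


def frequency_from_wavelength(wavelength_cm, c=SPEED_OF_SOUND_CM_PER_S):
    """f = c / lambda, with lambda in cm and c in cm/s."""
    return c / wavelength_cm


def tube_formants(length_cm, count=3, c=SPEED_OF_SOUND_CM_PER_S):
    """Return the first `count` resonances (Hz) of a closed-open tube."""
    return [(2 * n - 1) * c / (4 * length_cm) for n in range(1, count + 1)]


schwa_formants = tube_formants(TRACT_LENGTH_CM)
print(schwa_formants)                     # [500.0, 1500.0, 2500.0]
print(frequency_from_wavelength(1000.0))  # 35.0 (lambda = 10 m = 1000 cm)
print(frequency_from_wavelength(2.0))     # 17500.0
```

Under this model a shorter tube gives proportionally higher resonances, since every resonance scales with 1/L.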

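The waveform and spectrum slides (counting cycles to get a frequency, then adding a 100 Hz and a 1000 Hz wave and reading the two components off the spectrum) can be sketched as follows. This is a minimal illustration using the direct discrete Fourier transform definition rather than a fast Fourier transform, so it needs only the standard library. The 8000 Hz sample rate and the helper names `frequency_from_cycles` and `dft_magnitudes` are assumptions made for the example.

```python
import cmath
import math


def frequency_from_cycles(cycles, seconds):
    """Frequency in Hz as repetitions per second."""
    return cycles / seconds


def dft_magnitudes(samples):
    """Return |X[k]| for k = 0 .. N//2 using the direct DFT definition."""
    n = len(samples)
    magnitudes = []
    for k in range(n // 2 + 1):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(samples)
        )
        magnitudes.append(abs(acc))
    return magnitudes


SAMPLE_RATE = 8000  # Hz; an assumed value, not from the slides
DURATION = 0.05     # seconds, matching the slide's time axis
n_samples = round(SAMPLE_RATE * DURATION)

# Complex wave: a 100 Hz sine plus a 1000 Hz sine.
wave = [
    math.sin(2 * math.pi * 100 * t / SAMPLE_RATE)
    + math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE)
    for t in range(n_samples)
]

magnitudes = dft_magnitudes(wave)

# The two strongest bins, converted from bin index to Hz.
top_bins = sorted(range(len(magnitudes)), key=lambda k: magnitudes[k], reverse=True)[:2]
peak_freqs = sorted(k * SAMPLE_RATE / n_samples for k in top_bins)
print(peak_freqs)  # [100.0, 1000.0]

# Cycle counting from the slides, rounded to the nearest Hz.
print(round(frequency_from_cycles(20, 0.02)))   # 1000
print(round(frequency_from_cycles(28, 0.11)))   # 255
print(round(frequency_from_cycles(9, 0.036)))   # 250
```

The duration and sample rate are chosen so that both components fall exactly on DFT bins (bin spacing is 20 Hz). With real speech the components would spread across neighbouring bins, which is why spectra of actual sound show broad peaks rather than two clean lines.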

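The "State-of-the-Art" slide reports accuracy as word error rate (WER) without defining it. As commonly defined, WER is the minimum number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The sketch below computes it with a standard edit-distance table. The function name and the example hypotheses are invented for illustration; only the reference sentence comes from the slides.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "she just had a baby"
print(word_error_rate(reference, "she just had a baby"))      # 0.0
print(word_error_rate(reference, "she just had baby"))        # 0.2
print(word_error_rate(reference, "she just has a baby boy"))  # 0.4
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.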