CS 294-5: Statistical Natural Language Processing
Speech Recognition
Lecture 20: 11/22/05
Slides directly from Dan Jurafsky, indirectly many others

Speech Recognition
• Overview: Demo
• Phonetics: Articulatory, Acoustic
• Acoustic Models: HMM Lexicons, Gaussian Mixtures
• Speech Synthesis
• Proposal:
    • Nov 23, 28: Recognition
    • Nov 30, Dec 7: Project Presentations
    • Dec 5: Synthesis

ASR for Dialog Systems
• Standard ASR maps sound to words
• But dialogue systems have specific needs:
    • Language models (what can be said) could depend on where we are in the dialogue
    • Could make use of the fact that we are talking to the same human over time
    • Barge-in (human will talk over the computer)
    • Confidence values: want to know if we misunderstood the human!

State-of-the-Art: Recognition
• Accuracy measured by word error rate (WER)
• Speaker independent:
    • Continuous digit strings, over the telephone: <0.3%
    • Continuous dictation: 3-5%
    • Continuous broadcast news: 5-7%
    • Continuous multispeaker conversations over the telephone: 50%+
    • Commercials: 80%+
• Speaker dependent:
    • 30 min training, good microphone, dictation: 2-3%

Databases
• Read speech (wideband, head-mounted mike)
    • Resource Management (RM): 1000 word vocabulary, used in the 80s
    • WSJ (Wall Street Journal): reporters read the paper out loud, with "verbalized punctuation" or "non-verbalized punctuation"
• Broadcast Speech (wideband)
    • Broadcast News ("Hub 4"): English, Mandarin, Arabic
• Conversational Speech (telephone)
    • Switchboard
    • CallHome
    • Fisher

[Figure labels: Nasal Cavity, Pharynx, Vocal Folds (within the Larynx), Trachea, Lungs]
Text copyright J. J.
Ohala, Sept 2001, from Sharon Rose slide
Sagittal section of the vocal tract (Techmer 1880)

Places of articulation
[Figure labels: labial, dental, alveolar, post-alveolar/palatal, velar, uvular, pharyngeal, laryngeal/glottal]
Figure thanks to Jennifer Venditti

Labial place
[Figure labels: bilabial, labiodental]
Figure thanks to Jennifer Venditti
• Bilabial: p, b, m
• Labiodental: f, v

Coronal place
[Figure labels: dental, alveolar, post-alveolar/palatal]
Figure thanks to Jennifer Venditti
• Dental: th/dh
• Alveolar: t/d/s/z/l
• Post: sh/zh/y

Dorsal Place
[Figure labels: velar, uvular, pharyngeal]
Figure thanks to Jennifer Venditti
• Velar: k/g/ng

Manner of Articulation
• Stop: complete closure of articulators, so no air escapes through the mouth
• Oral stop: soft palate (velum) is raised, so no air escapes through the nose. Air pressure builds up behind the closure and explodes when released
    • p, t, k, b, d, g
• Nasal stop: oral closure, but the soft palate is lowered, so air escapes through the nose
    • m, n, ng

Oral vs. Nasal Sounds
Thanks to Jong-bok Kim for this figure!

Vowels
[Figure: vocal tract shapes for IY, AA, UW]
Fig. from Eric Keller

Simple Periodic Waves (sine waves)
[Figure: sine wave; x-axis: Time (s), with one cycle marked]
• Characterized by:
    • period: T
    • amplitude: A
    • phase: φ
• Fundamental frequency in cycles per second, or Hz
    • F0 = 1/T

Simple periodic waves of sound
[Figure: sine wave; x-axis: Time (s)]
• Y axis: Amplitude = amount of air pressure at that point in time
    • Zero is normal air pressure, negative is rarefaction
• X axis: time.
Frequency = number of cycles per second.
    • Frequency = 1/Period
    • 20 cycles in .02 seconds = 1000 cycles/second = 1000 Hz

Complex waves: Adding a 100 Hz and 1000 Hz wave together
[Figure: summed waveform; x-axis: Time (s)]

Spectrum
[Figure: x-axis: Frequency in Hz, y-axis: Amplitude]
• Frequency components (100 and 1000 Hz) on x-axis

Spectrum of one instant in an actual soundwave: many components across frequency range
[Figure: spectrum; x-axis: Frequency (Hz)]

Waveforms for speech
• Waveform of the vowel [iy]
• Frequency: repetitions/second of a wave
    • Above vowel has 28 reps in .11 secs
    • So freq is 28/.11 = 255 Hz
    • This is the rate at which the vocal folds vibrate, hence voicing
• Amplitude: y axis: amount of air pressure at that point in time
    • Zero is normal air pressure, negative is rarefaction

She just had a baby
• What can we learn from a wavefile?
    • Vowels are voiced, long, loud
    • Length in time = length in space in waveform picture
    • Voicing: regular peaks in amplitude
    • When stops are closed: no peaks: silence
    • Peaks = voicing: from second .46 to .58 (vowel [iy]), from second .65 to .74 (vowel [ax]), and so on
    • Silence of stop closure (1.06 to 1.08 for first [b], or 1.26 to 1.28 for second [b])
    • Fricatives like [sh]: intense irregular pattern; see .33 to .46

Examples from Ladefoged
[Figure: waveforms of "bad", "pad", "spat"]

Part of [ae] waveform from "had"
• Note complex wave repeating nine times in figure
• Plus a smaller wave which repeats 4 times for every large pattern
• Large wave has frequency of 250 Hz (9 times in .036 seconds)
• Small wave roughly 4 times this, or roughly 1000 Hz
• Two little tiny waves on top of each peak of the 1000 Hz wave

Back to Spectra
• Spectrum represents these freq components
• Computed by Fourier transform, an algorithm which separates out each frequency component of a wave
• x-axis shows frequency, y-axis shows magnitude (in decibels, a log measure of amplitude)
• Peaks at 930 Hz, 1860 Hz, and 3020 Hz

Why these Peaks?
• Articulatory facts:
    • The vocal cord vibrations create harmonics
    • The mouth is an amplifier
    • Depending on shape of mouth, some harmonics are amplified more than others

Deriving schwa: how shape of mouth (filter function) creates peaks!
• Reminder of basic facts about sound waves
    • f = c/λ
    • c = speed of sound (approx 35,000 cm/sec)
    • A sound with λ = 10 meters: f = 35 Hz (35,000/1000)
    • A sound with λ = 2 centimeters: f = 17,500 Hz (35,000/2)

Resonances of the vocal tract
• The human vocal tract as an open tube
[Figure: tube with closed end and open end, length 17.5 cm]
• Air in a tube of a given length will tend to vibrate at the resonance frequency of the tube
• Constraint: pressure differential should be maximal at the (closed) glottal end and minimal at the (open) lip end
Figure from W. Barry Speech Science slides

From Sundberg

Computing the 3 Formants of Schwa
• Let the length of the tube be L
    • F1 = c/λ1 = c/(4L) = 35,000/(4 × 17.5) = 500 Hz
    • F2 = c/λ2 = c/(4L/3) = 3c/(4L) = 3 × 35,000/(4 × 17.5) = 1500 Hz
    • F3 = c/λ3 = c/(4L/5) = 5c/(4L) = 5 × 35,000/(4 × 17.5) = 2500 Hz
• So we expect a neutral vowel to have 3 resonances at 500, 1500, and 2500 Hz
• These vowel resonances are called formants

From Mark Liberman's Web site

Seeing formants: the spectrogram

American English Vowel Space
[Figure: vowel chart with axes FRONT-BACK and HIGH-LOW; vowels shown: ey, ow, aw, oy, ay, iy, ih, eh, ae, aa, ao, uw, uh, ah, ax, ix, ux]
Figure from Jennifer Venditti

Dialect Issues
• Speech varies from dialect to dialect (examples are American vs. British English)
    • Syntactic ("I could" vs. "I could do")
    • Lexical ("elevator" vs. "lift")
    • Phonological (the pronunciation of "butter")
    • Phonetic
• Mismatch between
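
Worked check of the frequency arithmetic
The numbers on the waveform and schwa slides above (cycles counted over a time span, f = c/λ, and the resonances of a 17.5 cm tube closed at one end) can be reproduced with a short Python sketch. The function names below are illustrative and do not come from the lecture; the speed of sound is the approximate 35,000 cm/sec used on the slides, and the tube model assumes the idealized uniform tube described there.

```python
# Illustrative helpers for the slide arithmetic; names are assumptions,
# not part of the lecture material.

SPEED_OF_SOUND_CM_PER_S = 35_000.0  # approximate value quoted on the slides


def frequency_from_cycles(cycles, seconds):
    """Frequency in Hz when `cycles` repetitions are counted in `seconds`."""
    return cycles / seconds


def frequency_from_wavelength(wavelength_cm, c=SPEED_OF_SOUND_CM_PER_S):
    """f = c / wavelength, with the wavelength given in centimeters."""
    return c / wavelength_cm


def tube_resonances(length_cm, count=3, c=SPEED_OF_SOUND_CM_PER_S):
    """Resonances of a uniform tube closed at one end and open at the other.

    The n-th resonance has wavelength 4L / (2n - 1), so
    F_n = (2n - 1) * c / (4L).
    """
    return [(2 * n - 1) * c / (4.0 * length_cm) for n in range(1, count + 1)]


if __name__ == "__main__":
    # Sine wave slide: 20 cycles in .02 seconds.
    print(frequency_from_cycles(20, 0.02))
    # Vowel [iy]: 28 repetitions in .11 seconds.
    print(round(frequency_from_cycles(28, 0.11)))
    # Wavelengths of 10 meters (1000 cm) and 2 cm.
    print(frequency_from_wavelength(1000.0), frequency_from_wavelength(2.0))
    # Schwa: 17.5 cm vocal tract.
    print(tube_resonances(17.5))
```

Changing `length_cm` shows how a shorter or longer tube shifts all three resonances together, since each one is an odd multiple of c/(4L).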