CS 188: Artificial Intelligence
Fall 2006
Lecture 21: Speech / Viterbi
11/09/2006
Dan Klein – UC Berkeley

Announcements
- Optional midterm: Tuesday 11/21 in class; review session 11/19, 7–9pm, in 306 Soda
- Projects: 3.2 due 11/9, 3.3 due 11/15, 3.4 due 11/27
- Contest: Pacman contest details on the web site this week; entries due 12/3

Hidden Markov Models
- Hidden Markov models (HMMs): an underlying Markov chain over states X, and you observe outputs (effects) E at each time step
- As a Bayes' net: a chain X1 -> X2 -> X3 -> X4 -> X5, with an observed effect Ei attached to each Xi
- There are several questions you can answer for HMMs
- Last time: filtering to track a belief about the current X given the evidence

Speech Recognition
- [demos]

Speech in an Hour
- Speech input is an acoustic wave form: "s p ee ch l a b"
- The "l" to "a" transition
- Graphs from Simon Arnfield's web tutorial on speech, Sheffield: http://www.psyc.leeds.ac.uk/research/cogn/speech/tutorial/

She just had a baby
- What can we learn from a wavefile?
- Vowels are voiced, long, and loud; length in time = length in space in the waveform picture
- Voicing: regular peaks in amplitude; when stops are closed: no peaks, silence
- Peaks = voicing: .46 to .58 (the vowel [i]), a second vowel from .65 to .74, and so on
- Silence of stop closure: 1.06 to 1.08 for the first [b], 1.26 to 1.28 for the second [b]
- Fricatives like the [ʃ] of "she" show an intense irregular pattern; see .33 to .46
- Frequency gives pitch; amplitude gives volume
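The frequency and amplitude facts above can be made concrete: the sketch below sums a 100 Hz and a 1000 Hz sine wave, as in the spectral-analysis slides, and recovers both components with a naive Fourier transform. This is a minimal illustration in plain Python; the sample rate and duration are illustrative choices, not values from the lecture.

```python
import math

def dft_magnitude(signal, k):
    """Magnitude of the k-th DFT bin of a real signal (naive O(N) per bin)."""
    n = len(signal)
    re = sum(signal[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
    im = -sum(signal[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
    return math.hypot(re, im)

sample_rate = 8000    # the ~8 kHz "phone" rate mentioned in the slides
duration = 0.1        # seconds -> 800 samples, so each DFT bin is 10 Hz wide
n = int(sample_rate * duration)

# Sum of a 100 Hz and a 1000 Hz sine wave, sampled at sample_rate.
signal = [math.sin(2 * math.pi * 100 * t / sample_rate) +
          math.sin(2 * math.pi * 1000 * t / sample_rate)
          for t in range(n)]

# Scan the bins up to Nyquist; the two largest peaks should sit at the
# two component frequencies.
mags = {k * sample_rate // n: dft_magnitude(signal, k) for k in range(n // 2)}
peaks = sorted(sorted(mags, key=mags.get, reverse=True)[:2])
print(peaks)   # -> [100, 1000]
```

A real recognizer would use a fast Fourier transform rather than this O(N^2) scan, but the recovered spectrum is the same.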
Digitizing Speech
- Sampling at ~8 kHz for phone speech, ~16 kHz for microphone speech (kHz = 1000 cycles/sec)
- The Fourier transform of the wave is displayed as a spectrogram: darkness indicates the energy at each frequency
- [Figure: spectrogram of "s p ee ch l a b", frequency vs. time]

Spectral Analysis

Adding 100 Hz + 1000 Hz Waves
- [Figure: the summed waveform from 0 to 0.05 s, amplitude between about –0.9654 and 0.990]

Spectrum
- [Figure: spectrum showing the frequency components (100 and 1000 Hz) as peaks on the x-axis, frequency in Hz vs. amplitude]

Back to Spectra
- The spectrum represents these frequency components
- It is computed by the Fourier transform, an algorithm which separates out each frequency component of a wave
- The x-axis shows frequency; the y-axis shows magnitude (in decibels, a log measure of amplitude)
- Peaks at 930 Hz, 1860 Hz, and 3020 Hz

Vowel Formants

Resonances of the vocal tract
- The human vocal tract acts as an open tube: closed at the glottal end, open at the lip end, length about 17.5 cm
- Air in a tube of a given length will tend to vibrate at the resonance frequency of the tube
- Constraint: the pressure differential should be maximal at the (closed) glottal end and minimal at the (open) lip end
- Figure from W. Barry's Speech Science slides; from Mark Liberman's web site

Why these Peaks?
- Articulatory facts: vocal cord vibrations create harmonics, and the mouth is a selective amplifier
- Depending on the shape of the mouth, some harmonics are amplified more than others
- [Figure: the vowel [i] sung at successively higher pitches, panels 1–7; figures from Ratree Wayland's slides]

How to read spectrograms
- bab: closure of the lips lowers all formants, so there is a rapid increase in all formants at the beginning of "bab"
- dad: the first formant increases, but F2 and F3 fall slightly
- gag: F2 and F3 come together; this is a characteristic of velars.
- Formant transitions take longer in velars than in alveolars or labials
- From Ladefoged, "A Course in Phonetics"

Acoustic Feature Sequence
- Time slices are translated into acoustic feature vectors (~39 real numbers per slice)
- [Figure: spectrogram time slices as observations e12, e13, e14, e15, e16, ...]
- These are the observations; now we need the hidden states X

State Space
- P(E|X) encodes which acoustic vectors are appropriate for each phoneme (each kind of sound)
- P(X|X') encodes how sounds can be strung together
- We will have one state for each sound in each word
- From some state x, we can only: stay in the same state (e.g. speaking slowly), move to the next position in the word, or, at the end of the word, move to the start of the next word
- We build a little state graph for each word and chain them together to form our state space X

HMMs for Speech

ASR Lexicon: Markov Models

Markov Process with Bigrams
- Figure from Huang et al., page 618

Decoding
- While there are some practical issues, finding the words given the acoustics is an HMM inference problem
- We want to know which state sequence x_{1:T} is most likely given the evidence e_{1:T}:
  x*_{1:T} = argmax_{x_{1:T}} P(x_{1:T} | e_{1:T})

Viterbi Algorithm
- Question: what is the most likely state sequence given the observations?
- Slow answer: enumerate all possibilities
- Better answer: cached incremental version

Viterbi with 2 Words + Unif. LM
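The "cached incremental version" described under Viterbi Algorithm above can be sketched as follows: for each time step, keep only the score of the best sequence ending in each state, plus backpointers. The two-sound word model below (a "b" state followed by an "iy" state, with self-loops for speaking slowly, as in the State Space slide) and all of its probabilities are made-up illustrations, not values from the lecture.

```python
def viterbi(states, start, transition, emission, observations):
    """Most likely state sequence x_{1:T} given evidence e_{1:T}."""
    # best[s] = probability of the best path ending in state s so far
    best = {s: start[s] * emission[s][observations[0]] for s in states}
    backpointers = []
    for e in observations[1:]:
        prev = best
        best, ptr = {}, {}
        for s2 in states:
            # Cache only the best predecessor, instead of enumerating
            # every full state sequence.
            s1 = max(states, key=lambda s: prev[s] * transition[s][s2])
            ptr[s2] = s1
            best[s2] = prev[s1] * transition[s1][s2] * emission[s2][e]
        backpointers.append(ptr)
    # Recover the sequence by walking backpointers from the best final state.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Hypothetical word model: "b" then "iy", each state loops or advances.
states = ['b', 'iy']
start = {'b': 1.0, 'iy': 0.0}
transition = {'b': {'b': 0.5, 'iy': 0.5}, 'iy': {'b': 0.0, 'iy': 1.0}}
emission = {'b': {'low': 0.8, 'high': 0.2}, 'iy': {'low': 0.3, 'high': 0.7}}
path = viterbi(states, start, transition, emission, ['low', 'low', 'high'])
print(path)   # -> ['b', 'b', 'iy']
```

Real systems work in log space to avoid underflow over long utterances; the max-and-backpointer structure is unchanged.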
- Figure from Huang et al., page 612

Next Class
- Final part of the course: machine learning
- We'll start talking about how to learn model parameters (like probabilities) from data
- One of the most heavily used technologies in all of AI

The Speech Recognition Problem
- We want to predict a sentence s given an acoustic sequence A:
  s* = argmax_s P(s | A)
- The noisy channel approach: build a generative model of production (encoding):
  P(A, s) = P(s) P(A | s)
- To decode, we use Bayes' rule to write:
  s* = argmax_s P(s | A) = argmax_s P(s) P(A | s) / P(A) = argmax_s P(s) P(A | s)
- Now we have to find a sentence maximizing this product. Why is this progress?

Examples from Ladefoged
- bad, pad, spat

Simple Periodic Sound Waves
- [Figure: a simple periodic waveform from 0 to 0.02 s, amplitude between –0.99 and 0.99]
- y-axis: amplitude = the amount of air pressure at that point in time; zero is normal air pressure, negative is rarefaction
- x-axis: time
- Frequency = number of cycles per second = 1/period
- 20 cycles in .02 seconds = 1000 cycles/second = 1000 Hz

Deriving Schwa
- Reminder of basic facts about sound waves: f = c/λ, where c is the speed of sound (approx. 35,000 cm/sec)
- A sound with λ = 10 meters: f = 35 Hz (35,000/1000)
- A sound with λ = 2 centimeters: f = 17,500 Hz (35,000/2)
- From Sundberg

Computing the 3 Formants of Schwa
- Let the length of the tube be L
- F1 = c/λ1 = c/(4L) = 35,000/(4 × 17.5) = 500 Hz
- F2 = c/λ2 = c/(4L/3) = 3c/(4L) = 1500 Hz
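The schwa formant arithmetic above can be checked in a couple of lines. A tube closed at one end and open at the other resonates at wavelengths 4L, 4L/3, 4L/5, ..., i.e. at odd multiples of c/(4L); the F3 value below follows that standard pattern, since the slide text is cut off before it.

```python
c = 35_000   # speed of sound, cm/sec (value from the slide)
L = 17.5     # vocal tract length, cm (value from the slide)

# Resonances of a closed-open tube: Fn = (2n - 1) * c / (4L)
formants = [(2 * n - 1) * c / (4 * L) for n in (1, 2, 3)]
print(formants)   # -> [500.0, 1500.0, 2500.0]
```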
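The noisy-channel decision rule from The Speech Recognition Problem slide, s* = argmax_s P(s) P(A|s), can be sketched with a toy decoder. The two-word lexicon, prior, and acoustic likelihoods below are made-up numbers for illustration, not from the lecture.

```python
prior = {'bad': 0.6, 'pad': 0.4}        # language model P(s)
likelihood = {                          # acoustic model P(A|s), for two
    'bad': {'A1': 0.2, 'A2': 0.7},      # hypothetical acoustic inputs
    'pad': {'A1': 0.5, 'A2': 0.3},
}

def decode(acoustics):
    """Pick the sentence maximizing P(s) * P(A|s).

    P(A) is the same for every candidate s, so dividing by it cannot
    change the argmax and is dropped -- this is why Bayes' rule is progress.
    """
    return max(prior, key=lambda s: prior[s] * likelihood[s][acoustics])

print(decode('A1'))   # -> 'pad'  (0.4 * 0.5 = 0.20 beats 0.6 * 0.2 = 0.12)
print(decode('A2'))   # -> 'bad'  (0.6 * 0.7 = 0.42 beats 0.4 * 0.3 = 0.12)
```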