Speech Processing 15-492/18-492
Speech Recognition
Template matching

Speech Recognition by Templates
• A little history …
• Matching Templates
• DTW (Dynamic Time Warping)
• Beyond template matching

Radio Rex (1922)
• Toys always lead technology …
• Call “Rex” and he comes out of his kennel
• (Crystalradio.com and Rhys Jones)

Toy ASR “Tricks”
• Radio Rex
  – Recognizes vowel formants in “EH”
• Voice activated toy train
  – Multilingual stop/go
  – hashire/tomate
• Toy “pets” don’t need perfect ASR

Template Matching
• Record templates from user
  – Store in library
• Record ASR example
  – Compare against each library template
  – Select closest example
• For example …
  – On a voice dialing system

Voice Dialing System
• Library
  – Mom
  – Dad
  – Bob
  – Mario’s Pizza
  – Let’s Go Bus Information System

Matching in Time Domain
• Duration
  – Will discriminate some examples
  – But Mom, Bob and Dad will be confused
• What about spectral properties?

Matching in Frequency Domain
[Figure: spectral displays of “Mom” and “Bob”]

Different deliveries
• We change durations
  – Two utterances are never the same
• When it fails we change our delivery
  – Become more articulate, “clearer”

Dynamic Time Warping
[Figure: alignment grid with Template on one axis and Sample Speech on the other]

DTW algorithm
• For each square (i, j), the accumulated score is
    Score(i, j) = Dist(template[i], sample[j]) +
                  smallest_of(Score(i-1, j),
                              Score(i, j-1),
                              Score(i-1, j-1))
• Remember which choice you took (count path)
[Figure: grid cell (i, j) and its neighbors at i-1 and j-1, Template vs. Sample]

Multiple Templates
• Compare against each
• Find closest
• Need to normalize scores
  – (divide by length of matches)

Matching Templates
[Figure: Sample compared against Template Library: Word0, Word1, Word2, …]

    BestScore = infinity
    For Word in Templates
        Score = dtw(Template[Word], Sample);
        if (Score < BestScore)
            BestScore = Score;
            BestWord = Word;
    DoAction(Action[BestWord])

DTW issues
• What happens with no matches?
  – Need to deal with “none of the above”
• What happens with more templates?
  – Harder to choose between them
  – Once the variance within a template is greater than the differences between templates
• Choose templates that are very different

DTW/Template Applications
• Voice dialer
• Simple command and control
• Speaker ID

Speaker ID
[Figure: Sample compared against Template Library: Speaker0, Speaker1, Speaker2, …]

    BestScore = infinity
    For Speaker in Templates
        Score = dtw(Template[Speaker], Sample);
        if (Score < BestScore)
            BestScore = Score;
            BestSpeaker = Speaker;

DTW
• Advantages
  – Works well for small number of templates (<20)
  – Language independent
  – Speaker specific
  – Easy to train (end user controls it)
• Disadvantages
  – Limited number of templates
  – Speaker specific
  – Need actual training examples

More reliable matching
• Distance metric
  – Euclidean
    • But some distances are bigger than others
  – Distances between silence frames are small
  – Distances between fricative frames are much larger
    • A longer fricative might give a large score
    • A longer vowel might give a smaller score

More reliable matching
• Having multiple template examples
  – Individual matches or
  – Average them together
    • DTW align all of the examples
    • Collect statistics as a Gaussian
      – Mean and standard deviation for each coefficient

More reliable distances
• Instead of Euclidean distance
  – Doesn’t care about the standard deviation
• Use Mahalanobis distance
  – Cares about means and standard deviations

Extending Template matching
• String word templates together
  – Need to find word segmentation
• But there are many words …
[Figure: Word0, Word1, Word2, …]

Extending template model
• String phoneme templates together
  – A template model for each phoneme
[Figure: Sample “k ae t” matched against Phoneme Templates: Phone0, Phone1, Phone2, …]

Summary
• Speech Recognition by Templates
  – Good for simple small vocabulary tasks
• Dynamic Time Warping (DTW)
  – Can match examples of different durations
  – Averaging over multiple models
• Distance metrics
  – Euclidean
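
The DTW recurrence and the template-library loop from the slides can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the course's implementation: the names `euclidean`, `dtw`, `recognize` and the `reject_above` threshold are hypothetical, frames are toy one-dimensional feature vectors rather than real spectral coefficients, and "divide by length of matches" is interpreted here as dividing the accumulated distance by the number of squares on the best path.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(template, sample, dist=euclidean):
    """Return the path-length-normalized DTW score of two frame sequences."""
    n, m = len(template), len(sample)
    if n == 0 or m == 0:
        raise ValueError("template and sample must be non-empty")
    inf = float("inf")
    # cost[i][j]: best accumulated distance aligning the first i template
    # frames with the first j sample frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    # steps[i][j]: number of squares on that best path ("count path").
    steps = [[0] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Pick the cheapest of the three predecessor squares.
            prev_cost, prev_steps = min(
                (cost[i - 1][j], steps[i - 1][j]),
                (cost[i][j - 1], steps[i][j - 1]),
                (cost[i - 1][j - 1], steps[i - 1][j - 1]),
            )
            cost[i][j] = dist(template[i - 1], sample[j - 1]) + prev_cost
            steps[i][j] = prev_steps + 1
    return cost[n][m] / steps[n][m]


def recognize(library, sample, reject_above=None):
    """Return (best_word, best_score); best_word is None if rejected."""
    best_word, best_score = None, float("inf")
    for word, template in library.items():
        score = dtw(template, sample)
        if score < best_score:
            best_word, best_score = word, score
    # Hypothetical "none of the above" handling: reject weak matches.
    if reject_above is not None and best_score > reject_above:
        return None, best_score
    return best_word, best_score


# Toy library: each template is a list of 1-D feature frames (made-up values).
library = {
    "mom": [[1.0], [1.0], [5.0], [1.0]],
    "bob": [[3.0], [6.0], [6.0], [3.0]],
}
# A slower delivery of the "mom" pattern: same shape, different duration.
sample = [[1.0], [1.0], [1.0], [5.0], [5.0], [1.0]]
print(recognize(library, sample))
```

Because the warp can stretch the four-frame "mom" template over the six-frame sample, the longer delivery still matches it exactly, while a fixed frame-by-frame comparison would not. The `reject_above` threshold is one simple way to handle the "none of the above" case; a suitable value would have to be tuned on real data.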