Speech Processing 15-492/18-492

Speech Recognition
  - Acoustic modeling
  - Pronunciation dictionary

Acoustic Modeling
  - Speech and Signal Variability
  - Measuring Error
  - Pronunciation lexicons

Variability in Speech Signal
  - “Mr Wright should write to Ms Wright right away about his Ford or four door Honda.”
  - Homophones: same pronunciation
      - “wright” “right” “write”: / r ay t /
      - “ford or” “four door”: / f ao r d ao r /

Style Variability
  - Different articulation in different situations
  - Clear vs Conversational
  - Whisper vs shouting
  - Talking to machine, talking to others
  - Frustrated speech

Speaker variability
  - Gender, age, dialect, health
  - Speaker dependent systems
  - Speaker independent systems
  - Speaker adaptive systems
      - Enrolment stage (acoustics and language)

Environment Variability
  - Different background noises
      - Office vs Outside
  - Different applications, different environments
      - Desktop dictation, to Warehouse pick
  - Single speaker vs Multispeaker
  - Background music

Channel Variability
  - Telephone vs Desktop
      - 8 kHz vs 16 kHz
  - PDA vs Desktop
      - Close-talking vs far-field
  - Cell Phone vs Landline

Measuring Speech Recognition Error
  - Word Error Rate (WER)
      - Substitutions: word is replaced
      - Deletions: word is missed out
      - Insertions: word is added
  - WER = 100% x (Subs + Dels + Ins) / (words in correct sentence)

Word Error Rate
  - WER requires:
      - Transcription (the correct word string)
      - Alignment between ASR output and Transcript
      - Not just left to right matching
  - Sometimes Accuracy is given
      - 100 - WER
      - NOT number of words correct

Word Error Rate
  - Can get > 100%
      - But something is very wrong
  - Outputting “the” only, ignoring the speech
      - Sometimes gives WER < 100%
  - All words are treated equal
      - “This specimen” vs “The specimen”
      - “Is absent” vs “Is present”

Signal Acquisition
  - High signal quality
  - Lower sample rate will increase WER
      - 8 kHz: baseline
      - 16 kHz: -10%

End-Point Detection
  - Long silence will likely increase WER
      - It will recognize phantom words
  - Need to find the speech in the signal
      - VAD (Voice Activity Detection)
      - Find beginning and end of speech
  - Typically do continuous recognition
      - Recognized while listening
      - But need end point (have to wait)

Feature normalization
  - Sometimes do normalization
      - Remove mean from MFCCs
      - Can make recognition more reliable in noise
  - Often include deltas and delta deltas
  - Sometimes do feature reduction
      - Principal Component Analysis

What phones/segments
  - Need the best set for discrimination
  - Not necessarily the same as Linguistic Phones
  - More phones means more training
  - And needs to have consistent Lexicon

Extra phones
  - t vs dx
  - t vs nx: /t w eh n t iy/ vs /t w eh nx iy/
  - Stops as closures and bursts
  - Schwas: ax and ix
  - Syllabics: el, em, en
  - Accents/Tones: ah1, ah0, ….

Context dependency
  - Care about the contexts of each phone
  - Post vocalic /r/ and /n/ /m/ affect vowel
  - Utterance start and end affect phonemes
  - Need more than simple phone models

Tri-phone Models
  - Have models for each phone and context
  - 43^3 contexts, about 80K models
  - Not all contexts have enough examples
      - oy(oy)oy very rare
      - sh(ax)n very common
  - Merge tri-phones that are similar
      - E.g. t(ih)n with d(ih)n

Find phones to merge
  - Using phonetic features
      - Most similar feature, most similar acoustics
      - Stops, voicing, vowel type …
  - Usually automatic cluster of triphones
      - Using CART trees indexed by phonetic
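
To make the tri-phone arithmetic concrete, here is a minimal Python sketch. It assumes the left(center)right label notation suggested by the slide examples such as sh(ax)n, and it pads the utterance with a hypothetical boundary symbol "sil"; the function name to_triphones is illustrative and not part of any toolkit.

```python
def to_triphones(phones, boundary="sil"):
    """Expand a phone sequence into left(center)right tri-phone labels.

    The boundary symbol is a hypothetical placeholder for the utterance
    start and end, which also act as context for the first and last phone.
    """
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}({padded[i]}){padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


# With 43 phones, every (left, center, right) combination is a candidate model.
num_phones = 43
num_contexts = num_phones ** 3
print(num_contexts)  # 79507, i.e. about 80K

# "twenty" using the phone string from the Extra phones slide.
twenty = ["t", "w", "eh", "n", "t", "iy"]
print(to_triphones(twenty))
```

Counting how often each label occurs in a training corpus would show the imbalance the slide describes: a few contexts are very common and most are rare, which is why similar tri-phones are merged.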
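
Going back to the Word Error Rate slides: the formula needs an alignment between the ASR output and the transcript, not just left to right matching. One common way to get the minimum number of substitutions, deletions and insertions is a word-level edit distance. The sketch below is one possible implementation under that assumption, not necessarily the scoring tool used in the course; the function name word_error_rate is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent: 100 * (Subs + Dels + Ins) / words in reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcription must contain at least one word")

    # d[i][j] = minimum number of edits turning the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)

    return 100.0 * d[len(ref)][len(hyp)] / len(ref)


# One substitution in four words; the meaning is reversed but WER is only 25%.
print(word_error_rate("the specimen is absent", "the specimen is present"))  # 25.0

# Two insertions against a one-word reference: WER can exceed 100%.
print(word_error_rate("yes", "yes the the"))  # 200.0
```

The first call illustrates the point that all words are treated equally: swapping “absent” for “present” costs the same as any other single-word error.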