CMU CS 15492 - Speech Recognition Acoustic modeling Pronunciation dictionary

This preview shows page 1-2-3-26-27-28 out of 28 pages.


Speech Processing 15-492/18-492
Speech Recognition
Acoustic modeling
Pronunciation dictionary

Acoustic Modeling
- Speech and Signal Variability
- Measuring Error
- Pronunciation lexicons

Variability in Speech Signal
- "Mr Wright should write to Ms Wright right away about his Ford or four door Honda."
- Homophones: same pronunciation
- "wright" "right" "write" / r ay t /
- "ford or" "four door" / f ao r d ao r /

Style Variability
- Different articulation in different situations
- Clear vs conversational
- Whisper vs shouting
- Talking to machine, talking to others
- Frustrated speech

Speaker Variability
- Gender, age, dialect, health
- Speaker dependent systems
- Speaker independent systems
- Speaker adaptive systems
- Enrolment stage (acoustics and language)

Environment Variability
- Different background noises
- Office vs outside
- Different applications, different environments
- Desktop dictation, to warehouse pick
- Single speaker vs multispeaker
- Background music

Channel Variability
- Telephone vs desktop
- 8 kHz vs 16 kHz
- PDA vs desktop
- Close-talking vs far-field
- Cell phone vs landline

Measuring Speech Recognition Error
- Word Error Rate
- Substitutions: word is replaced
- Deletions: word is missed out
- Insertions: word is added
- WER = 100% x (Subs + Dels + Ins) / (words in correct sentence)

Word Error Rate
- WER requires:
- Transcription (the correct word string)
- Alignment between ASR output and transcript
- Not just left to right matching
- Sometimes Accuracy is given
- 100 - WER
- NOT number of words correct

Word Error Rate
- Can get > 100%
- But something is very wrong
- Outputting "the" only, ignoring the speech
- Sometimes gives WER < 100%
- All words are treated equal
- "This specimen" vs "The specimen"
- "Is absent" vs "Is present"

Signal Acquisition
- High signal quality
- Lower sample rate will increase WER
- 8 kHz baseline
- 16 kHz: -10%

End-Point Detection
- Long silence will likely increase WER
- It will recognize phantom words
- Need to find the speech in the signal
- VAD (Voice Activity Detection)
- Find beginning and end of speech
- Typically do continuous recognition
- Recognized while listening
- But need end point (have to wait)

Feature Normalization
- Sometimes do normalization
- Remove mean from MFCCs
- Can make recognition more reliable in noise
- Often include deltas and delta deltas
- Sometimes do feature reduction
- Principal Component Analysis

What Phones/Segments
- Need the best set for discrimination
- Not necessarily the same as linguistic phones
- More phones means more training
- And needs to have a consistent lexicon

Extra Phones
- t vs dx
- t vs nx: / t w eh n t iy / vs / t w eh nx iy /
- Stops as closures and bursts
- Schwas: ax and ix
- Syllabics: el, em, en
- Accents/Tones: ah1, ah0, ...

Context Dependency
- Care about the contexts of each phone
- Post vocalic /r/ and /n/ /m/ affect vowel
- Utterance start and end affect phonemes
- Need more than simple phone models

Tri-phone Models
- Have models for each phone and context
- 43^3 contexts, about 80K models
- Not all contexts have enough examples
- oy(oy)oy very rare
- sh(ax)n very common
- Merge tri-phones that are similar
- E.g. t(ih)n with d(ih)n

Find Phones to Merge
- Using phonetic features
- Most similar feature, most similar acoustics
- Stops, voicing, vowel type ...
- Usually automatic clustering of triphones
- Using CART trees indexed by phonetic features
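
The Word Error Rate slides above say that WER needs an alignment between the ASR output and the transcript, not just left to right matching. A minimal sketch of one common way to do this, assuming a standard edit-distance (Levenshtein) alignment over whitespace-separated words; the function name `word_error_rate` is hypothetical and not taken from the slides:

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent: 100 * (subs + dels + ins) / words in reference.

    Assumption: words are whitespace-separated and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # dist[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # j insertions

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)

    return 100.0 * dist[len(ref)][len(hyp)] / len(ref)


# Two substitutions in a four-word reference.
print(word_error_rate("the specimen is present", "this specimen is absent"))
# One substitution plus two insertions against a one-word reference.
print(word_error_rate("hello", "the the the"))
```

The second call illustrates how WER can exceed 100%: insertions are counted in the numerator, but only the reference words are counted in the denominator. Accuracy, when reported as in the slides, would be 100 minus this value.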

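The Feature Normalization slide mentions removing the mean from MFCCs and appending deltas and delta deltas. A rough sketch of both steps, assuming each utterance is a list of equal-length feature frames (lists of floats). The delta formula here is a simple two-frame symmetric difference with edge frames repeated, which is an assumption for illustration; real front ends often use a wider regression window. All function names are hypothetical.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean over the utterance from every frame."""
    if not frames:
        return []
    n = len(frames)
    dim = len(frames[0])
    means = [sum(frame[k] for frame in frames) / n for k in range(dim)]
    return [[frame[k] - means[k] for k in range(dim)] for frame in frames]


def deltas(frames):
    """Approximate the time derivative as (x[t+1] - x[t-1]) / 2.

    Assumption: the first and last frames are repeated at the edges.
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev_frame, next_frame)])
    return out


def add_deltas(frames):
    """Concatenate static features, deltas, and delta deltas per frame."""
    d1 = deltas(frames)
    d2 = deltas(d1)
    return [f + a + b for f, a, b in zip(frames, d1, d2)]


# Toy two-coefficient "MFCC" frames, purely illustrative values.
toy = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = cepstral_mean_normalize(toy)
features = add_deltas(normalized)
print(len(features[0]))  # static + delta + delta-delta coefficients per frame
```

Mean removal cancels a constant offset in the cepstral domain, which is one reason it can help when the channel or background differs between training and use.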

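The Tri-phone Models slide quotes 43^3 contexts, about 80K models. The arithmetic, and the way a pronunciation expands into left(center)right triphones, can be sketched as follows. The `sil` boundary symbol and the function name `triphones` are assumptions for illustration; the slides do not say how utterance edges are labeled.

```python
NUM_PHONES = 43  # phone set size quoted on the slide

# Every phone in every (left, right) context: 43 * 43 * 43 models.
print(NUM_PHONES ** 3)


def triphones(phones, boundary="sil"):
    """Expand a phone sequence into left(center)right triphone labels.

    Assumption: `boundary` stands in for the missing context at each edge.
    """
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}({padded[i]}){padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


# "twenty" using the phone string from the Extra Phones slide.
print(triphones("t w eh n t iy".split()))
```

Because many of these contexts occur rarely or never in training data, the slides go on to merge similar triphones rather than train every one separately.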