CMU CS 15492 - Speech Recognition Acoustic modeling Pronunciation dictionary - D1382512

School name Carnegie Mellon University

Course Cs 15492- Special Topic: Speech Processing

Pages 28

This preview shows page 1-2-3-26-27-28 out of 28 pages.

Speech Processing 15-492/18-492
Speech Recognition: Acoustic modeling, Pronunciation dictionary

Acoustic Modeling
  Speech and Signal Variability
  Measuring Error
  Pronunciation lexicons

Variability in Speech Signal
  "Mr Wright should write to Ms Wright right away about his Ford or four door Honda."
  Homophones: same pronunciation
    "wright" "right" "write": / r ay t /
    "ford or" "four door": / f ao r d ao r /

Style Variability
  Different articulation in different situations
    Clear vs conversational
    Whisper vs shouting
    Talking to machine, talking to others
    Frustrated speech

Speaker Variability
  Gender, age, dialect, health
  Speaker dependent systems
  Speaker independent systems
  Speaker adaptive systems
    Enrolment stage (acoustics and language)

Environment Variability
  Different background noises
    Office vs outside
  Different applications, different environments
    Desktop dictation, to warehouse pick
    Single speaker vs multispeaker
    Background music

Channel Variability
  Telephone vs desktop
    8 kHz vs 16 kHz
  PDA vs desktop
    Close-talking vs far-field
  Cell phone vs landline

Measuring Speech Recognition Error
  Word Error Rate (WER)
    Substitutions: word is replaced
    Deletions: word is missed out
    Insertions: word is added
  WER = 100% x (Subs + Dels + Ins) / (words in correct sentence)

Word Error Rate
  WER requires:
    Transcription (the correct word string)
    Alignment between ASR output and transcript
      Not just left-to-right matching
  Sometimes accuracy is given
    100 - WER
    NOT the number of words correct

Word Error Rate
  Can get > 100%
    But something is very wrong
  Outputting "the" only, ignoring the speech
    Sometimes gives WER < 100%
  All words are treated equally
    "This specimen" vs "The specimen"
    "Is absent" vs "Is present"

Signal Acquisition
  High signal quality
  Lower sample rate will increase WER
    8 kHz: baseline
    16 kHz: -10%

End-Point Detection
  Long silence will likely increase WER
    It will recognize phantom words
  Need to find the speech in the signal
    VAD (Voice Activity Detection)
    Find beginning and end of speech
  Typically do continuous recognition
    Recognize while listening
    But need end point (have to wait)

Feature Normalization
  Sometimes do normalization
    Remove mean from MFCCs
    Can make recognition more reliable in noise
  Often include deltas and delta-deltas
  Sometimes do feature reduction
    Principal Component Analysis

What Phones/Segments
  Need the best set for discrimination
  Not necessarily the same as linguistic phones
  More phones means more training
  And needs a consistent lexicon

Extra Phones
  t vs dx
  t vs nx: / t w eh n t iy / vs / t w eh nx iy /
  Stops as closures and bursts
  Schwas: ax and ix
  Syllabics: el, em, en
  Accents/Tones: ah1, ah0, ...

Context Dependency
  Care about the contexts of each phone
    Post-vocalic /r/ and /n/ /m/ affect the vowel
    Utterance start and end affect phonemes
  Need more than simple phone models

Tri-phone Models
  Have models for each phone and context
    43^3 contexts: about 80K models
  Not all contexts have enough examples
    oy(oy)oy: very rare
    sh(ax)n: very common
  Merge tri-phones that are similar
    E.g. t(ih)n with d(ih)n

Find Phones to Merge
  Using phonetic features
    Most similar features, most similar acoustics
    Stops, voicing, vowel type ...
  Usually automatic clustering of triphones
    Using CART trees indexed by phonetic
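
The Word Error Rate slides say that WER needs an alignment between the ASR output and the transcript, not just left-to-right matching. A minimal sketch of one common way to do this is a minimum edit distance alignment over words. The function names `align_counts` and `word_error_rate` are hypothetical, and splitting on whitespace is an assumption; the slides do not specify a tokenization or a particular alignment tool.

```python
def align_counts(ref, hyp):
    """Minimum edit distance alignment of two word lists.

    Returns (errors, subs, dels, ins) for the cheapest alignment,
    where every substitution, deletion and insertion costs 1.
    """
    # prev[j]: best counts for aligning zero reference words with hyp[:j],
    # which can only be done with j insertions.
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        # Aligning ref[:i] with an empty hypothesis takes i deletions.
        cur = [(i, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            e, s, d, n = prev[j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (e, s, d, n)          # words match, no error
            else:
                diag = (e + 1, s + 1, d, n)  # substitution
            e, s, d, n = prev[j]
            deletion = (e + 1, s, d + 1, n)  # reference word missed out
            e, s, d, n = cur[j - 1]
            insertion = (e + 1, s, d, n + 1) # extra word in the output
            cur.append(min(diag, deletion, insertion))
        prev = cur
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER in percent: 100 * (Subs + Dels + Ins) / words in the reference."""
    ref = reference.split()  # assumption: whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    errors, _subs, _dels, _ins = align_counts(ref, hyp)
    return 100.0 * errors / len(ref)


if __name__ == "__main__":
    # Two substitutions out of four reference words.
    print(word_error_rate("the specimen is absent", "this specimen is present"))
    # Outputting "the" only: three deletions, still below 100%.
    print(word_error_rate("is the cat here", "the"))
    # Three insertions against a one-word reference: above 100%.
    print(word_error_rate("hello", "hello there big world"))
```

The three calls mirror the cases on the slides: equal weighting of all words, the degenerate "the"-only recognizer, and a WER above 100% caused by insertions. Accuracy as 100 - WER follows directly from the returned value.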