Speech Processing 15-492/18-492
Review
ASR, TTS, Dialog, S2S, VC, SID and CALL

Speech Overview
  • ASR
      Automatic Speech Recognition (AM and LM)
  • TTS
      Text to speech: unit selection and statistical parametric synthesis
  • Dialog
      Spoken dialog systems: VoiceXML, direct and mixed initiative dialogs
  • VC
      Voice conversion, transformation, morphing
  • SID
      Speaker ID, speaker recognition
  • CALL
      Computer Aided Language Learning
  • S2S
      Speech to Speech translation

ASR
  • Acoustic models
      Acoustic models (usually HMMs)
      Modeling all ways to say each phoneme
  • Language models
      Modeling word sequence likelihoods
      Tri-grams and grammars

ASR
  • ASR and Bayes rule
      By Bayes rule, recognition splits into an acoustic model and a language model

ASR Evaluation
  • WER
      Word error rate vs accuracy
  • What is the expected/acceptable WER of
      Dictation
      Dialog systems
      Speech IR
      Conversational speech with a far-field microphone, with multiple overlapping non-native speakers (who know each other), with heavy vehicle traffic in the background

TTS
  • Text analysis
      Homographs, symbols, expansion
  • Linguistic analysis
      Pronunciation lexicons
      Prosody: breaks, intonation, duration
  • Waveform synthesis
      Formant synthesis, concatenative synthesis, statistical parametric synthesis

Waveform Synthesis
  • Diphones
      Mid-phone to mid-phone speech units
  • Unit selection
      Selecting appropriate sub-word units from large databases of natural speech
  • Statistical parametric speech
      Build speech model of "averages" of similar speech
  • Limited domain synthesis
      Targeted synthesis

TTS Evaluation
  • Yes, that sounds like a robot
  • Human listening tests
      MOS scale for "likable"
      SUS sentences for understandability
      Human personal preferences

Spoken Dialog Systems
  • VoiceXML (and SALT)
      Tree-based dialog systems
  • Olympus
      More general dialog systems
  • System types:
      System initiative
      Mixed initiative
      HMIHY (How may I help you)

Spoken Dialog System Evaluation
  • Task completion
  • Call length
  • Number of turns
  • (Number of calls)
  • Break down by
      New/repeat callers
      Different usage types

New Languages
  • Text examples
      For finding nice prompts
      For building language models
  • Phoneme definitions
  • Pronunciation lexicon
  • Recordings
      Lots for ASR, one good one for TTS

Speech to Speech
  • Real time
  • Targeted/wide vocabulary
  • Speech not text
  • Often resource-limited target language
      Need a written form, and collect own data

Voice Conversion
  • Convert source speech to target speaker
  • Small amount of target speaker data (e.g. 30 utts)
  • GMM-based models
  • Uses
      Speaker conversion
      Style conversion
      Cross-lingual voice conversion
      De-identification
  • Evaluation
      Listening
      Speaker ID systems

Speaker ID
  • Speaker recognition
      Who is speaking
      Security, passwd access
      Diarization (who is speaking in a meeting)
  • Speaker, language, dialect, style ID
  • Techniques
      GMM and phone-based techniques

CALL
  • Computer aided language learning
  • Reading tutors
      First and second language learners
  • Second language learners
      Pronunciation trainers
      Fluency practice
      Interactive scenario experience
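
The equation on the "ASR and Bayes rule" slide did not survive in this text. As a reminder, the standard formulation that the "Acoustic model" and "Language model" labels most plausibly refer to is sketched below. The symbols are assumptions, not taken from the slide: W is a candidate word sequence and A is the observed acoustics.

```latex
\hat{W} = \arg\max_{W} P(W \mid A)
        = \arg\max_{W} \frac{P(A \mid W)\, P(W)}{P(A)}
        = \arg\max_{W}
          \underbrace{P(A \mid W)}_{\text{acoustic model}}
          \;
          \underbrace{P(W)}_{\text{language model}}
```

The denominator P(A) can be dropped because it does not depend on W, so it does not change which word sequence wins the argmax.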
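
For the "ASR Evaluation" slide, a minimal sketch of how word error rate is usually computed: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; real scoring tools also normalize case and punctuation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    print(word_error_rate("a b", "a x b y z"))
```

Because insertions count as errors, WER can exceed 1.0 (the second call has three insertions against two reference words). That is one reason "word accuracy", taken as 1 minus WER, is not the same thing as the fraction of words recognized correctly.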
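
For the "Language models" bullet (modeling word sequence likelihoods with tri-grams), a minimal sketch of an unsmoothed maximum-likelihood trigram model. The helper names `train_trigram` and `trigram_prob` and the `<s>` / `</s>` padding tokens are assumptions for illustration; practical systems add smoothing or backoff so that unseen trigrams do not get probability zero.

```python
from collections import Counter


def train_trigram(sentences):
    """Count trigrams and their two-word contexts over whitespace-tokenized sentences."""
    trigram_counts = Counter()
    context_counts = Counter()
    for sentence in sentences:
        # Pad so the first real word has a two-word context, and mark the end.
        words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for i in range(2, len(words)):
            trigram_counts[(words[i - 2], words[i - 1], words[i])] += 1
            context_counts[(words[i - 2], words[i - 1])] += 1
    return trigram_counts, context_counts


def trigram_prob(trigram_counts, context_counts, w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1, w2); 0.0 for an unseen context."""
    context = context_counts[(w1, w2)]
    if context == 0:
        return 0.0
    return trigram_counts[(w1, w2, w3)] / context


if __name__ == "__main__":
    tri, ctx = train_trigram(["the cat sat", "the cat ran"])
    print(trigram_prob(tri, ctx, "the", "cat", "sat"))
```

In the toy corpus, "the cat" is followed by "sat" once out of two occurrences, so the estimate is count("the cat sat") / count("the cat").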