Speech Processing 15-492/18-492
Speaker ID

Speaker ID / Speaker Recognition
- Who is speaking?

When do you use it?
- Security, access control
- Speaker-specific modeling: recognize the speaker and use their options
- Diarization in multi-speaker environments:
  - Assign speech to different people
  - Allow questions like "Did Fred agree or not?"

Voice Identity
What makes a voice identity?
- Lexical choice: "Woo-hoo", "I pity the fool ..."
- Phonetic choice
- Intonation and duration
- Spectral qualities (vocal tract shape)
- Excitation
But which is most discriminative?

GMM Speaker ID
- Looks only at the spectral part, which roughly reflects vocal tract shape
- Build a single Gaussian of MFCCs: means and standard deviations over all the speech
- In practice, build an N-mixture Gaussian (32 or 64 mixtures)
- Build a model for each speaker
- Take test data and see which model it is closest to

GMM Speaker ID: How close does it need to be?
- One or two standard deviations?
- The set of speakers needs to be different
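The per-speaker modeling and closest-model scoring described above can be sketched in a few lines. This is a minimal illustration, not the course's actual system: it fits a single diagonal Gaussian per speaker (where a real system would use a 32- or 64-mixture GMM trained with EM) and scores test frames by average log-likelihood. All function names and the toy 2-D feature vectors are hypothetical.

```python
import math

def train_speaker_model(frames):
    """Fit a single diagonal Gaussian (per-dimension means and std devs)
    to one speaker's feature frames, e.g. MFCC vectors."""
    dim, n = len(frames[0]), len(frames)
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [max(1e-6, math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n))
            for d in range(dim)]
    return means, stds

def avg_log_likelihood(model, frames):
    """Average per-frame log-likelihood of the frames under the model."""
    means, stds = model
    total = 0.0
    for f in frames:
        for x, m, s in zip(f, means, stds):
            total += -0.5 * math.log(2 * math.pi * s * s) - (x - m) ** 2 / (2 * s * s)
    return total / len(frames)

def identify(models, test_frames):
    """Choose the speaker whose model scores the test frames highest."""
    return max(models, key=lambda spk: avg_log_likelihood(models[spk], test_frames))
```

A production system would also need a rejection threshold (or a background model) to return "none of the known speakers", which is exactly the "how close does it need to be?" question.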
- If models are closer than one or two standard deviations, you get confusion
- Should you have a "general" model? (not one of the set of training speakers)

GMM Speaker ID: when it works
- Works well on constrained tasks:
  - Similar acoustic conditions (not telephone vs wide-band)
  - Same spoken style as the training data
  - Cooperative users
- Doesn't work well when:
  - The speaking style differs (conversation vs lecture)
  - The speaker is shouting or whispering
  - The speaker has a cold
  - The language differs

Speaker ID Systems
- Training:
  - Example speech from each speaker
  - Build a model for each speaker (maybe an exception model too)
- ID phase:
  - Compare test speech to each model
  - Choose the "closest" model (or none)

Basic Speaker ID System: Accuracy
- Works well on smaller sets: 20-50 speakers
- As the number of speakers increases, models begin to overlap and speakers get confused
- What can we do to get better distinctions?

What about transitions?
- Not just modeling isolated frames: look at phone sequences
- But ASR has lots of variation and covers a limited amount of phonetic space
- What about lots of ASR engines?

Phone-based Speaker ID
- Use *lots* of ASR engines, but they need to be different engines
- Use ASR engines from lots of different languages
- It doesn't matter what language the speech is in
- Many different engines give lots of variation
- Build models of which phones are recognized (actually HMM states, not phones)

Phone-based SID (Jin)
- Much better distinctions on larger datasets
- Can work with 100+ voices
- Slightly more robust across styles and channels

But we need more ...
- Combined models: GMM models + phone-based models
- Combining them gives slightly better results
- What else? Prosody (duration and F0)

Can VC beat Speaker ID?
- Can we fake voices? Can we fool speaker ID systems? Can we make lots of money out of it?
- Yes to the first two (Jin, Toth, Black and Schultz, ICASSP 2008)

Training/Testing Corpus
- LDC CSR-I (WSJ0): US English studio read speech
- 24 male speakers
- 50 sentences for training, 5 for test, plus 40 additional training sentences
- Average sentence length: 7 s
- VT source speakers:
  - Kal_diphone (synthetic speech)
  - US English male natural speaker (not all sentences)

Experiment I
- VT GMM, Kal_diphone source
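The "combined models" idea above can be read as score-level fusion: each system (GMM and phone-based) produces a per-speaker score, and a weighted sum decides. The sketch below is hypothetical; the interpolation weight, the rejection threshold, and the function names are illustrative and not taken from the Jin et al. system.

```python
def fuse_scores(gmm_scores, phone_scores, weight=0.5):
    """Weighted sum of per-speaker log scores from two systems.
    weight is an illustrative interpolation weight for the GMM system."""
    return {spk: weight * gmm_scores[spk] + (1.0 - weight) * phone_scores[spk]
            for spk in gmm_scores}

def identify_combined(gmm_scores, phone_scores, weight=0.5, reject_below=None):
    """Pick the best-scoring speaker after fusion, or None if even the best
    fused score is too low ("not one of the set of training speakers")."""
    fused = fuse_scores(gmm_scores, phone_scores, weight)
    best = max(fused, key=fused.get)
    if reject_below is not None and fused[best] < reject_below:
        return None
    return best
```

In practice the weight would be tuned on held-out data, and the same fusion scheme extends to adding a prosody-based score as a third term.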