CMU CS 15492 - Speaker ID



Speech Processing 15-492/18-492
Speaker ID

Who is speaking?
- Speaker ID, Speaker Recognition
- When do you use it?
  - Security, access
  - Speaker-specific modeling
    - Recognize the speaker and use their options
  - Diarization
    - In multi-speaker environments
    - Assign speech to different people
    - Allow questions like "did Fred agree or not?"

Voice Identity
- What makes a voice identity?
  - Lexical choice: "Woo-hoo", "I pity the fool …"
  - Phonetic choice
  - Intonation and duration
  - Spectral qualities (vocal tract shape)
  - Excitation
- But which is most discriminative?

GMM Speaker ID
- Just looking at the spectral part
  - Which is, roughly, vocal tract shape
- Build a single Gaussian of MFCCs
  - Means and standard deviations of all speech
  - Actually build an N-mixture Gaussian (32 or 64)
- Build a model for each speaker
- Use test data and see which model it is closest to

GMM Speaker ID
- How close does it need to be?
  - One or two standard deviations?
- The set of speakers needs to be different
  - If they are closer than one or two standard deviations, you get confusion
- Should you have a "general" model?
  - Not one of the set of training speakers

GMM Speaker ID
- Works well on constrained tasks
  - In similar acoustic conditions (not phone vs wide-band)
  - Same spoken style as training data
  - Cooperative users
- Doesn't work well when
  - Different speaking style (conversation/lecture)
  - Shouting, whispering
  - Speaker has a cold
  - Different language

Speaker ID Systems
- Training
  - Example speech from each speaker
  - Build models for each speaker
  - (maybe an exception model too)
- ID phase
  - Compare test speech to each model
  - Choose "closest" model (or none)

Basic Speaker ID system

Accuracy
- Works well on smaller sets
  - 20-50 speakers
- As the number of speakers increases
  - Models begin to overlap, so speakers get confused
- What can we do to get better distinctions?

What about transitions
- Not just modeling isolated frames
- Look at phone sequences
- But ASR
  - Lots of variation
  - Limited amount of phonetic space
- What about lots of ASR engines?

Phone-based Speaker ID
- Use *lots* of ASR engines
  - But they need to be different ASR engines
- Use ASR engines from lots of different languages
  - It doesn't matter what language the speech is
  - Use many different ASR engines
  - Gives lots of variation
- Build models of what phones are recognized
  - Actually we use HMM states, not phones

Phone-based SID (Jin)

Phone-based Speaker ID
- Much better distinctions for larger datasets
  - Can work with 100-plus voices
- Slightly more robust across styles/channels

But we need more …
- Combined models
  - GMM models
  - Phone-based models
  - Combine them
  - Slightly better results
- What else …
  - Prosody (duration and F0)

Can VC beat Speaker-ID
- Can we fake voices?
- Can we fool Speaker ID systems?
- Can we make lots of money out of it?
- Yes to the first two
  - Jin, Toth, Black and Schultz, ICASSP 2008

Training/Testing Corpus
- LDC CSR-I (WSJ0)
  - US English studio read speech
  - 24 male speakers
  - 50 sentences training, 5 test
  - Plus 40 additional training sentences
  - Average sentence length is 7 s
- VT source speakers
  - Kal_diphone (synthetic speech)
  - US English male natural speaker (not all sentences)

Experiment I
- VT GMM
  - Kal_diphone source
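
The training and ID phases described in the GMM Speaker ID and Speaker ID Systems slides can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the method used in the lecture: it fits only a single diagonal Gaussian per speaker (the "means and standard deviations" step) rather than a 32- or 64-mixture GMM, it uses synthetic vectors in place of real MFCC frames, and it treats the "general" model as a Gaussian fitted to pooled speech from speakers outside the enrolled set. All function names (`fit_gaussian`, `avg_log_likelihood`, `identify`, `fake_frames`) and the speaker name "anna" are hypothetical.

```python
import math
import random


def fit_gaussian(frames):
    """Training: per-dimension mean and standard deviation over all frames."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = []
    for d in range(dim):
        var = sum((f[d] - means[d]) ** 2 for f in frames) / n
        stds.append(max(math.sqrt(var), 1e-3))  # floor avoids zero variance
    return means, stds


def avg_log_likelihood(frames, model):
    """Average per-frame log-likelihood under a diagonal Gaussian."""
    means, stds = model
    total = 0.0
    for f in frames:
        for x, m, s in zip(f, means, stds):
            total += -0.5 * math.log(2 * math.pi * s * s) - (x - m) ** 2 / (2 * s * s)
    return total / len(frames)


def identify(frames, models, background=None):
    """ID phase: pick the closest speaker model, or None if the
    background ("general") model explains the test speech at least as well."""
    scores = {name: avg_log_likelihood(frames, m) for name, m in models.items()}
    best = max(scores, key=scores.get)
    if background is not None and avg_log_likelihood(frames, background) >= scores[best]:
        return None
    return best


def fake_frames(rng, center, n, dim=4):
    """Assumption: synthetic stand-in for MFCC frames drawn around `center`."""
    return [[rng.gauss(center, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)

# One model per enrolled speaker.
models = {
    "fred": fit_gaussian(fake_frames(rng, 0.0, 200)),
    "anna": fit_gaussian(fake_frames(rng, 3.0, 200)),
}

# Background model from pooled speech of speakers not in the enrolled set.
pooled = []
for center in (-6.0, -3.0, 1.5, 6.0, 9.0):
    pooled.extend(fake_frames(rng, center, 200))
background = fit_gaussian(pooled)

print(identify(fake_frames(rng, 0.0, 50), models, background))  # enrolled speaker
print(identify(fake_frames(rng, 8.0, 50), models, background))  # unknown speaker
```

Without the `background` argument, `identify` always returns some enrolled speaker, even for a voice it has never seen. The background comparison is one simple way to get the "or none" outcome; a fixed likelihood threshold would be another.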

