Speech Processing 15-492/18-492
Voice Conversion 2

De-identification
• Remove speaker identity
• But keep it still human-like
• Health records
– HIPAA laws require this
– Not just removing names and SSNs
• Use voice conversion to get “new” voices

De-identification
• Best would be ASR to text and TTS
– But it would all sound the same
– And it would lose spontaneity
• VC to some example speaker
– But then you’d need lots of example speakers
– And you can still detect some properties of the source speaker

De-identification by GMM VC
• Using standard GMM VC
• Can still identify 50% of the voices (out of 24)
• (Humans can’t, but machines can)
• Need something more extreme

GMM VC plus duration
• Find the average duration of source and target speakers
• Modify length of speech by a factor
• Has to be an overall factor, as no phoneme information is available
• For de-identification
– < 30% still identified
– Need to be more extreme

GMM VC + dur + Transterpolate
• What about transterpolation > 1.0?
• Certainly gives another voice
• Moved further away from source
• Original; GMM to kal; GMM to kal trans 1.2; GMM to kal trans 2.0

VC and evaluation
• How can you tell if VC works?
• Ask people
– Does it sound like the target
– (Does it not sound like the source)
• Use objective metrics
– Some automatic score

Human listening tests
• ABX tests
– Source, target and transformed speech
– Does X sound more like A or B
• AX tests
– Source/target and transformed speech
– Were A and X produced by the same or different speakers
• Over multiple listeners you can get consensus
• Note different results for A->B than B->A

What if you know the speakers?
• Test with CMU voices
– CMU listeners
– Non-CMU listeners
• Results are still basically the same
• Voices that convert better are the same
• Knowing the speakers doesn’t make a difference

Objective Measures
• Need to have an automatic measure too
• Mel-Cepstral Distortion
– Euclidean distance between MFCCs
– Lower order MFCCs have bigger magnitudes
– Thus this is scaled to favor lower order MFCCs
– Scaling factor is to make it a nice number
• Results are between 3.5 and 6.5 (smaller is better)

Cross Lingual Voice Conversion
• Have your voice in another language
• Speech to speech translation systems
• It’s OK to be non-native accented
• But we need parallel data
– But I can’t speak X
– Fake it, but it should be *very* accented
– Use source phones, but it sounds *very* *very* accented

Cross-Lingual VC
• Find a bilingual speaker
– A(german) and A(english)
• For your non-bilingual speaker B(english)
– Build A(english)->B(english) VC model
• For cross-lingual B(german)
– A(german) plus A(english)->B(english)
• Sort of works, but not very well

Cross-lingual VC
• A voice has both
– Speaker specific components
– Language specific components
• For CLVC you want to separate these
• How do you evaluate it?
– With bilingual speaker
– With human listeners
– Does it have an accent
– Where is this person from
– Is this the same person speaking

Backwards speech
• Playing speech backwards
– Still has speaker properties
– Can be language independent
• Speaker 1; Speaker 2; Speaker 2 or 1

New Language with VC
• In order to build support in new languages
– Use existing language databases
– Find “similar” phones in different languages
– Synthesis with these “similar” phones
• Collect parallel data
– Xenophone synthesis (X)
– Native speaker (N)
• Build VC between X->N
– Use as filter on xenophone synthesis
• Sort of works
• Some people don’t do the VC stage!

VC factors
• Can be done in real time
• Delay needn’t be more than
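
Worked example: Mel-Cepstral Distortion

The Objective Measures slide describes Mel-Cepstral Distortion only as a scaled Euclidean distance between MFCCs and does not give the formula. The sketch below assumes the commonly used definition, (10 / ln 10) * sqrt(2 * sum of squared differences) per frame, averaged over frames; the slides may intend a different scaling. It also assumes the two utterances are already time-aligned frame by frame (for example by dynamic time warping) and that column 0 holds the energy term, which is skipped. The function name and array layout are illustrative, not taken from the course.

```python
import math

import numpy as np


def mel_cepstral_distortion(target_mfcc, converted_mfcc):
    """Mean Mel-Cepstral Distortion between two aligned MFCC sequences.

    Both inputs have shape (frames, coefficients) and must already be
    time-aligned. Column 0 is assumed to be the energy term and is ignored.
    The constant 10 / ln(10) * sqrt(2) is the commonly used scaling; it is
    an assumption here, since the slides do not state the formula.
    """
    target = np.asarray(target_mfcc, dtype=float)
    converted = np.asarray(converted_mfcc, dtype=float)
    if target.ndim != 2 or target.shape != converted.shape:
        raise ValueError("inputs must be 2-D arrays of identical shape")

    # Per-frame difference over coefficients 1..D (energy term dropped).
    diff = target[:, 1:] - converted[:, 1:]

    # Scaled Euclidean distance for each frame.
    per_frame = (10.0 / math.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))

    # Average over all frames to get one score per utterance pair.
    return float(np.mean(per_frame))
```

A lower score means the converted speech is spectrally closer to the target, consistent with the "smaller is better" note on the slide. Calling the function with two identical arrays returns 0.0.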