CMU CS 15492 - vc2


Speech Processing 15-492/18-492
Voice Conversion 2

De-identification
• Remove speaker identity
• But keep it still human-like
• Health records
• HIPAA laws require this
• Not just removing names and SSNs
• Use voice conversion to get “new” voices

De-identification
• Best would be ASR to text and TTS
• But it would all sound the same
• And it would lose spontaneity
• VC to some example speaker
• But then you’d need lots of example speakers
• And you can still detect some properties of the source speaker

De-identification by GMM VC
• Using standard GMM VC
• Can still identify 50% of the voices (out of 24)
• (Humans can’t, but machines can)
• Need something more extreme

GMM VC plus duration
• Find the average duration of source and target speakers
• Modify length of speech by factor
• Has to be an overall factor as no phoneme information is available
• For de-identification
• < 30% still identified
• Need to be more extreme

GMM VC + dur + Transterpolate
• What about transterpolation > 1.0?
• Certainly gives another voice
• Moved further away from source
• Original
• GMM to kal
• GMM to kal trans 1.2
• GMM to kal trans 2.0
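
The overall duration factor and the transterpolation step on the two slides above can be sketched as follows. This is a minimal illustration, not the method used in the lecture. It assumes the factor is the ratio of mean target duration to mean source duration, and that transterpolation is a linear move from the source features toward the converted features (alpha = 1.0) or past them (alpha > 1.0). The function names and the numbers are hypothetical.

```python
def duration_factor(source_durations, target_durations):
    """Overall scaling factor: mean target duration / mean source duration (assumed direction)."""
    mean_source = sum(source_durations) / len(source_durations)
    mean_target = sum(target_durations) / len(target_durations)
    return mean_target / mean_source


def transterpolate(source_frame, converted_frame, alpha):
    """Assumed form: source + alpha * (converted - source); alpha > 1.0 extrapolates past the converted frame."""
    return [s + alpha * (c - s) for s, c in zip(source_frame, converted_frame)]


# Hypothetical utterance durations in seconds.
factor = duration_factor([2.0, 4.0], [3.0, 6.0])

# Hypothetical two-dimensional feature frames.
source = [1.0, 2.0]
converted = [2.0, 4.0]
at_target = transterpolate(source, converted, 1.0)
beyond_target = transterpolate(source, converted, 2.0)
print(factor, at_target, beyond_target)
```

With alpha = 1.0 the result is the plain converted frame. Larger values push the voice further from the source, which is the effect the slide describes for de-identification.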

VC and evaluation
• How can we tell if VC works?
• Ask people
• Does it sound like the target?
• (Does it not sound like the source?)
• Use objective metrics
• Some automatic score

Human listening tests
• ABX tests
• Source, target and transformed speech
• Does X sound more like A or B?
• AX tests
• Source/target and transformed speech
• Were A and X produced by the same or different speakers?
• Over multiple listeners you can get consensus
• Note different results for A->B than B->A

What if you know the speakers?
• Test with CMU voices
• CMU listeners
• Non-CMU listeners
• Results are still basically the same
• Voices that convert better are the same
• Knowing the speakers doesn’t make a difference

Objective Measures
• Need to have an automatic measure too
• Mel-Cepstral Distortion
– Euclidean distance between MFCCs
– Lower order MFCCs have bigger magnitudes
– Thus this is scaled to favor lower order MFCCs
– Scaling factor is to make it a nice number
• Results are between 3.5 and 6.5 (smaller is better)

Cross Lingual Voice Conversion
• Have your voice in another language
• Speech-to-speech translation systems
• It’s OK to be non-native accented
• But we need parallel data
• But I can’t speak X
• Fake it, but it will be *very* accented
• Use source phones, but it will sound *very* *very* accented
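
The Mel-Cepstral Distortion measure from the Objective Measures slide can be sketched as follows. The slides do not give the exact formula, so this assumes the commonly used form: per frame, (10 / ln 10) * sqrt(2 * sum of squared coefficient differences), averaged over frames. It also assumes the frames are already time-aligned and that coefficient 0 (energy) is skipped. The function name and the input layout are hypothetical.

```python
import math


def mel_cepstral_distortion(reference_frames, converted_frames):
    """Mean MCD in dB over time-aligned frames, skipping coefficient 0 (assumed convention)."""
    if not reference_frames or len(reference_frames) != len(converted_frames):
        raise ValueError("need the same non-zero number of aligned frames")
    scale = 10.0 / math.log(10.0)  # the constant that turns the distance into dB
    total = 0.0
    for ref, conv in zip(reference_frames, converted_frames):
        squared = sum((r - c) ** 2 for r, c in zip(ref[1:], conv[1:]))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(reference_frames)


# Hypothetical three-coefficient frames; real systems use more coefficients per frame.
reference = [[0.0, 1.0, 0.0], [0.0, 0.5, 0.5]]
identical_score = mel_cepstral_distortion(reference, reference)
shifted_score = mel_cepstral_distortion([[0.0, 1.0, 0.0]], [[0.0, 0.0, 0.0]])
print(identical_score, shifted_score)
```

Identical inputs score zero, and the score grows as the converted frames move away from the reference, consistent with "smaller is better" on the slide.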

Cross-Lingual VC
• Find a bilingual speaker
• A(german) and A(english)
• For your non-bilingual speaker B(english)
• Build A(english)->B(english) VC model
• For cross-lingual B(german)
• A(german) plus A(english)->B(english)
• Sort of works, but not very well

Cross-lingual VC
• A voice has both
– Speaker specific components
– Language specific components
• For CLVC you want to separate these
• How do you evaluate it?
• With bilingual speaker
• With human listeners
• Does it have an accent?
• Where is this person from?
• Is this the same person speaking?

Backwards speech
• Playing speech backwards
• Still has speaker properties
• Can be language independent
• Speaker 1
• Speaker 2
• Speaker 2 or 1

New Language with VC
• In order to build support in new languages
• Use existing language databases
• Find “similar” phones in different languages
• Synthesis with these “similar” phones
• Collect parallel data
• Xenophone synthesis (X)
• Native speaker (N)
• Build VC between X->N
• Use as filter on xenophone synthesis
• Sort of works
• Some people don’t do the VC stage!
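
A backwards-speech stimulus like the one on the Backwards speech slide can be produced by reversing the audio frame by frame. This is a minimal sketch that assumes uncompressed PCM audio held in a bytes buffer. The function name and the sample bytes are hypothetical.

```python
def reverse_pcm(frames, bytes_per_frame):
    """Reverse raw PCM audio frame by frame, keeping the byte order inside each frame."""
    if bytes_per_frame <= 0 or len(frames) % bytes_per_frame != 0:
        raise ValueError("buffer length must be a multiple of bytes_per_frame")
    chunks = [frames[i:i + bytes_per_frame]
              for i in range(0, len(frames), bytes_per_frame)]
    return b"".join(reversed(chunks))


# Three hypothetical 16-bit mono frames.
pcm = b"\x01\x00\x02\x00\x03\x00"
backwards = reverse_pcm(pcm, 2)
print(backwards)
```

With Python's standard `wave` module, the buffer would come from `readframes(getnframes())`, and `bytes_per_frame` would be `getsampwidth() * getnchannels()`.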

VC factors
• Can be done in real time
• Delay needn’t be more than

