CMU CS 15492 - vc2 - D1940588


School: Carnegie Mellon University

Course: CS 15492 - Special Topic: Speech Processing

Pages: 19

This preview shows page 1-2-3-4-5-6 out of 19 pages.

Speech Processing 15-492/18-492
Voice Conversion 2

De-identification
• Remove speaker identity
• But keep it human-like
• Health Records
– HIPAA laws require this
– Not just removing names and SSNs
• Use voice conversion to get "new" voices

De-identification
• Best would be ASR to text and TTS
– But it would all sound the same
– And it would lose spontaneity
• VC to some example speaker
– But then you'd need lots of example speakers
– And you can still detect some properties of the source speaker

De-identification by GMM VC
• Using standard GMM VC
• Can still identify 50% of the voices (out of 24)
• (Humans can't but machines can)
• Need something more extreme

GMM VC plus duration
• Find the average duration of source and target speakers
• Modify length of speech by a factor
• Has to be an overall factor as no phoneme information is available
• For de-identification
– < 30% still identified
• Need to be more extreme

GMM VC + dur + Transterpolate
• What about transterpolation > 1.0?
• Certainly gives another voice
• Moved further away from source
• Examples: Original; GMM to kal; GMM to kal trans 1.2; GMM to kal trans 2.0

VC and evaluation
• How can you tell if VC works?
• Ask people
– Does it sound like the target
– (Does it not sound like the source)
• Use objective metrics
– Some automatic score

Human listening tests
• ABX tests
– Source, Target and transformed speech
– Does X sound more like A or B
• AX tests
– Source/Target and transformed speech
– Were A and X produced by the same or different speakers
• Over multiple listeners you can get consensus
• Note different results for A->B than B->A

What if you know the speakers?
• Test with CMU voices
– CMU listeners
– Non-CMU listeners
• Results are still basically the same
• Voices that convert better are the same
• Knowing the speakers doesn't make a difference

Objective Measures
• Need to have an automatic measure too
• Mel-Cepstral Distortion
– Euclidean distance between MFCCs
– Lower order MFCCs have bigger magnitudes
– Thus this is scaled to favor lower order MFCCs
– Scaling factor is to make it a nice number
• Results are between 3.5 and 6.5 (smaller is better)

Cross Lingual Voice Conversion
• Have your voice in another language
– Speech to speech translation systems
– It's OK to be non-native accented
• But we need parallel data
– But I can't speak X
– Fake it but it should be *very* accented
– Use source phones, but it sounds *very* *very* accented

Cross-Lingual VC
• Find a bilingual speaker
– A(German) and A(English)
• For your non-bilingual speaker B(English)
– Build A(English)->B(English) VC model
• For cross-lingual B(German)
– A(German) plus A(English)->B(English)
• Sort of works, but not very well

Cross-lingual VC
• A voice has both
– Speaker specific components
– Language specific components
• For CLVC you want to separate these
• How do you evaluate it?
– With a bilingual speaker
– With human listeners
– Does it have an accent
– Where is this person from
– Is this the same person speaking

Backwards speech
• Playing speech backwards
– Still has speaker properties
– Can be language independent
• Examples: Speaker 1; Speaker 2; Speaker 2 or 1

New Language with VC
• In order to build support in new languages
– Use existing language databases
– Find "similar" phones in different languages
– Synthesis with these "similar" phones
• Collect parallel data
– Xenophone synthesis (X)
– Native speaker (N)
• Build VC between X->N
– Use as filter on xenophone synthesis
• Sort of works
• Some people don't do the VC stage!

VC factors
• Can be done in real time
• Delay needn't be more than
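
The Mel-Cepstral Distortion measure from the Objective Measures slide can be sketched as follows. The slides only say it is a scaled Euclidean distance between MFCCs. The code below assumes the commonly used form, MCD = (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2), averaged over frames, and it assumes that coefficient 0 (energy) is skipped. The function name and the list-of-lists input format are hypothetical choices for illustration, not something taken from the lecture.

```python
import math


def mel_cepstral_distortion(converted_frames, target_frames):
    """Mean mel-cepstral distortion (dB) over time-aligned frames.

    Each argument is a list of frames; each frame is a list of
    mel-cepstral coefficients. Assumption: index 0 holds the energy
    term and is left out of the distance.
    """
    if len(converted_frames) != len(target_frames):
        raise ValueError("frames must be time-aligned and equal in number")
    if not converted_frames:
        raise ValueError("need at least one frame")

    # Assumed scaling constant; it turns the distance into a dB-like value.
    scale = 10.0 / math.log(10.0)

    total = 0.0
    for conv, tgt in zip(converted_frames, target_frames):
        # Squared Euclidean distance over coefficients 1..D.
        squared = sum((c - t) ** 2 for c, t in zip(conv[1:], tgt[1:]))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(converted_frames)


if __name__ == "__main__":
    # Toy numbers, not real speech data.
    converted = [[1.0, 0.5, 0.2], [1.1, 0.4, 0.1]]
    target = [[0.9, 0.3, 0.2], [1.0, 0.4, 0.3]]
    print(mel_cepstral_distortion(converted, target))
```

The sketch expects the two utterances to be aligned frame by frame already. In practice the converted and target recordings differ in length, so an alignment step such as dynamic time warping would come first; that step is not shown here.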
