Columbia COMS W4706 - Back-End Synthesis and Evaluation

Back-End Synthesis and Evaluation
Julia Hirschberg
CS 4706
(*Thanks to Dan and Jim)

Outline
• Waveform Generation
• Diphones
• Unit Selection
• HMM Synthesis
• TTS Evaluation
• Objective Measures
• Subjective Measures

3/10/2010, Speech and Language Processing, Jurafsky and Martin

Modern TTS systems
• 1960’s: first full TTS: Umeda et al. (1968)
• 1970’s
  – Joe Olive 1977: concatenation of linear-prediction diphones
  – Speak and Spell
• 1980’s
  – 1979 MIT MITalk (Allen, Hunnicutt, Klatt)
• 1990’s-present
  – Diphone synthesis
  – Unit selection synthesis
  – HMM synthesis

Architectures of Modern Synthesis
• Articulatory Synthesis:
  – Model movements of articulators and acoustics of vocal tract
• Formant Synthesis:
  – Start with acoustics, create rules/filters to create each formant
• Concatenative Synthesis:
  – Use databases of stored speech to assemble new utterances
• HMM Synthesis
Text from Richard Sproat slides

Formant Synthesis
• Formant synthesizers were the most common commercial systems while computers were relatively underpowered
• 1979 MIT MITalk (Allen, Hunnicutt, Klatt)
• 1983 DECtalk system
• The voice of Stephen Hawking

Concatenative Synthesis
• All current commercial systems
• Diphone Synthesis
  – Units are diphones: middle of one phone to middle of next
  – Why? Middle of phone is steady state
  – Record 1 speaker saying each diphone
• Unit Selection Synthesis
  – Larger units
  – Record 10 hours or more, so have multiple copies of each unit
  – Use search to find best sequence of units

TTS Demos (all are Unit-Selection)
• Festival
  – http://www-2.cs.cmu.edu/~awb/festival_demos/index.html
• Cepstral
  – http://www.cepstral.com/cgi-bin/demos/general
• IBM
  – http://www-306.ibm.com/software/pervasive/tech/demos/tts.shtml

How do we get from Text to Speech?
• TTS “Backend” only covers the segments+f0+duration to waveform part
• A full system needs to go all the way from random text to sound

Two steps
• PG&E will file schedules on April 20.
• TEXT ANALYSIS: Text into intermediate representation
• WAVEFORM SYNTHESIS: From the intermediate representation into waveform

The Hourglass

Waveform Synthesis
• Given:
  – String of phones
  – Prosody
    • Desired F0 for entire utterance
    • Duration for each phone
    • Stress value for each phone, possibly accent value
• Generate:
  – Waveforms

Diphone TTS Architecture
• Training:
  – Choose units (kinds of diphones)
  – Record 1 speaker saying 1 example of each diphone
  – Mark the boundaries of each diphone
    • Cut each diphone out to create a diphone database
• Synthesizing an utterance:
  – Select relevant set of diphones from database
  – Concatenate them in order, doing minor signal processing at boundaries
  – Use signal processing techniques to change prosody (F0, energy, duration) of sequence

Diphones
• Where is the stable region?

Diphones
• Middle of phone more stable than edges
• Need O(phone^2) number of units
  – Some phone-phone sequences don’t exist
  – AT&T (Olive et al. ’98) system had 43 phones
    • 43 x 43 = 1849 possible diphones
    • Phonotactics ([h] only occurs before vowels); don’t need to keep diphones across silence
    • Only 1172 actual diphones
  – But… may want to include stress or accent differences, consonant clusters, etc., so may need more
  – Requires much knowledge of phonetics in design
• Database relatively small (by today’s standards)
  – Around 8 megabytes for English (16 kHz, 16 bit)
Slide from Richard Sproat

Voice
• Speaker
  – Called the voice talent
  – How to choose?
• Diphone database
  – Called a voice
  – Modern TTS systems have multiple voices

Prosodic Modification
• Modifying pitch and duration independently
• Changing sample rate modifies both:
  – Chipmunk speech
• Duration: duplicate/remove parts of the signal
• Pitch: resample to change pitch
Text from Alan Black

Speech as Short Term Signals
Slide from Alan Black

Duration modification
• Duplicate/remove short term signals
Slide from Richard Sproat

Pitch Modification
• Move short-term signals closer together/further apart: more cycles per sec means higher pitch and vice versa
• Add frames as needed to maintain desired duration
Slide from Richard Sproat

TD-PSOLA™
• Time-Domain Pitch Synchronous Overlap and Add
• Patented by France Telecom (CNET)
• Epoch detection and windowing
• Pitch-synchronous
• Overlap-and-add
• Very efficient
• Can raise pitch (Hz) by up to a factor of two or lower it by up to half

Unit Selection Synthesis
• Generalization of the diphone intuition
  – Larger units
    • From diphones to sentences
  – Record many copies of each unit
    • E.g. 10 hours of speech instead of 1500 diphones (a few minutes of speech)

Unit Selection Intuition
• Given a large labeled database, find the unit that best matches the desired synthesis specification
• What does “best” mean?
  – Target cost: Find closest match in terms of
    • Phonetic context
    • F0, stress, phrase position
  – Join cost: Find best join with neighboring units
    • Matching formants + other spectral characteristics
    • Matching energy
    • Matching F0

Targets and Target Costs
• Target cost T(u_t, s_t): How well does the target specification s_t match the potential unit in the database
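
The target-cost and join-cost intuition above can be sketched as a small dynamic-programming (Viterbi-style) search. This is a minimal illustration under stated assumptions, not the method of any particular system: the `Unit` fields, the toy cost functions (absolute F0 and energy differences with equal weights), and the names `target_cost`, `join_cost`, and `select_units` are all hypothetical. Real systems use richer features such as phonetic context, stress, phrase position, and spectral distance at the join.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One recorded unit in the database (fields are illustrative assumptions)."""
    name: str      # unit label, e.g. a diphone such as "a-b"
    f0: float      # mean F0 in Hz
    energy: float  # mean energy, arbitrary units


def target_cost(spec, unit):
    """Toy target cost: distance between the desired F0 and the unit's F0."""
    return abs(spec["f0"] - unit.f0)


def join_cost(left, right):
    """Toy join cost: F0 and energy mismatch between neighboring units."""
    return abs(left.f0 - right.f0) + abs(left.energy - right.energy)


def select_units(specs, database):
    """Return (units, total_cost) minimizing summed target and join costs."""
    # Candidate units for each target position are those with a matching label.
    candidates = [[u for u in database if u.name == s["name"]] for s in specs]
    if any(not c for c in candidates):
        raise ValueError("no candidate unit for at least one target")

    # best[i][j] = (cheapest cost of a path ending in candidates[i][j], backpointer)
    best = [[(target_cost(specs[0], u), None) for u in candidates[0]]]
    for i in range(1, len(specs)):
        row = []
        for u in candidates[i]:
            costs = [best[i - 1][k][0] + join_cost(p, u)
                     for k, p in enumerate(candidates[i - 1])]
            k_best = min(range(len(costs)), key=costs.__getitem__)
            row.append((costs[k_best] + target_cost(specs[i], u), k_best))
        best.append(row)

    # Backtrace from the cheapest final state.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(specs) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total
```

For example, with two candidates per target that all have the same target cost, the search prefers the pair with the smoother join: choosing units by target cost alone could not tell them apart.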

