Back End Synthesis
Julia Hirschberg, CS 4706
(Thanks to Dan and Jim)

Architectures of Modern Synthesis
- Articulatory Synthesis: model movements of articulators and acoustics of the vocal tract
- Formant Synthesis: start with acoustics; create rules/filters to create each formant
- Concatenative Synthesis: use databases of stored speech to assemble new utterances
- HMM Synthesis
(01/15/2019. Text from Richard Sproat slides. Speech and Language Processing, Jurafsky and Martin)

Formant Synthesis
- Most common commercial systems while computers were relatively underpowered
- 1979: MIT MITalk (Allen, Hunnicutt, Klatt)
- 1983: DECtalk system
  - Voice of Stephen Hawking

Concatenative Synthesis
- All current commercial systems
- Diphone Synthesis
  - Units are diphones: middle of one phone to middle of next
  - Why? Middle of phone is steady state
  - Record 1 speaker saying each diphone
- Unit Selection Synthesis
  - Larger units
  - Record 10 hours or more, so there are multiple copies of each unit
  - Use search to find best sequence of units

TTS Demos (all Unit Selection)
- Festival: http://www-2.cs.cmu.edu/~awb/festival_demos/index.html
- Cepstral: http://www.cepstral.com/cgi-bin/demos/general
- AT&T: http://www2.research.att.com/~ttsweb/tts/demo.php

How do we get from Text to Speech?
- TTS backend takes segments + f0 + duration and creates a waveform
- A full system needs to go all the way from random text to sound

Front End and Back End
- Example input: "PG&E will file schedules on April 20."
- TEXT ANALYSIS: text to intermediate representation
- WAVEFORM SYNTHESIS: from intermediate representation to waveform

The Hourglass (figure)

Waveform Synthesis
- Given:
  - String of phones
  - Prosody
    - Desired F0 for entire utterance
    - Duration for each phone
    - Stress value for each phone, possibly accent value
- Generate:
  - Waveforms

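As a rough illustration of the back-end input listed under Waveform Synthesis, the sketch below stores a phone string with per-phone duration and stress plus an utterance-level F0 contour, and then names the diphones (middle of one phone to middle of the next) that span it. The class names, field names, the "pau" silence label, and the choice to pad both ends with silence are all assumptions made for this example; they do not come from the slides or from any particular TTS system.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PhoneSpec:
    """One segment of the back-end input (hypothetical field names)."""
    phone: str            # phone label, e.g. "ae"
    duration_ms: float    # desired duration for this phone
    stressed: bool = False


@dataclass
class UtteranceSpec:
    """Phones plus an utterance-level F0 target as (time_ms, Hz) points."""
    phones: List[PhoneSpec]
    f0_contour: List[Tuple[float, float]] = field(default_factory=list)

    def total_duration_ms(self) -> float:
        # Sum of the per-phone target durations.
        return sum(p.duration_ms for p in self.phones)


def to_diphones(phones: List[str], silence: str = "pau") -> List[str]:
    """Name the diphones spanning a phone string.

    Assumption: the utterance is padded with silence at both ends, so
    N phones yield N + 1 diphone names.
    """
    padded = [silence] + list(phones) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


# Toy utterance: the word "cat" as three phones with made-up prosody values.
spec = UtteranceSpec(
    phones=[
        PhoneSpec("k", 60.0),
        PhoneSpec("ae", 120.0, stressed=True),
        PhoneSpec("t", 70.0),
    ],
    f0_contour=[(0.0, 120.0), (250.0, 100.0)],
)
diphone_names = to_diphones([p.phone for p in spec.phones])
print(diphone_names)
print(spec.total_duration_ms())
```

A diphone back end would look up each of these names in its database and concatenate the stored segments; a real system would also carry accent values and a denser F0 contour.
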
Diphone TTS Architecture
- Training:
  - Choose units (kinds of diphones)
  - Record 1 speaker saying at least 1 example of each
  - Mark boundaries and segment to create diphone database
- Synthesizing from diphones:
  - Select relevant set of diphones from database
  - Concatenate them in order, doing minor signal processing at boundaries
  - Use signal processing techniques to change prosody (F0, energy, duration) of sequence

Diphones
- Where is the stable region? (figure)

Diphone Database
- Middle of phone more stable than edges
- Need O(phone^2) number of units
  - Some phone-phone sequences don't exist
  - AT&T (Olive et al. '98) system had 43 phones: 1849 possible diphones, but only 1172 actual
  - Phonotactics: [h] only occurs before vowels
  - Don't need diphones across silence
  - But may want to include stress or accent differences, consonant clusters, etc.
- Requires much knowledge of phonetics in design
- Database relatively small by today's standards
  - Around 8 megabytes for English (16 kHz, 16-bit)

Voice
- Speaker
  - Called the voice talent
  - How to choose?
- Diphone database
  - Called a voice
  - Modern TTS systems have multiple voices

Prosodic Modification
- Modifying pitch and duration independently
- Changing sample rate modifies both: "chipmunk speech"
- Duration: duplicate/remove parts of the signal
- Pitch: resample to change pitch
(Text from Alan Black)

Speech as Sequence of Short Term Signals (figure; Alan Black)

Duration Modification
- Duplicate/remove short term signals
(Slide from Richard Sproat)

Pitch Modification
- Move short term signals closer together/further apart: more cycles per second means higher pitch, and vice versa
- Add frames as needed to maintain desired duration
(Slide from Richard Sproat)

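The two modifications above can be pictured with a toy sketch that treats speech as a list of short-term frames. The first function changes duration by duplicating or dropping frames; the second computes where frame centers would sit after a pitch change, which shows why frames must be added or removed to keep the duration fixed. Both function names and the nearest-frame mapping are assumptions for illustration only. This is not TD-PSOLA itself: it omits epoch detection, windowing, and overlap-add.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def change_duration(frames: Sequence[Frame], factor: float) -> List[Frame]:
    """Stretch (factor > 1) or shrink (factor < 1) by repeating or skipping frames."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not frames:
        return []
    n_out = max(1, round(len(frames) * factor))
    # Each output slot copies the source frame it maps back to.
    return [frames[min(len(frames) - 1, int(i / factor))] for i in range(n_out)]


def respaced_epochs(n_samples: int, period: float, pitch_factor: float) -> List[int]:
    """Frame-center positions after scaling pitch by pitch_factor.

    Moving frames closer together (smaller spacing) raises the pitch, so
    more frames are needed to fill the same n_samples.
    """
    if pitch_factor <= 0 or period <= 0:
        raise ValueError("period and pitch_factor must be positive")
    spacing = period / pitch_factor
    count = int(n_samples / spacing)
    return [round(i * spacing) for i in range(count)]


# Doubling duration repeats every frame; halving keeps every other frame.
print(change_duration(list("abcd"), 2.0))
print(change_duration(list("abcd"), 0.5))

# 800 samples at a 160-sample period (100 Hz at 16 kHz) hold 5 pitch periods.
# Doubling the pitch halves the spacing, so 10 frames are needed instead.
print(len(respaced_epochs(800, 160, 1.0)), len(respaced_epochs(800, 160, 2.0)))
```

In a pitch-synchronous method the frames would be windowed segments centered on detected epochs, and the respaced frames would be summed where they overlap.
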
TD-PSOLA
- Time-Domain Pitch-Synchronous Overlap-and-Add
- Patented by France Telecom (CNET)
- Epoch detection and windowing
- Pitch-synchronous
- Overlap-and-add
- Very efficient
- Can modify Hz up to two times or by half
- Smoother transitions

Unit Selection Synthesis
- Generalization of the diphone intuition
- Larger units
  - From diphones to phrases to sentences
- Record many copies of each unit
  - E.g., 10 hours of speech instead of 1500 diphones (a few minutes of speech)
- Label diphones and their midpoints

Unit Selection Intuition
- Given a large labeled database, find the unit that best matches the desired synthesis specification
- What does "best" mean?
  - Target cost: find closest match in terms of
    - Phonetic context
    - F0, stress, phrase position
  - Join cost: find best join with neighboring units
    - Matching formants + other spectral characteristics
    - Matching energy
    - Matching F0

Targets and Target Costs
- Target cost C^t(t, u): how well does target specification t match database unit u?
- Goal: find unit least unlike target
- Examples of labeled diphone midpoints:
  - ih-t: stress, phrase internal, high F0, content word
  - n-t: stress, phrase final, high F0, function word
  - dh-ax: stress, phrase initial, low F0, word "the"
- Costs of different features have different weights

Target Costs
- Composed of p weighted subcosts
  - Stress
  - Phrase position
  - F0
  - Phone duration
  - Lexical identity
- Target cost for a unit:

\[
C^{t}(t_i, u_i) = \sum_{k=1}^{p} w_k^{t} \, C_k^{t}(t_i, u_i)
\]

Join (Concatenation) Cost
- Measure of smoothness of join between two database units u_i and u_j (target irrelevant)
- Features, costs, and weights
- Composed of p weighted subcosts:
  - Spectral features
  - F0
  - Energy
- Join cost:

\[
C^{j}(u_{i-1}, u_i) = \sum_{k=1}^{p} w_k^{j} \, C_k^{j}(u_{i-1}, u_i)
\]

Total Costs
- Hunt and Black 1996
- We now have weights (per phone type) for features set between target and database units
- Find best path of units through database that minimizes:

\[
C(t_1^n, u_1^n) = \sum_{i=1}^{n} C^{\mathrm{target}}(t_i, u_i) + \sum_{i=2}^{n} C^{\mathrm{join}}(u_{i-1}, u_i)
\]

\[
\hat{u}_1^n = \operatorname*{argmin}_{u_1, \ldots, u_n} C(t_1^n, u_1^n)
\]

- Standard problem solvable with Viterbi search, with beam width constraint for pruning
(Slide from Paul Taylor)

Synthesizing (figure)

Unit Selection Summary
- Advantages
  - Quality far superior to diphones: fewer joins, more choices of units
  - Natural prosody selection sounds better
- Disadvantages
  - Quality very bad when no good match in
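
The search described under Total Costs (sum of target costs plus sum of join costs, minimized over unit sequences) can be sketched as a small Viterbi search. This is a minimal sketch under stated assumptions, not the Hunt and Black implementation: the function name `select_units`, the representation of units as (name, F0) tuples, and the toy cost functions are all hypothetical, and the beam-width pruning mentioned on the slide is left out for brevity.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")  # target specification type
U = TypeVar("U")  # database unit type


def select_units(
    targets: Sequence[T],
    candidates: Sequence[Sequence[U]],
    target_cost: Callable[[T, U], float],
    join_cost: Callable[[U, U], float],
) -> Tuple[List[U], float]:
    """Return the candidate sequence with minimum total target + join cost."""
    if not targets:
        return [], 0.0
    # best[i][j] = (cheapest cost of a path ending in candidates[i][j], backpointer)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            # Cheapest way to reach u from any unit in the previous position.
            prev_cost, prev_j = min(
                (best[i - 1][j][0] + join_cost(p, u), j)
                for j, p in enumerate(candidates[i - 1])
            )
            row.append((prev_cost + target_cost(targets[i], u), prev_j))
        best.append(row)
    # Pick the cheapest final unit, then follow backpointers to the start.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


# Toy example: targets are desired F0 values; units are (name, F0) pairs.
targets = [100, 120, 110]
candidates = [
    [("pau-k#1", 100), ("pau-k#2", 140)],
    [("k-ae#1", 120), ("k-ae#2", 180)],
    [("ae-t#1", 90), ("ae-t#2", 112)],
]


def f0_target_cost(t, u):
    # Single subcost: absolute F0 mismatch between target and unit.
    return abs(t - u[1])


def f0_join_cost(a, b):
    # Single subcost with weight 0.5: F0 jump across the join.
    return 0.5 * abs(a[1] - b[1])


path, total = select_units(targets, candidates, f0_target_cost, f0_join_cost)
print([name for name, _ in path], total)
```

With these made-up numbers the search prefers the units whose F0 is close to the target and changes little across joins. A fuller version would combine several weighted subcosts per the target and join cost formulas and keep only the best few partial paths at each step.
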


Columbia CS 4706 - backend-synthesis
