CS 294-5: Statistical Natural Language Processing
Speech Synthesis
Lecture 22: 12/4/05
Slides directly from Dan Jurafsky, indirectly many others

Modern TTS systems
• 1960’s
  • First full TTS: Umeda et al. (1968)
• 1970’s
  • Joe Olive, 1977: concatenation of linear-prediction diphones
  • Speak and Spell
• 1980’s
  • 1979: MIT MITalk (Allen, Hunnicut, Klatt)
• 1990’s-present
  • Diphone synthesis
  • Unit selection synthesis

Types of Modern Synthesis
• Articulatory Synthesis: model movements of articulators and acoustics of the vocal tract
• Formant Synthesis: start with acoustics, create rules/filters to create each formant
• Concatenative Synthesis: use databases of stored speech to assemble new utterances
Text from Richard Sproat slides

TTS Demos (Mostly Unit-Selection)
• Comparisons: http://www.tmaa.com/tts/companies.htm
• ATT: http://www.naturalvoices.att.com/demos/
• Rhetorical (= Scansoft): http://www.rhetorical.com/cgi-bin/demo.cgi
• Festival: http://www-2.cs.cmu.edu/~awb/festival_demos/index.html
• IBM: http://www-306.ibm.com/software/pervasive/tech/demos/tts.shtml

TTS Architecture
[Figure: pipeline diagram, raw text in, speech out]
• Text Analysis
  • Text Normalization
  • Part-of-Speech tagging
  • Homonym Disambiguation
• Phonetic Analysis
  • Dictionary Lookup
  • Grapheme-to-Phoneme (LTS)
• Prosodic Analysis
  • Boundary placement
  • Pitch accent assignment
  • Duration computation
• Waveform synthesis

Text Normalization
• Analysis of raw text into pronounceable words
• Sample problems:
  • He stole $100 million from the bank
  • It's 13 St. Andrews St.
  • The home page is http://www.cnn.com
  • yes, see you the following tues, that's 11/12/01
• Steps
  • Identify tokens in text
  • Chunk tokens into reasonably sized sections
  • Map tokens to words
  • Identify types for words

Words to Phones
• Two methods:
  • Dictionary-based
  • Rule-based (Letter-to-sound = LTS)
• Early systems: all LTS
• MITalk was radical in having a huge 10K-word dictionary
• Now systems use a combination
  • Big dictionary
  • Special code for handling names
  • Machine-learned LTS system for other unknown words
• CMU dictionary: 127K words
  http://www.speech.cs.cmu.edu/cgi-bin/cmudict

Letter-to-Sound Rules
• Festival LTS rules:
  ( LEFTCONTEXT [ ITEMS ] RIGHTCONTEXT = NEWITEMS )
• Examples:
  ( # [ c h ] C = k )
  ( # [ c h ] = ch )
• Rules apply in order
  • “christmas” is pronounced with [k]
  • But a word with ch followed by a non-consonant is pronounced [ch], e.g., “choice”
• More modern approach: learn HMMs / CRFs

Prosody
• Prosody: getting from words+phones to boundaries, accent, F0, duration
• Prosodic phrasing
  • Need to break utterances into phrases
  • Punctuation is useful, not sufficient
• Accents:
  • Prediction of accents: which syllables should be accented
  • Realization of F0 contour: given accents/tones, generate F0 contour
• Duration:
  • Predicting duration of each phone

Three aspects of prosody
• Prominence: some syllables/words are more prominent than others
• Structure/boundaries: sentences have prosodic structure
  • Some words group naturally together
  • Others have a noticeable break or disjuncture between them
• Tune: the intonational melody of an utterance
From Ladd (1996)

Prominence: Pitch Accents
A: What types of foods are a good source of vitamins?
B1: Legumes are a good source of VITAMINS.
B2: LEGUMES are a good source of vitamins.
• Prominent syllables are:
  • Louder
  • Longer
  • Have higher F0 and/or sharper changes in F0 (higher F0 velocity)
Slide from Jennifer Venditti

Graphic representation of F0
[Figure: F0 (in Hertz, axis 50 to 400) plotted against time for “legumes are a good source of VITAMINS”]
Slide from Jennifer Venditti

The ‘ripples’
[Figure: the same F0 contour, with the consonants [t], [s], [s] marked]
F0 is not defined for consonants without vocal fold vibration.
Slide from Jennifer Venditti

The ‘ripples’
[Figure: the same F0 contour, with the consonants [v], [g], [g], [z] marked]
... and F0 can be perturbed by consonants with an extreme constriction in the vocal tract.
Slide from Jennifer Venditti

Abstraction of the F0 contour
[Figure: smoothed F0 contour for “legumes are a good source of VITAMINS”]
Our perception of the intonation contour abstracts away from these perturbations.
Slide from Jennifer Venditti

The ‘waves’ and the ‘swells’
[Figure: F0 contour for “legumes are a good source of VITAMINS”]
• ‘wave’ = accent
• ‘swell’ = phrase
Slide from Jennifer Venditti

Stress vs. Accent
• Stress is a structural property of a word: it marks a potential (arbitrary) location for an accent to occur, if there is one.
• Accent is a property of a word in context: it is a way to mark intonational prominence in order to ‘highlight’ important words in the discourse.
[Figure: metrical grids for “Ca-li-for-nia” and “vi-ta-mins”, with rows for syllables, full vowels, stressed syll, and (accented syll)]
Slide from Jennifer Venditti

Which Word is Accented?
It depends on the context. For example, the ‘new’ information in the answer to a question is often accented, while the ‘old’ information usually is not.
Q1: What types of foods are a good source of vitamins?
A1: LEGUMES are a good source of vitamins.
Q2: Are legumes a source of vitamins?
A2: Legumes are a GOOD source of vitamins.
Q3: I’ve heard that legumes are healthy, but what are they a good source of?
A3: Legumes are a good source of VITAMINS.
Slide from Jennifer Venditti

Same ‘tune’, different alignment
[Figure: F0 contour (Hz, axis 50 to 400) for “LEGUMES are a good source of vitamins”]
The main rise-fall accent (= “I assert this”) shifts locations.
Slide from Jennifer Venditti

Same ‘tune’, different alignment
[Figure: F0 contour (Hz, axis 50 to 400) for “Legumes are a GOOD source of vitamins”]
The main rise-fall accent (= “I assert this”) shifts locations.
Slide from Jennifer Venditti

Same ‘tune’, different alignment
[Figure: F0 contour (Hz, axis 50 to 400) for “legumes are a good source of VITAMINS”]
The main rise-fall accent (= “I assert this”) shifts locations.
Slide from Jennifer Venditti

Broad focus
[Figure: F0 contour (Hz, axis 50 to 400) for “legumes are a good source of vitamins”]
“Tell me something about the world.”
In the absence of narrow focus, English tends to mark the first and last ‘content’ words with perceptually prominent accents.
Slide from Jennifer Venditti

Yes-No question tune
[Figure: F0 contour (Hz, axis 50 to 550) for “are LEGUMES a good source of vitamins”]
Rise from the main accent to the end of the sentence.
Slide from Jennifer Venditti

Yes-No question tune
[Figure: F0 contour (Hz, axis 50 to 550) for “are legumes a GOOD source of vitamins”]
Rise from the main accent to the end of the sentence.
Slide from
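Worked example for the "Letter-to-Sound Rules" slide: the ordered-rule behavior described there can be sketched in a few lines of Python. This is a minimal illustration, not Festival's implementation. The `Rule` tuple, the `context_ok` and `apply_rules` helpers, the definition of the consonant class `C`, and the fallback that copies unmatched letters through unchanged are all assumptions made for the sketch; only the two `ch` rules and the rule format come from the slide.

```python
from typing import List, NamedTuple

# Assumed membership of the consonant class "C"; the slide does not define it.
CONSONANTS = set("bcdfghjklmnpqrstvwxyz")


class Rule(NamedTuple):
    """One rule in the slide's format: ( LEFT [ ITEMS ] RIGHT = NEWITEMS )."""
    left: str          # "" = no constraint, "#" = word boundary
    items: str         # letters the rule rewrites
    right: str         # "" = no constraint, "C" = consonant, "#" = word boundary
    output: List[str]  # phones emitted when the rule fires


# The two example rules from the slide, in the order given there.
RULES = [
    Rule("#", "ch", "C", ["k"]),   # ( # [ c h ] C = k )
    Rule("#", "ch", "", ["ch"]),   # ( # [ c h ] = ch )
]


def context_ok(symbol: str, word: str, pos: int) -> bool:
    """Check one context symbol against the letter at index pos (may be off either end)."""
    if symbol == "":
        return True
    at_boundary = pos < 0 or pos >= len(word)
    if symbol == "#":
        return at_boundary
    if symbol == "C":
        return not at_boundary and word[pos] in CONSONANTS
    raise ValueError(f"unsupported context symbol: {symbol!r}")


def apply_rules(word: str, rules: List[Rule] = RULES) -> List[str]:
    """Scan left to right; at each position the first matching rule wins."""
    word = word.lower()
    phones: List[str] = []
    i = 0
    while i < len(word):
        for rule in rules:
            end = i + len(rule.items)
            if (word.startswith(rule.items, i)
                    and context_ok(rule.left, word, i - 1)
                    and context_ok(rule.right, word, end)):
                phones.extend(rule.output)
                i = end
                break
        else:
            # Assumed placeholder fallback: copy the letter through unchanged.
            phones.append(word[i])
            i += 1
    return phones


print(apply_rules("christmas"))
print(apply_rules("choice"))
```

Because the more specific rule is listed first, "christmas" takes the `k` rule and "choice" falls through to the `ch` rule. Swapping the two rules would make the general rule shadow the specific one, which is why rule order matters.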