Speech Processing 15-492/18-492
Speech Synthesis
Talking heads
Singing Synthesis

More Information is Better
- Voice + text is easier to understand
- Voice + face is easier too

Talking Heads
- Adds novelty/character/personification
- Experiments show better understanding
  - Lip synching
  - Facial movements
- Listeners swear it's better synthesis

Talking Heads
- Synthesize text
- Output phone position in audio stream
- Map phones to lip/tongue positions
- Build visual stream
  - Choose appropriate frames
  - Aligned with audio
- How many facial positions?

Visemes
- Baphy
  - Three positions
  - Closed, open and rounded
- Rho
  - 10 lip positions
  - Eyelid 4
  - Eyes 2
- When should they align?
  - Follow trajectories, not just at time instant
  - Shape for syllables, not just phones

Synthesis Analogies
- Articulatory synthesis
  - Modeling the vocal tract
  - Baldi: movement of muscles
- Formant:
  - Modeling of signal synthetically
  - Cartoon-based faces (Baphy)
- Concatenative
  - Joining natural segments
  - JPL example
  - Interval's Video Rewrite
- Unit size
  - Baphy == uniphone
  - JPL == diphone
  - Video Rewrite == unit selection

Talking Heads
- Personalization:
  - Can look like a mask put on a dummy
- Uncanny valley
  - The more human-like, the more critical we are
- 3-D movement (in real time)
  - Second Life-type characters
  - Gesture generation too
- Off-line (Gollum, Jabba the Hutt)
  - Usually actors do the voices

Singing Synthesis
- Simple pitch and duration control
  - But singing is more than that
- Proper singing synthesis
  - Recording a singing database
  - Phonetic, prosodic, and singing style coverage
  - Sung rather than spoken voice

Flinger (Festival Singer) (Macon)
- Sinusoidal modeling
  - More pitch control than just PSOLA
- MIDI interface
  - Allow mixing with music
  - Standard MIDI authoring techniques

Festival Singing Mode
- Dominic Mazzoni (11-752 project 2001)
- XML based song description
    <DURATION BEATS="1.0">
      <PITCH NOTE="C4">Oh</PITCH>
    </DURATION>
- But not just setting pitch at duration point
  - When do you move it (based on syllable and voicing)
  - How quickly do you move pitch

Singing Example
    <?xml version="1.0"?>
    <!DOCTYPE SINGING PUBLIC "-//SINGING//DTD SINGING mark up//EN"
      "Singing.v0_1.dtd" []>
    <SINGING BPM="30">
    <PITCH NOTE="G3"><DURATION BEATS="0.3">doe</DURATION></PITCH>
    <PITCH NOTE="A3"><DURATION BEATS="0.3">ray</DURATION></PITCH>
    <PITCH NOTE="B3"><DURATION BEATS="0.3">me</DURATION></PITCH>
    <PITCH NOTE="C4"><DURATION BEATS="0.3">fah</DURATION></PITCH>
    <PITCH NOTE="D4"><DURATION BEATS="0.3">sew</DURATION></PITCH>
    <PITCH NOTE="E4"><DURATION BEATS="0.3">lah</DURATION></PITCH>
    <PITCH NOTE="F#4"><DURATION BEATS="0.3">tee</DURATION></PITCH>
    <PITCH NOTE="G4"><DURATION BEATS="0.3">doe</DURATION></PITCH>
    </SINGING>

Future in TTS
- More natural voices
  - Sound human
  - Interact in a human way (not just words)
- More personalization
  - Sound like a particular person
  - Cross-lingual synthesis
- More flexible
  - Say it with more feeling
- Realtime voice transformation
  - Have an American accent while you speak

Text to speech process
- Text analysis
  - From characters to words
- Linguistic analysis
  - From words to pronunciations
- Waveform analysis
  - From pronunciations to noises

HW2: TTS
- Due 3:30pm Monday October 20th
- Install Festival and Festvox
- Find 10 errors in each of two different synthesizers
- Build a voice
  - A Talking Clock
  - A general voice
  - (or both)
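
Returning to the Singing Example slide: a minimal sketch of how that markup could be turned into per-syllable pitch and duration targets. This is not Festival's implementation. It assumes equal temperament with A4 = 440 Hz, MIDI-style octave numbering (C4 is middle C), and one syllable per element. The names `note_to_hz` and `song_to_targets` are hypothetical, and the XML prolog and DOCTYPE are left out of the sample string for brevity. It only sets a flat target per note, so it does not address the slide's harder questions of when and how quickly to move pitch.

```python
import re
import xml.etree.ElementTree as ET

# The song from the Singing Example slide (prolog and DOCTYPE omitted).
SONG = """<SINGING BPM="30">
<PITCH NOTE="G3"><DURATION BEATS="0.3">doe</DURATION></PITCH>
<PITCH NOTE="A3"><DURATION BEATS="0.3">ray</DURATION></PITCH>
<PITCH NOTE="B3"><DURATION BEATS="0.3">me</DURATION></PITCH>
<PITCH NOTE="C4"><DURATION BEATS="0.3">fah</DURATION></PITCH>
<PITCH NOTE="D4"><DURATION BEATS="0.3">sew</DURATION></PITCH>
<PITCH NOTE="E4"><DURATION BEATS="0.3">lah</DURATION></PITCH>
<PITCH NOTE="F#4"><DURATION BEATS="0.3">tee</DURATION></PITCH>
<PITCH NOTE="G4"><DURATION BEATS="0.3">doe</DURATION></PITCH>
</SINGING>"""

# Semitone offset of each natural note above C within one octave.
SEMITONES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
ACCIDENTALS = {"": 0, "#": 1, "b": -1}


def note_to_hz(note):
    """Convert a name such as 'F#4' to Hz (assumes A4 = 440 Hz, equal temperament)."""
    match = re.fullmatch(r"([A-G])([#b]?)(-?\d+)", note)
    if match is None:
        raise ValueError(f"unrecognized note name: {note!r}")
    letter, accidental, octave = match.groups()
    # MIDI note number: C4 = 60, A4 = 69.
    midi = 12 * (int(octave) + 1) + SEMITONES[letter] + ACCIDENTALS[accidental]
    return 440.0 * 2.0 ** ((midi - 69) / 12.0)


def _walk(elem, note, beats, seconds_per_beat, out):
    """Carry the enclosing NOTE and BEATS down the tree, whichever tag is outer."""
    if elem.tag == "PITCH":
        note = elem.get("NOTE")
    elif elem.tag == "DURATION":
        beats = float(elem.get("BEATS"))
    text = (elem.text or "").strip()
    if text and note is not None and beats is not None:
        out.append((text, note_to_hz(note), beats * seconds_per_beat))
    for child in elem:
        _walk(child, note, beats, seconds_per_beat, out)


def song_to_targets(xml_text):
    """Return a list of (syllable, frequency in Hz, duration in seconds)."""
    root = ET.fromstring(xml_text)
    seconds_per_beat = 60.0 / float(root.get("BPM"))
    targets = []
    _walk(root, None, None, seconds_per_beat, targets)
    return targets


if __name__ == "__main__":
    for syllable, hz, seconds in song_to_targets(SONG):
        print(f"{syllable:4s} {hz:7.2f} Hz  {seconds:.2f} s")
```

The walker accepts both nestings shown on the slides, DURATION inside PITCH and PITCH inside DURATION, because it tracks whichever attributes are in scope rather than expecting a fixed order.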