Speech Processing 15-492/18-492
Speech Synthesis: Signal Processing

Signal Manipulation
- Signal Parameterization
- Joining
- LPC
- PSOLA: pitch and duration modification
- Statistical Parameterization
- MELCEP/MLSA
- LSF, STRAIGHT, HNM, HSM

TTS Signal Processing
- Join together pieces of speech
- Prosodic modification
  - Pitch (F0)
  - Duration
  - Power
- Change spectral properties
  - Stress/unstress
  - Spectral tilt
  - Speaking style

Joining
- Just put them together
  - Gets clicks at join points
- Join them at zero crossings
- Window them and overlap them
  - WSOLA
- Join them at pitch periods

Prosodic Modification
- Modify pitch and duration independently
- Changing sample rate changes both
  - “chipmunk” style speech
- Duration
  - Duplicate/delete parts of the signal
- Pitch
  - “resample” to change pitch

Speech and Short Term Signals

Duration Modification

Pitch Modification

Modify pitch and duration
- Find ideal pitch periods and duration
- Find closest actual periods from units
- End with
  - Pitch period (short term signals)
  - Distances between them

Signal Reconstruction

TD-PSOLA™
- Time domain pitch synchronous overlap and add
- Patented by France Telecom
  - Expired 2004
- Very efficient:
  - No FFT (or inverse FFT)
- Can modify Hz * 2.0 (or 0.5)
- The reason no one publishes algorithms
- The (partial) reason unit selection typically doesn’t do pitch/duration modification

LPC: Linear predictive coding
- Linear predictive coding
  - Predict next sample point from previous
  - Weighted sum of previous points
  - Filter of order p
  - Residual excited LPC

LPC
- Works well but can be buzzy
- Can be very compact
- Can be pitch synchronous
- Excited
  - Pulse
  - Triangular pulse
  - Multi-pulse
  - Full residual
- Used in standard speech coding
  - LPC10: 2.4 kbps
  - CELP: codebook excited LPC

Other Parametric Representations
- Typically split spectral and residual
- MBROLA:
  - Multi-band overlap and add
- HNM/HSM:
  - Harmonic plus (noise/stochastic) modeling
- STRAIGHT
- MELCEP/MLSA
  - Often used in HMM synthesis
- Sinusoidal (HARMONIC)
- Wavelet
- LSF/LPC

Choosing the right unit type
- Diphones
  - Phone-phone
  - Joins at stable portions, not transitions
- Half phone (AT&T Natural Voices)
- Hybrid systems (Hadifix - Bonn systems)
- Other selection systems:
  - Syllable, phone, HMM state
  - Even frame level

Acoustically Derived Units
- E.g. Bacchiani 99 or Rita Singh CMU
- From some waveforms
- Find N most diverse unit types
- Varied in length
- Still need to map letters to units

Acoustic Phonetic Clustering
- Parameterize database
  - Melcep plus power
- K-means
  - Euclidean distance measure
  - 100 clusters
- Label DB with best cluster
- Build clunits synthesizer
- Can’t predict APC cluster directly
- Use held out data for testing

Acoustic Phonetic Clustering

Grapheme Based Synthesis
- Synthesis without a phoneme set
- Use the letters as phonemes
  - ("alan" nil (a l a n))
  - ("black" nil (b l a c k))
- Spanish (easier?)
  - 419 utterances
  - HMM training to label databases
  - Simple pronunciation rules
  - Polici'a -> p o l i c i' a
  - Cuatro -> c u a t r o

Spanish Grapheme Synthesis

English Grapheme Synthesis
- Use letters as phones
- 26 “phonemes”
  - ("alan" n (a l a n))
  - ("black" n (b l a c k))
- Build HMM acoustic models for labeling
- For English
  - “This is a pen”
  - “We went to the church at Christmas”
  - Festival intro
  - “do eight meat”
- Requires method to fix errors
  - Letter to letter mapping

Signal Processing for TTS
- Pitch and duration modification
- LPC
- Finding the right unit type
- Grapheme-based Synthesis

HW1: TTS
- Due 3:30pm Friday October 2nd
- Install
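
Example sketch: duration modification
The "Duplicate/delete parts of the signal" idea from the prosodic modification slides can be illustrated with a small sketch. It assumes the signal has already been cut into pitch periods (short term signals) and only decides which source period fills each output slot. The function names and the floor-based mapping are illustrative assumptions, not the method used in the course; real TD-PSOLA would also window and overlap-add the selected periods.

```python
def select_periods(num_periods, duration_factor):
    """Return the indices of source pitch periods to emit, in output order.

    duration_factor > 1 lengthens the signal (periods get duplicated);
    duration_factor < 1 shortens it (periods get skipped).
    Hypothetical helper: each output slot is mapped back to a source
    period by flooring its ideal position.
    """
    num_targets = max(1, round(num_periods * duration_factor))
    return [
        min(num_periods - 1, int(i / duration_factor))
        for i in range(num_targets)
    ]


def change_duration(periods, duration_factor):
    """Rebuild the period sequence with periods duplicated or deleted.

    `periods` is a list of short term signals (one list of samples per
    pitch period). Pitch is unchanged because each period is kept whole.
    """
    return [periods[i] for i in select_periods(len(periods), duration_factor)]


# Four toy "pitch periods", each just a short list of samples.
toy_periods = [[0.0, 1.0], [0.0, 2.0], [0.0, 3.0], [0.0, 4.0]]
longer = change_duration(toy_periods, 2.0)   # every period used twice
shorter = change_duration(toy_periods, 0.5)  # every other period dropped
```

Because whole periods are repeated or removed, the spacing between pitch marks, and therefore F0, stays the same while the duration changes, which is what distinguishes this from simply changing the sample rate.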
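
Example sketch: residual excited LPC
The LPC slides describe predicting each sample as a weighted sum of the previous p samples and exciting the filter with the residual. A minimal sketch of that analysis/synthesis pair follows. It assumes the predictor weights are already known (estimating them, for example with the autocorrelation method, is not shown), and the function names and the example coefficients are illustrative assumptions.

```python
def lpc_residual(signal, coeffs):
    """Analysis: e[n] = x[n] - sum_k a_k * x[n-k], for k = 1..p.

    Samples before the start of the signal are treated as zero.
    """
    residual = []
    for n, x in enumerate(signal):
        predicted = sum(
            a * signal[n - k]
            for k, a in enumerate(coeffs, start=1)
            if n - k >= 0
        )
        residual.append(x - predicted)
    return residual


def lpc_synthesize(excitation, coeffs):
    """Synthesis: x[n] = e[n] + sum_k a_k * x[n-k], for k = 1..p."""
    out = []
    for n, e in enumerate(excitation):
        predicted = sum(
            a * out[n - k]
            for k, a in enumerate(coeffs, start=1)
            if n - k >= 0
        )
        out.append(e + predicted)
    return out


# Assumed order-2 predictor (a stable, decaying resonance).
example_coeffs = [1.8, -0.9]

# Single-pulse excitation: one impulse drives the filter.
pulse = [1.0] + [0.0] * 19
resonance = lpc_synthesize(pulse, example_coeffs)

# Full-residual excitation: analysis followed by synthesis
# gives back the original signal.
residual = lpc_residual(resonance, example_coeffs)
rebuilt = lpc_synthesize(residual, example_coeffs)
```

Exciting the filter with the full residual reconstructs the signal, while replacing the residual with a simple pulse keeps the representation compact at the cost of the buzzy quality the slides mention.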
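
Example sketch: grapheme lexicon entries
The grapheme-based synthesis slides use the letters of a word as its "phonemes". The sketch below generates entries in the same shape as the slide examples, such as ("alan" nil (a l a n)), and applies the Spanish rule shown on the slides where an accented vowel becomes the vowel plus an apostrophe. The function names and the accent table are assumptions made for illustration, and this is not Festival's own lexicon code.

```python
# Assumed mapping for Spanish acute-accented vowels, following the
# slide example "Polici'a -> p o l i c i' a".
ACCENTED = {
    "\u00e1": "a'",  # a with acute accent
    "\u00e9": "e'",  # e with acute accent
    "\u00ed": "i'",  # i with acute accent
    "\u00f3": "o'",  # o with acute accent
    "\u00fa": "u'",  # u with acute accent
}


def graphemes(word):
    """Split a word into letter units, one 'phone' per letter."""
    units = []
    for ch in word.lower():
        if ch in ACCENTED:
            units.append(ACCENTED[ch])
        elif ch.isalpha():
            units.append(ch)  # non-letters are dropped
    return units


def lexicon_entry(word, pos="nil"):
    """Format an entry like the ones on the slides: ("word" pos (l e t t e r s))."""
    return '("%s" %s (%s))' % (word.lower(), pos, " ".join(graphemes(word)))


entries = [lexicon_entry(w) for w in ("alan", "black", "Cuatro")]
```

For English this yields 26 possible units; as the slides note, the result still needs a way to fix errors, since letters map poorly to sounds in many words.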