TAMU CSCE 689 - arslan1999voiceConversionSLIDES


Speaker Transformation Algorithm using Segmental Codebooks (STASC)
Presented by A. Brian Davis

Speaker Transformation
● Goal: map the acoustic properties of one speaker onto another
● Uses:
− Personification of text-to-speech systems
− Multimedia
− Preprocessing step for speech recognition
● Reduces speaker variability
● Practical?

Steps Involved
● Training phase
− Given speech input from the source and target speakers, form a spectral transformation
− Inputs/outputs to the transformation:
● Segment speech into small chunks (frames)
− Formants
− LPC cepstrum coefficients
− Others (excitation)?
− Can we generalize the behavior of the transform?
● Codebooks/codewords
− Vector quantization

Vector Quantization
● Assign vectors to a discrete set of values
● K-Means
− For STASC, we also want the "average" of all vectors assigned to a class
− K-Means gives us this for free
(Figure shamelessly stolen from Dr. Gutierrez's pattern recognition slides)

LSFs
● Line spectral frequencies
● Derived (losslessly) from LPCs
− Can convert to/from LPCs, thus can create speech from LSFs
● Related to formant frequencies
− Used in STASC to represent the speakers' vocal tracts
● Stable
● Why use these instead of MFCCs?

STASC (first method)
● Assumes an orthographic transcription
− What's said, in writing
● Phonemes are retrieved from the transcription
− Speech segments are assigned phonemes based on the transcription
● MFCCs and delta-MFCCs for each segment (frame) are passed into an HMM; the most likely path is found with the Viterbi algorithm
− LSFs are calculated per frame and labeled with the phoneme from the HMM
− Phoneme centroids are calculated (the average LSF values of all vectors labeled with a particular phoneme)
− One-to-one mapping

Second method (better)
● No orthographic transcription
− Intuitively, we know the HMM states in the first method didn't need to correspond to phonemes
● Requires speakers to speak the same (hopefully phonetically balanced) sentence
− Sentences with phones approximately
distributed as in normal speech
− Because there are fewer restrictions, some extra processing of the speakers' speech is needed
● Normalize root-mean-squared energy
● Remove silence before/after speech

Second method transformation
● An HMM is trained on each sentence
− Data: the source speaker's speech segments
● LSF vectors
− The number of states corresponds to the sentence length
− Segmental K-means separates the speech segments into clusters
− The Baum-Welch algorithm trains the HMM on the cluster averages
● Uniform covariance matrix
● For source/target speech segments, the Viterbi algorithm assigns segments to states
● The transformation moves segments from a state in the source to the corresponding state in the target
● Centroids

Excitation characteristic
● From previous papers, we know excitation greatly influences the perception of a speaker
● Not trivial to transfer
− Very different for voiced/unvoiced sounds
● Use the current codebooks to transfer excitation
− Calculate the short-time average magnitude spectrum of the excitation signal for each "speech unit"

Codebook weight estimation
● Assume we have a vector w of LSFs labeled with an HMM state
● Also the centroids Si of each HMM state
● Algorithm:
− Calculate the distances di from w to each Si
● Perceptual distance: closely spaced LSFs correspond to formant locations and are given higher weight
− From the distances, calculate weights vi, representing w as a linear combination of the Si's
● Minimize error?

Gradient Descent
● Find locally optimal weights minimizing the error between the reconstructed and actual LSFs
● Algorithm:
− Find the gradient of the (perceptually weighted) difference between the reconstructed and actual LSFs
− Weight the gradient by a small value (controls speed of convergence)
− Add to the old weights
− Repeat until the difference in weights between iterations is sufficiently small
● Found that only a few weights are given large values
− Only use the 5 most likely weights
● 15% additional reduction in Itakura-Saito distance, 0.4 dB error

Use of weights
● Given weights that reconstruct an LSF vector (a segment of the source speaker's speech) as a linear combination of centroids
● Apply those weights to the target's centroids, and use the resulting LSFs to reconstruct speech
● Other transformations?
− Excitation spectral
characteristics
− Prosody
− Can estimate new weights for all, but why?
(Artist's impression)

Excitation and Vocal Tract
● Use the weights to construct an excitation filter
− A linear combination of the centroids' (average target excitation magnitude spectra) over (source excitation magnitude spectra)
● Use the weights to construct the vocal tract spectrum: convert the transformed LSF vectors to LPCs
− Expansion of bandwidths gives unnatural speech

V_t(ω) = | 1 / (1 − Σ_{k=1}^{P} a_k^t · e^{−jkω}) |

Bandwidth modification
● Assume the average formant bandwidth values of the target speaker are similar to those of the most likely target codeword (LSF centroid)
● Since LSFs correspond to formant locations/bandwidths, change bandwidths by changing the distances between adjacent LSFs
● Algorithm:
− Find the LSF entries directly before/after each formant location in the most likely target codeword
− Calculate the average formant bandwidth
− Do the same for the corresponding speech segment's LSF vectors
− Form the ratio of the average codeword bandwidth over the segment bandwidth
− Apply the estimated bandwidth ratio to adjust the LSFs of the speech segment vectors
− Enforce reasonable bandwidths (average bandwidth of the most likely centroid from target speech over 20…

Bandwidth modification result

Prosodic Transformation
● Pitch, duration, and energy are modified to mimic the target
● Dynamic segment lengths
− Constant for unvoiced segments, 2–3 pitch periods for voiced
● Pitch:
− No weights involved
− Modify f0 linearly, matching the mean and variance of the target's f0

Duration
● Uniform duration matching?
● Different people pronounce different phonemes differently
● Need finer control over duration modification

Duration modification
● Duration is phoneme- and context-dependent (coarticulation)
− Triphones as speech units
● Find speech unit centroids (durations) and weights per segment; form the target duration as a linear combination
● Uses?
− Human transcription

Energy scale modification
● Another characteristic of a speaker
● Algorithm (finding an energy scaling factor per time frame):
− Calculate the RMS energy for each codeword
− Derive weights representing the scaling factor as a linear combination of (target's RMS energy) over (source's RMS energy)
− After applying
other modifications, scale the energy

Evaluations
● Want to test the effectiveness of the transformation
− Speaker recognition
− Speech recognition
● Objective and subjective
− Automatic speech recognizer
− Human subjects
● Test

Objective
● Idea: confuse a speaker recognition machine
− Stacking the deck
− Confidence measure
● The machine:
− 256-mixture Gaussian mixture models
● 24-dimension feature vector (MFCCs, deltas)
− Binary split vector quantization
● One vector for all, split into two in arbitrary directions
− Train
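The K-means vector quantization described in the slides above, where each codeword doubles as the "average" of its class, can be sketched as follows. This is a minimal numpy illustration with made-up frame vectors; `kmeans_vq` and its parameters are named for this sketch only, not taken from the paper.

```python
import numpy as np

def kmeans_vq(frames, k, iters=20, seed=0):
    """Minimal K-means vector quantizer: returns the codebook (class
    centroids) and the codeword index assigned to each frame."""
    rng = np.random.default_rng(seed)
    # Initialize the centroids from randomly chosen frames
    centroids = frames[rng.choice(len(frames), size=k, replace=False)]
    for _ in range(iters):
        # Assign each frame to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(frames[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-estimate each centroid as the mean of its assigned frames;
        # this is the "average of all vectors in a class" the slides mention
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = frames[labels == j].mean(axis=0)
    return centroids, labels
```

Because the centroid update is exactly the per-class mean, no extra averaging pass is needed afterwards, which is what the slides mean by K-means giving the class averages "for free".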
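The gradient-descent weight estimation outlined above can also be sketched in a few lines. This simplified version uses a plain squared error in place of the paper's perceptual weighting, and hypothetical centroid/LSF values; the stopping rule is the one the slides describe (iterate until the change in weights is sufficiently small).

```python
import numpy as np

def estimate_weights(w, S, lr=0.01, tol=1e-8, max_iters=10000):
    """Find weights v so that S.T @ v approximates the LSF vector w.
    S holds one centroid per row. Plain squared error stands in for
    the paper's perceptually weighted distance."""
    v = np.full(len(S), 1.0 / len(S))    # start from uniform weights
    for _ in range(max_iters):
        resid = S.T @ v - w              # reconstruction error
        grad = S @ resid                 # gradient of 0.5 * ||S.T v - w||^2
        v_new = v - lr * grad            # small step toward the optimum
        if np.linalg.norm(v_new - v) < tol:
            return v_new                 # weights have settled
        v = v_new
    return v
```

In STASC the same weights are then reapplied to the target speaker's centroids; the sparsification step (keeping only the 5 largest weights) is omitted here for brevity.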
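The vocal-tract magnitude spectrum V_t(ω) = |1 / (1 − Σ_{k=1}^{P} a_k^t e^{−jkω})| given above can be evaluated numerically from the LPC coefficients. A sketch, assuming a short made-up coefficient vector (the function name and arguments are illustrative):

```python
import numpy as np

def vocal_tract_spectrum(a, omegas):
    """Magnitude spectrum |1 / (1 - sum_k a_k e^{-jk*omega})| of the LPC
    all-pole filter, for coefficients a_1..a_P at frequencies omegas."""
    k = np.arange(1, len(a) + 1)
    # Sum over k at every frequency: the outer product builds k*omega pairs
    denom = 1.0 - np.exp(-1j * np.outer(omegas, k)) @ a
    return np.abs(1.0 / denom)
```

For a single coefficient a_1 = 0.5, the spectrum is 1/(1 − 0.5) = 2 at ω = 0 and 1/(1 + 0.5) ≈ 0.667 at ω = π, the expected low-pass shape of a one-pole filter.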
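The binary-split vector quantization mentioned in the evaluation setup (one vector for all the data, split into two in arbitrary directions, repeated) can be sketched as below; the splitting perturbation size and refinement schedule are assumptions of this sketch, not values from the slides.

```python
import numpy as np

def binary_split_codebook(frames, target_size, eps=0.01, iters=10, seed=0):
    """Binary-split VQ: start with the global mean as a single codeword,
    then repeatedly double the codebook by splitting every codeword in an
    arbitrary (random) direction, refining with K-means after each split."""
    rng = np.random.default_rng(seed)
    codebook = frames.mean(axis=0, keepdims=True)   # one vector for all frames
    while len(codebook) < target_size:
        # Split each codeword into two, perturbed in an arbitrary direction
        noise = eps * rng.standard_normal(codebook.shape)
        codebook = np.vstack([codebook + noise, codebook - noise])
        for _ in range(iters):                      # K-means refinement
            d = np.linalg.norm(frames[:, None] - codebook[None], axis=2)
            labels = d.argmin(axis=1)
            for j in range(len(codebook)):
                if np.any(labels == j):
                    codebook[j] = frames[labels == j].mean(axis=0)
    return codebook
```

Doubling from one codeword gives codebook sizes that are powers of two, which matches its use for initializing mixture models such as the 256-mixture GMMs described above.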

