TAMU CSCE 689 - arslan 1999 voice Conversion


Speaker Transformation Algorithm using Segmental Codebooks (STASC) [1]

Levent M. Arslan [2]
Electrical and Electronics Department, Boğaziçi University, Bebek, 80815 Istanbul, Turkey

Received 27 February 1998; received in revised form 17 December 1998; accepted 8 February 1999

Abstract

This paper presents a new voice conversion algorithm which modifies the utterance of a source speaker to sound like speech from a target speaker. We refer to the method as Speaker Transformation Algorithm using Segmental Codebooks (STASC). A novel method is proposed which finds accurate alignments between source and target speaker utterances. Using the alignments, source speaker acoustic characteristics are mapped to target speaker acoustic characteristics. The acoustic parameters included in the mapping are vocal tract, excitation, intonation, energy, and duration characteristics. Informal listening tests suggest that convincing voice conversion is achieved while maintaining high speech quality. The performance of the proposed system is also evaluated on a simple Gaussian mixture model-based speaker identification system, and the results show that the transformed speech is assigned higher likelihood by the target speaker model when compared to the source speaker model. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: Voice conversion; Speaker transformation; Codebook; Line spectral frequencies; Hidden Markov models; Time-varying filter; Overlap-add analysis

1. Introduction

There has been a considerable amount of research effort directed at the problem of voice transformation recently (Abe et al., 1988; Baudoin and Stylianou, 1996; Childers, 1995; Iwahashi and Sagisaka, 1995; Kuwabara and Sagisaka, 1995; Narendranath et al., 1995; Stylianou et al., 1995). This topic has numerous applications, which include personification of text-to-speech systems, multimedia entertainment, and preprocessing for speech recognition to reduce speaker variability.
In general, the approach to the problem consists of a training phase in which speech training data from source and target speakers are used to formulate a spectral transformation that maps the acoustic space of the source speaker to that of the target speaker. The acoustic space can be characterized by a number of possible acoustic features, which have been studied extensively in the literature. The most popular features used for voice transformation include formant frequencies (Abe et al., 1988; Narendranath et al., 1995) and LPC cepstrum coefficients (Lee et al., 1996). The transformation is in general based on codebook mapping (Abe et al., 1988; Acero, 1993; Baudoin and Stylianou, 1996; Lee et al., 1996). That is, a one-to-one correspondence between the spectral codebook entries of the source speaker and the target speaker is developed by some form of supervised vector quantization. In general, these methods face several problems, such as artifacts introduced at the boundaries between successive speech frames, limitations on robust estimation of parameters (e.g., formant frequency estimation), or distortion introduced during synthesis of target speech. Another issue which has not been explored in detail is the transformation of the excitation characteristics in addition to the vocal tract characteristics. Several studies have recently proposed solutions to address this issue (Childers, 1995; Lee et al., 1996). In this study, we propose new and effective solutions to both problems with the goal of maintaining high speech quality.

Speech Communication 28 (1999) 211-226. www.elsevier.nl/locate/specom
[1] Speech files available. See www.elsevier.nl/locate/specom.
[2] Tel.: +90 212 2631540/1421; fax: +90 212 2872465; e-mail: [email protected]. The author was with Entropic Research Laboratory, Washington, DC.

2. Algorithm description

This section provides a general description of the Speaker Transformation Algorithm using Segmental Codebooks (STASC). We will describe the algorithm under two main headings: (i) transformation of spectral characteristics, and (ii) transformation of prosodic characteristics.

2.1. Spectral transformation

Line spectral frequencies (LSFs) are selected to represent the vocal tract characteristics of the source and target speakers. The reason for selecting LSFs is that these parameters relate closely to formant frequencies (Crosmer, 1985), but in contrast to formant frequencies they can be estimated quite reliably. They have been used successfully for a number of applications in the literature (Hansen and Clements, 1991; Arslan et al., 1995; Arslan and Talkin, 1997; Crosmer, 1985; Laroia et al., 1991; Itakura, 1975; Pellom and Hansen, 1997). They have good interpolation properties and they are stable (Paliwal, 1995). In addition, they have a fixed dynamic range, which makes them attractive for real-time DSP implementation. LSFs can be estimated by modifying the LPC polynomial, A(z), in two ways: P(z) and Q(z) are obtained by augmenting A(z)'s PARCOR sequence with +1 and -1, respectively. This results in the following two polynomials, which have all their roots on the unit circle:

  P(z) = (1 - z^-1) ∏_{k=1,3,5,...}^{P-1} (1 - 2 cos(w_k) z^-1 + z^-2),
  Q(z) = (1 + z^-1) ∏_{k=2,4,6,...}^{P-1} (1 - 2 cos(w_k) z^-1 + z^-2),      (1)

where P is the LPC analysis order, and the angles of the roots, w_k, are the LSFs. In the STASC algorithm, codebooks of LSFs are used to represent the vocal tract characteristics of individual speakers. The codebooks can be generated in two ways.

The first method assumes that an orthographic transcription is available along with the training data. The training speech (sampled at 16 kHz) from the source and target speakers is first segmented automatically using forced alignment to a phonetic translation of the orthographic transcription.
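The LSF derivation in Eq. (1) can be illustrated numerically. The following sketch is our own, not the paper's implementation: it forms the sum and difference polynomials from the LPC coefficients directly, finds their roots, and returns the root angles. Note that which of the two polynomials carries the (1 - z^-1) factor versus the (1 + z^-1) factor depends on the LPC order parity and the sign convention used.

```python
import numpy as np

def lsf_from_lpc(a):
    """Line spectral frequencies (radians) from LPC coefficients a = [1, a1, ..., aP].

    Builds the two augmented polynomials A(z) +/- z^-(P+1) A(z^-1),
    whose roots all lie on the unit circle, then returns the sorted
    root angles in (0, pi), dropping the trivial roots at z = 1 and z = -1.
    """
    a = np.asarray(a, dtype=float)
    p_poly = np.concatenate([a, [0.0]]) + np.concatenate([[0.0], a[::-1]])
    q_poly = np.concatenate([a, [0.0]]) - np.concatenate([[0.0], a[::-1]])
    roots = np.concatenate([np.roots(p_poly), np.roots(q_poly)])
    angles = np.angle(roots)
    # Keep one angle per conjugate pair, excluding the fixed roots at 0 and pi.
    eps = 1e-8
    return np.sort(angles[(angles > eps) & (angles < np.pi - eps)])
```

For a stable LPC polynomial the LSFs from the two polynomials interlace on the unit circle, which is the property that makes quantization and interpolation of LSFs well behaved.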
The segmentation algorithm uses Mel-cepstrum coefficients and delta coefficients within an HMM framework and is described in detail in (Wightman and Talkin, 1994). The LSFs for the source and target speaker utterances are calculated on a frame-by-frame basis, and each LSF vector is labeled using the phonetic segmenter. Next, a centroid LSF vector for each phoneme is estimated for both the source and target speaker codebooks by averaging across all the corresponding speech frames. The estimated codebook spectra for an example male source speaker and female
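The codebook construction step just described — averaging the phoneme-labeled LSF vectors into one centroid per phoneme — can be sketched as follows. This is our own illustration; the function name and toy labels are not from the paper.

```python
import numpy as np
from collections import defaultdict

def build_lsf_codebook(lsf_frames, phone_labels):
    """Centroid LSF vector per phoneme label.

    lsf_frames: (n_frames, order) array of per-frame LSF vectors.
    phone_labels: n_frames phoneme labels from the forced-alignment
    segmenter; frames sharing a label are averaged into one centroid.
    """
    buckets = defaultdict(list)
    for vec, lab in zip(lsf_frames, phone_labels):
        buckets[lab].append(vec)
    return {lab: np.mean(vecs, axis=0) for lab, vecs in buckets.items()}
```

Run once on the source speaker's labeled frames and once on the target speaker's, this yields two codebooks whose entries correspond one-to-one through the shared phoneme labels.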

