PERFORMANCE ANALYSIS OF COMPRESSED-DOMAIN AUTOMATIC SPEAKER RECOGNITION AS A FUNCTION OF SPEECH CODING TECHNIQUE AND BIT RATE

M. Petracca, A. Servetti, J.C. De Martin
Dipartimento di Automatica e Informatica - Politecnico di Torino
Corso Duca degli Abruzzi, 24 - I-10129 Torino, Italy
E-mail: [matteo.petracca|servetti|demartin]@polito.it

The work was supported in part by Motorola Electronics S.p.A., MDB Development Center, Turin, Italy.

ABSTRACT

Compressed-domain automatic speaker recognition is based on the analysis of the compressed parameters of speech coders. The objective is to perform low-complexity on-line speaker recognition for VoIP in the compressed domain, without the need to decode or resynthesize the speech bitstream. In this paper, we present initial results in determining the recognition accuracy that can be achieved with five widely used speech coding standards. Experiments with a database of 14 speakers obtain a recognition ratio close to 100% after the analysis of 30 seconds of active speech for most of the considered speech coders and rates. In particular, the results show that performance does not strictly depend on coding rate or codec speech quality.

1. INTRODUCTION

The Internet is rapidly evolving into a universal communication network that carries all types of traffic, including voice, video and data. Among them, the most important trend over the past few years has arguably been the rapid growth of voice over IP (VoIP) services. In the coming years, with the continued increase in the use of VoIP telephony, there will also be increased interest in the availability of online speaker recognition systems for providing various interactive voice services via VoIP phones. Additionally, fast and scalable processing of VoIP packets for speaker identification will be a requirement for law enforcement agencies when wiretapping and eavesdropping on high-traffic VoIP provider networks becomes necessary.

However, traditional automatic speaker recognition (ASR) cannot be directly applied to live VoIP calls because it operates on the uncompressed (PCM) speech waveform, while voice travels IP networks mostly in a compressed format. Before transmission, in fact, the sender applies compression standards to reduce the amount of information that must be sent to the other party. As a consequence, the data has to be decompressed to obtain an approximation of the original voice signal waveform before traditional speaker recognition methods can be applied. This time- and resource-consuming process is therefore unsuitable for implementation in VoIP apparatuses or network sniffers where a large number of calls must be monitored simultaneously.

In this paper, we consider an alternative approach for performing on-line speaker recognition from live streams of compressed voice packets. This method has been previously presented as compressed-domain automatic speaker recognition (CD-ASR) in [1] [2], where voice feature vectors are made up of compressed bitstream values from coded speech frames (a schematic sketch of this idea is given below). In [1], a tentative implementation limited to the GSM Adaptive Multi-Rate (AMR) standard at 12.2 kb/s showed that, in some circumstances, speaker recognition in the compressed domain is possible (for that particular coder) after the analysis of about 20 seconds of active speech. The objective of this paper is to investigate whether CD-ASR is applicable in a broader context to other compressed speech formats, or, within the GSM AMR standard, to other coding bit rates.
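The following is a minimal, hypothetical Python sketch of the compressed-domain idea: one feature vector per coded frame, assembled directly from quantized codec parameters (for a CELP-style coder, e.g. spectral-envelope and gain indices), with no decoding or resynthesis. The field names and the choice of parameters are assumptions for illustration only; the actual feature set of [1][2] is not reproduced here.

```python
# Hypothetical CD-ASR feature construction. The parameter names below
# ("lsf_indices", "gain_indices") are illustrative, not the paper's actual
# feature definition. Each coded frame is assumed to have already been
# parsed from the RTP payload according to the codec's bit-allocation tables.
import numpy as np

def cd_asr_features(frames):
    """Build one feature vector per coded frame from quantized codec
    parameters; the speech is never decoded or resynthesized."""
    return np.vstack([
        np.concatenate([np.asarray(f["lsf_indices"], dtype=float),
                        np.asarray(f["gain_indices"], dtype=float)])
        for f in frames
    ])

# Dummy example: two coded frames, 3 spectral indices + 2 gain indices each.
frames = [{"lsf_indices": [12, 7, 30], "gain_indices": [4, 9]},
          {"lsf_indices": [11, 8, 29], "gain_indices": [5, 9]}]
print(cd_asr_features(frames).shape)  # -> (2, 5)
```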
In particular, we adapt the speaker recognition algorithm to widely used speech codecs for VoIP telephony that differ not only in bit rate, but also in compression technique. We consider, in fact, a low-bit-rate LPC-based mixed-excitation (MELP) vocoder, analysis-by-synthesis algorithms with multipulse (G.723) or codebook-based (GSM AMR, G.729) excitation models, and a coder that does not employ inter-frame prediction (iLBC).

The rest of this paper is organized as follows. An overview of ASR approaches is presented in Section 2, where, besides traditional systems that use clean voice waveforms as input, we describe other approaches that work with coded speech. CD-ASR is then discussed in Section 3. In Section 4 we investigate the recognition rate achieved in our experiments with various speech codecs at different bit rates.

2. OVERVIEW OF AUTOMATIC SPEAKER RECOGNITION APPROACHES

Figure 1 illustrates the encoding, transmission and decoding chain for VoIP communications. Within this context, the four most commonly used ASR approaches may work, with different levels of complexity and performance, at the sender with uncoded speech (1), at the receiver with decoded speech (2), at the receiver with decompressed parameters (3), or in the IP network with coded speech and compressed parameters (4).

Fig. 1. In VoIP communications, the sender applies encoding standards to reduce the amount of information that is sent through the IP network. Hence, speech data traverses the network in a coded format and has to be decoded and resynthesized at the receiver to obtain a voice signal similar to the original waveform.

In the first, most traditional, case the input material is a digitalized PCM representation of the voice waveform (i.e., uncoded speech). This signal is Fourier transformed into the frequency domain, where the magnitude spectrum from a short-time frame of speech is extracted. The spectrum is then pre-emphasized and processed by a simulated mel-scale filter bank. Finally, the log-scaled output energy of each individual filter is cosine transformed to produce the cepstral coefficients. This processing may occur every 10 ms, producing 100 feature vectors per second that are then used in a classification algorithm such as the Gaussian Mixture Model - Universal Background Model (GMM-UBM), as presented in [3] (a sketch of this front end is given below).

In recent years, however, due to the widespread use of digital speech communication systems, there has been an increasing need for a second automatic speaker recognition approach that uses decoded speech. The effect of speech coding/decoding on speaker and language recognition tasks has been analyzed for several coders and a wide range of bit rates (e.g., GSM at 12.2 kb/s, G.729 at 8 kb/s, and G.723.1 at 5.3 kb/s) [4]. These studies showed that
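For reference, the traditional uncoded-speech front end described in Section 2 can be sketched with off-the-shelf tools. The use of librosa and scikit-learn here is an assumption for illustration, not the authors' implementation, and full GMM-UBM scoring (MAP-adapted speaker models compared against a universal background model, as in [3]) is only hinted at.

```python
# Sketch of the traditional MFCC front end plus a GMM speaker model.
# Assumptions: librosa for feature extraction, scikit-learn for the GMM;
# neither is claimed to be what the authors used.
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_features(wav_path, sr=8000, n_mfcc=13):
    """Short-time magnitude spectra -> mel filter bank -> log -> DCT,
    one feature vector every 10 ms (about 100 vectors per second)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr),       # 25 ms window
                                hop_length=int(0.010 * sr))  # 10 ms hop
    return mfcc.T  # shape: (num_frames, n_mfcc)

def train_speaker_model(features, n_components=32):
    """Fit a diagonal-covariance GMM on one speaker's feature vectors."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(features)
    return gmm

def score(gmm, features):
    """Average per-frame log-likelihood; a GMM-UBM system would compare this
    against the background model's score to accept or reject the speaker."""
    return gmm.score(features)
```

With a 10 ms hop, 30 seconds of active speech corresponds to roughly 3000 feature vectors per utterance.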

