Emotion recognition using neutral acoustic speech models

Carlos Busso (1), Emily Mower (1), Takayuki Hidehira (2)
Affective Computing (CSCI 534)
(1) Electrical Engineering Department, (2) Department of Computer Science
University of Southern California, Los Angeles, CA 90089
{busso, mower, hidehira}@usc.edu

Abstract

Since emotional speech can be regarded as a variation of neutral (non-emotional) speech, a robust neutral speech model is expected to be useful for contrasting the different emotions expressed in speech. The study presented here explores this idea by creating acoustic models trained with spectral features measured from the emotionally-neutral TIMIT corpus. Performance is tested on two emotional speech databases: one recorded with a microphone (acted), and another recorded from a telephone application (spontaneous). With this method, accuracy levels of up to 78% and 65% are achieved in the binary and category emotion discriminations, respectively. Raw Mel Filter Bank (MFB) outputs outperformed conventional MFCCs for both broad-band and telephone-band speech. These results suggest that well-trained neutral acoustic models can be effectively used as a front-end for emotion recognition and that, once trained with MFB features, they may work well regardless of the channel characteristics.

Index Terms: emotion recognition, neutral speech, HMMs, Mel filter bank (MFB), TIMIT

1. Introduction

Detecting and utilizing non-lexical or paralinguistic cues from a user is one of the major challenges in the development of usable human-machine interfaces (HMI). Notable among these cues are the universal categorical emotional states (e.g., angry, happy, sad), which are prevalent in day-to-day scenarios. Knowing such emotional states should help the system adjust its responses so that users interacting with it are more engaged and have a more effective interaction.

Identifying the user's emotion from the speech signal is desirable: recording and feature extraction for this modality are comparatively easier and simpler than for other modalities, such as facial expression and body posture. Previous studies on the automatic categorization of emotional speech have reported accuracies between 50% and 85%, depending on the task (e.g., number of emotion labels, number of speakers, size of database) [1]. A comprehensive review of current approaches is given in [2].

However, such emotion categorization performance is largely specific to individual databases (usually off-line), and the results cannot easily be generalized to different databases or to on-line recognition tasks. This is due to the inherent speaker dependency of emotion expression, acoustic confusion among emotional categories, and differences in acoustic environments across recording sessions. It is also fairly difficult, if not infeasible, to collect enough emotional speech data to train robust and universal acoustic models of individual emotions. The problem is further complicated by the fact that there exist more than a dozen emotional categories whose possible combinations can be used to differentiate affective states and attitudes [3].

As a possible way to circumvent this fundamental problem in emotion categorization based on speech acoustics, this study tests the novel idea of discriminating emotional speech against neutral (i.e., non-emotional) speech.
That is, instead of training individual emotional models, we build a single neutral speech model and use it for emotion evaluation, either in the categorical approach or in the dimensional approach [3]. This idea is based on the assumption that emotional speech production is a variant of its non-emotional counterpart in the (measurable) feature space. It has been shown that speech rate, speech duration, the fundamental frequency (F0) and the RMS energy are simultaneously modulated to convey emotional information [4]. In the articulatory domain, it has been shown that tongue tip, jaw and lip kinematics during expressive speech production differ from those of neutral speech [5, 6]. Hence, modeling the differential properties with respect to neutral speech is hypothesized to be advantageous. In addition, robust neutral acoustic speech models can be built because many neutral speech corpora are available. This paper presents our first attempt to examine this idea.

In this preliminary report, the TIMIT database is used to train neutral acoustic models, and two emotional speech databases are probed. Hidden Markov Models (HMMs) are trained with two different acoustic feature sets, Mel Filter Bank (MFB) outputs and Mel-Frequency Cepstrum Coefficients (MFCCs), and their behaviors are examined in a broad phonetic-class recognition experiment based on recognition likelihood scores. The emotional discrimination performance of the two feature sets is also investigated and compared using a discriminant analysis. The results show that binary emotion recognition accuracies of up to 78% can be achieved using only these features.

Building upon these ideas, this paper presents a semi real-time system to detect emotional speech. Based on the accuracy of the system, we claim that this approach is suitable for detecting emotional speech in real applications.

This paper is organized as follows. Section 2 describes the proposed approach. Section 3 analyzes the likelihood scores from emotional and non-emotional speech obtained with the neutral models. Section 4 provides a discriminant analysis of the likelihood scores. Section 5 describes the implementation of the system. Finally, Section 6 gives a discussion and future directions of this work.

Table 1: Broad phone classes

  Class  Description         Phonemes
  F      Front vowels        iy ih eh ae ix
  B      Mid/back vowels     ax ah ax-h uw uh ao aa ux
  D      Diphthongs          ey ay oy aw ow
  L      Liquids and glides  l el r y w er axr
  N      Nasals              m n en ng em nx eng
  T      Stops               b d dx g p t k pcl tcl kcl qcl bcl dcl gcl q epi
  C      Fricatives          ch j jh dh z zh v f th s sh hh hv
  S      Silence             sil h# #h pau

2. Methodology

The proposed emotional segmentation approach is a two-step process, as described in Figure 1. In the first step, neutral models are built to measure the degree of similarity between the input speech and the reference neutral speech; the output of this block is a fitness measure of the input speech. In the second step, these measures are used as features to infer whether the input speech is emotional or neutral. The primary focus of this paper is on the first block.

[Figure 1: Two-step process. Step 1: Assessing the models. Step 2: Classification.]
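[Editor's illustration] The two-step pipeline above can be made concrete with a short sketch. The Python code below is a hedged, hypothetical illustration and not the authors' implementation: it assumes log-MFB features computed with librosa, a single Gaussian HMM trained with hmmlearn as the "neutral" model (the paper instead uses broad phone-class models, Table 1), a per-frame normalized log-likelihood as the fitness measure, and scikit-learn's LinearDiscriminantAnalysis for the second step. File names, hyperparameters (number of states, filter-bank size) and the choice of libraries are all assumptions made for the example.

# Minimal sketch of the two-step approach (illustrative only; not the paper's setup).
import numpy as np
import librosa
from hmmlearn.hmm import GaussianHMM
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def mfb_features(wav_path, sr=16000, n_mels=26):
    """Log Mel Filter Bank (MFB) outputs, one row per 10 ms frame (25 ms window)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=400, hop_length=160, n_mels=n_mels)
    return librosa.power_to_db(mel).T          # shape: (frames, n_mels)


# --- Step 1: train a neutral model and use it to compute a fitness measure ---
def train_neutral_model(neutral_wavs, n_states=8):
    """Fit one Gaussian HMM on pooled neutral (e.g., TIMIT-style) utterances."""
    feats = [mfb_features(p) for p in neutral_wavs]
    X = np.vstack(feats)
    lengths = [f.shape[0] for f in feats]
    hmm = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
    hmm.fit(X, lengths)
    return hmm


def fitness(hmm, wav_path):
    """Per-frame normalized log-likelihood of an utterance under the neutral model."""
    f = mfb_features(wav_path)
    return hmm.score(f) / f.shape[0]


# --- Step 2: discriminant analysis on the fitness measures ---
def train_emotion_detector(hmm, wavs, labels):
    """labels: 1 = emotional, 0 = neutral; uses the 1-D likelihood feature."""
    X = np.array([[fitness(hmm, p)] for p in wavs])
    clf = LinearDiscriminantAnalysis()
    clf.fit(X, labels)
    return clf


# Usage (hypothetical file lists):
# neutral_hmm = train_neutral_model(["timit_1.wav", "timit_2.wav"])
# detector = train_emotion_detector(neutral_hmm, ["utt1.wav", "utt2.wav"], [1, 0])
# print(detector.predict([[fitness(neutral_hmm, "test.wav")]]))

In the paper, the first step produces likelihood scores from broad phone-class HMMs rather than from a single model, and those scores form the feature vector for the discriminant analysis; the single-model, single-feature version above is only meant to show the structure of the pipeline.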

