AN EXAMINATION OF AUDIO-VISUAL FUSED HMMS FOR SPEAKER RECOGNITION

David Dean∗, Tim Wark† and Sridha Sridharan∗

∗Speech, Audio, Image and Video Research Laboratory, Queensland University of Technology, GPO Box 2434, Brisbane 4001, Australia
†Queensland University of Technology & CSIRO ICT Centre, Brisbane 4001, Australia
[email protected], [email protected], [email protected]

This research was supported by a grant from the Australian Research Council (ARC) Linkage Project LP0562101.

ABSTRACT

Fused hidden Markov models (FHMMs) have been shown to work well for the task of audio-visual speaker recognition, but only in an output decision-fusion configuration of both the audio- and video-biased versions of the FHMM structure. This paper looks at the performance of the audio- and video-biased versions independently, and shows that the audio-biased version is considerably more capable for speaker recognition. Additionally, this paper shows that by taking advantage of the temporal relationship between the acoustic and visual data, the audio-biased FHMM provides better performance at less processing cost than the best-performing output decision-fusion of regular HMMs.

1. INTRODUCTION

The aim of audio-visual speaker recognition (AVSPR) is to make use of complementary information between the acoustic and visual domains to improve the performance of traditional acoustic speaker recognition. Most current approaches to AVSPR either combine the output of individual hidden Markov models (HMMs) in each modality (late fusion), or use a single HMM to classify both modalities (early fusion). Because the decisions or scores are combined at the whole-utterance level, late fusion cannot take true advantage of the temporal dependencies between the two modalities. While early fusion can exploit these dependencies, it often suffers from problems with noise, and has difficulty modelling the asynchrony of audio-visual speech [1]. The problems with performing AVSPR via early or late fusion have led to the development of middle-fusion methods: models that accept two streams of input and combine the streams within the model to produce a single score or decision.

Most existing approaches to middle fusion use coupled HMMs, which combine two single-stream HMMs by linking the dependencies of their hidden states. However, due to the small number of hidden states in each modality, these dependencies are often not strong enough to capture the true dependency between the two streams [2]. Fused HMMs (FHMMs) were developed by Pan et al. [3] in an attempt to design a model that maximises the mutual information between the two modalities within a multi-stream HMM. Pan et al. found that the optimal multi-stream HMM design results from linking the hidden states of one HMM to the observations of the other, rather than linking the hidden states together, as in a coupled HMM.

This configuration means that FHMMs can be biased towards either modality, and the configuration chosen for AVSPR will depend upon which modality is judged to be more reliable. Additionally, the two biased FHMMs can be combined using late fusion if the comparative reliability of each modality is less clear. In the introductory paper for FHMMs [3], Pan et al. found that a 50/50 fusion of the two biased FHMMs performed significantly better than a number of alternative AVSPR modelling techniques.

In this paper, we propose to look at the performance of each of the biased FHMMs individually, rather than in the decision-fusion configuration used by Pan et al. By studying the suitability of each of the biased FHMMs to both acoustic and visual degradation, future audio-visual speech research can take advantage of the idiosyncrasies of each biased FHMM. In particular, if recognition can be performed adequately using only a single biased FHMM, the processing required is half that of the fusion of two biased FHMMs. In addition, the performance of the biased FHMMs will be compared to the decision fusion of normal single-stream HMMs.

2. AUDIO-VISUAL FUSED HMMS

2.1. Modelling

Consider two tightly coupled time series $O^A = \{o^A_0, o^A_1, \ldots, o^A_{T-1}\}$ and $O^V = \{o^V_0, o^V_1, \ldots, o^V_{T-1}\}$, corresponding to audio and video observations respectively. Assume that $O^A$ and $O^V$ can be modelled by two HMMs with hidden states $U^x = \{u^x_0, u^x_1, \ldots, u^x_{T-1}\}$, where $x$ is $A$ or $V$, respectively. In the FHMM framework, an optimal solution for $p(O^A, O^V)$ according to the maximum entropy principle [3] is given by

$$p(O^A, O^V) = p(O^A)\, p(O^V)\, \frac{p(w, v)}{p(w)\, p(v)} \qquad (1)$$

where $w = g^A(O^A)$ and $v = g^V(O^V)$ are transformations designed such that $p(w, v)$ is easier to calculate than $p(O^A, O^V)$, but still reflects the statistical dependence between the two streams. The final term in (1) can therefore be viewed as a correlation weighting, which will be high if $w$ and $v$ are related, and low if they are mostly independent; if they are fully independent, the term is unity and (1) reduces to the product of the two single-stream likelihoods.

In [3], Pan et al. showed that, according to the maximum mutual information criterion, the transformations $g^A$ and $g^V$ can result in either of the following:

$$w = \hat{U}^A, \quad v = O^V \qquad (2)$$

$$w = O^A, \quad v = \hat{U}^V \qquad (3)$$

where $\hat{U}^x$ is an estimate of the optimal state sequence of HMM $x$ for output $O^x$.

Invoking (2) in $p(O^A, O^V)$ gives

$$p^A(O^A, O^V) = p(O^A)\, p(O^V)\, \frac{p(\hat{U}^A, O^V)}{p(\hat{U}^A)\, p(O^V)} = p(O^A)\, p(O^V \mid \hat{U}^A) \qquad (4)$$

where $p(O^A)$ can be obtained from the regular audio HMM, and $p(O^V \mid \hat{U}^A)$ is the likelihood of the video output sequence given the estimated audio HMM state sequence that produced $O^A$. This equation represents the audio-biased FHMM, as the main decoding process is the audio HMM.

Similarly, invoking (3) arrives at the video-biased FHMM:

$$p^V(O^A, O^V) = p(O^V)\, p(O^A \mid \hat{U}^V) \qquad (5)$$

The choice between the audio- and video-biased FHMM should be based on which individual HMM can more reliably estimate the hidden state sequence for a particular application. Alternatively, both versions can be used concurrently and combined using decision fusion, as in Pan et al. [3].
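Equation (4) shows that the audio-biased score needs only standard HMM machinery: a single Viterbi pass over the audio HMM yields both $\log p(O^A)$ and the state sequence $\hat{U}^A$, after which each video frame is scored against the coupling distribution of its time-aligned audio state. The Python sketch below illustrates this computation in the log domain; the `audio_hmm.viterbi` method, the `coupling` table of per-state diagonal Gaussians, and all other names are hypothetical scaffolding for illustration, not interfaces defined by the paper.

```python
import numpy as np

def gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian (var is a vector)."""
    d = x - mean
    return -0.5 * (np.sum(np.log(2 * np.pi * var)) + np.sum(d * d / var))

def audio_biased_fhmm_score(audio_hmm, coupling, o_audio, o_video):
    """Score an utterance with the audio-biased FHMM of Eq. (4):
    log p(O^A) + log p(O^V | U^A-hat).

    Assumed (hypothetical) interfaces: audio_hmm.viterbi(obs) returns
    (log_likelihood, state_sequence), and coupling[j] is the (mean, var)
    of the video-feature Gaussian tied to audio state j.
    """
    # Main decoding process: the audio HMM supplies log p(O^A) and the
    # estimated state sequence U^A-hat via the Viterbi algorithm.
    log_p_audio, u_hat = audio_hmm.viterbi(o_audio)

    # Subordinate stream: each video frame is scored under the coupling
    # distribution of the time-aligned audio state, accumulating
    # log p(O^V | U^A-hat).  This is where the temporal relationship
    # between the two modalities enters the score.
    log_p_video = sum(
        gaussian_logpdf(o_v, *coupling[state])
        for o_v, state in zip(o_video, u_hat)
    )
    return log_p_audio + log_p_video
```

The video-biased score of Eq. (5) follows the same pattern with the roles of the two streams swapped.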
2.2. Training

The training of a biased FHMM is a three-step process:

1. The dominant individual HMM is trained independently.

2. The best hidden state sequence of the trained HMM is found for each training observation using the Viterbi process [4].

3. The coupling parameters are determined between the hidden state sequences and the subordinate observations (see the sketch following this section).

[Fig. 1. State diagram representations of (a) a regular HMM and (b) a fused HMM; the panels label the audio stream as the most reliable modality and the video stream as the least reliable.]

Step 1 establishes
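As a minimal illustration of steps 2 and 3 above, the following sketch pools the time-aligned video frames by Viterbi-decoded audio state and fits one diagonal Gaussian per state as the coupling distribution. It reuses the hypothetical `audio_hmm.viterbi` interface from the earlier scoring sketch, and the single-Gaussian coupling model is an assumption made here for brevity, not a choice fixed by the paper.

```python
import numpy as np

def train_coupling_parameters(audio_hmm, n_states, train_pairs):
    """Step 3 of biased-FHMM training (audio-biased case): estimate the
    coupling distributions p(o^V | u^A = j) from Viterbi alignments.

    train_pairs: list of (O^A, O^V) arrays with one video frame per
    audio frame.  audio_hmm.viterbi(obs) -> (log_likelihood, states) is
    the same hypothetical interface as in the scoring sketch.
    """
    per_state = [[] for _ in range(n_states)]
    for o_audio, o_video in train_pairs:
        # Step 2: best hidden state sequence of the dominant HMM
        # (already trained independently in step 1).
        _, u_hat = audio_hmm.viterbi(o_audio)
        # Pool the time-aligned subordinate (video) frames by the
        # audio state occupied when each frame was observed.
        for o_v, state in zip(o_video, u_hat):
            per_state[state].append(o_v)

    coupling = {}
    for j, frames in enumerate(per_state):
        if not frames:
            continue  # state never visited in the training data
        frames = np.asarray(frames)
        # Maximum-likelihood diagonal Gaussian over the video frames
        # aligned to audio state j; the floor keeps variances positive.
        coupling[j] = (frames.mean(axis=0), frames.var(axis=0) + 1e-6)
    return coupling
```

With the coupling table estimated, the scoring sketch after Section 2.1 can be applied directly to test utterances.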