AN EXAMINATION OF AUDIO-VISUAL FUSED HMMS FOR SPEAKER RECOGNITION

David Dean, Tim Wark and Sridha Sridharan

Speech, Audio, Image and Video Research Laboratory
Queensland University of Technology
GPO Box 2434, Brisbane 4001, Australia
ddean@ieee.org, s.sridharan@qut.edu.au

ABSTRACT

Fused hidden Markov models (FHMMs) have been shown to work well for the task of audio-visual speaker recognition, but only in an output decision fusion configuration of both the audio- and video-biased versions of the FHMM structure. This paper looks at the performance of the audio- and video-biased versions independently, and shows that the audio-biased version is considerably more capable for speaker recognition. Additionally, this paper shows that by taking advantage of the temporal relationship between the acoustic and visual data, the audio-biased FHMM provides better performance at less processing cost than the best-performing output decision fusion of regular HMMs.

1. INTRODUCTION

The aim of audio-visual speaker recognition (AVSPR) is to make use of complementary information between the acoustic and visual domains to improve the performance of traditional acoustic speaker recognition. Most current approaches to AVSPR either combine the output of individual hidden Markov models (HMMs) in each modality (late fusion) or use a single HMM to classify both modalities (early fusion). Because the decisions or scores are combined at the whole-utterance level, late fusion cannot take true advantage of the temporal dependencies between the two modalities. While early fusion has the advantage that it can exploit these dependencies, it often suffers from problems with noise and has difficulties in modelling the asynchronicity of audio-visual speech [1].

The problems with performing AVSPR with early or late fusion have led to the development of middle fusion methods, or models that accept two streams of input and combine the streams within the model to produce a single score or decision. Most existing approaches to middle fusion
use coupled HMMs, which combine two single-stream HMMs by linking the dependencies of their hidden states. However, due to the small number of hidden states in each modality, these dependencies are often not strong enough to capture the true dependency between the two streams [2].

[Footnotes: This research was supported by a grant from the Australian Research Council (ARC), Linkage Project LP0562101. Queensland University of Technology; CSIRO ICT Centre, Brisbane 4001, Australia; tim.wark@csiro.au.]

Fused HMMs (FHMMs) were developed by Pan et al. [3] by attempting to design a model that maximises the mutual information between the two modalities within a multi-stream HMM. Pan et al. found that the optimal multi-stream HMM design would result from linking the hidden states of one HMM to the observations of the other, rather than linking the hidden states together as in a coupled HMM. This configuration means that FHMMs can be biased towards either modality, and the configuration chosen for AVSPR will depend upon which modality is judged to be more reliable. Additionally, the two biased FHMMs can be combined using late fusion if the comparative reliability of each modality is less clear. In the introductory paper for FHMMs [3], Pan et al. found that a 50/50 fusion of the two biased FHMMs performed significantly better than a number of alternative AVSPR modelling techniques.

In this paper we propose to look at the performance of each of the biased FHMMs individually, rather than in the decision fusion configuration used by Pan et al. By studying the suitability of each of the biased FHMMs to both acoustic and visual degradation, future audio-visual speech research can take advantage of the idiosyncrasies of each biased FHMM. In particular, if recognition can be performed adequately using only a single biased FHMM, the processing required is half that of the fusion of two biased FHMMs. In addition, the performance of the biased FHMMs will be compared to the decision fusion of normal single-stream HMMs.

2. AUDIO-VISUAL FUSED HMMS

2.1. Modelling

Consider two tightly coupled time series O^A = (o^A_0, o^A_1, ..., o^A_{T-1}) and O^V = (o^V_0, o^V_1, ..., o^V_{T-1}), corresponding to audio and video observations respectively. Assume that O^A and O^V can be modelled by two HMMs with hidden states U^x = (u^x_0, u^x_1, ..., u^x_{T-1}), where x is A or V respectively. In the FHMM framework, an optimal solution for p(O^A, O^V) according to the maximum entropy principle [3] is given by

    p(O^A, O^V) = p(O^A) p(O^V) * [ p(w, v) / ( p(w) p(v) ) ]        (1)

where w = g_A(O^A) and v = g_V(O^V) are transformations designed such that p(w, v) is easier to calculate than p(O^A, O^V) but still reflects the statistical dependence between the two streams. The final term in (1) can therefore be viewed as a correlation weighting, which will be high if w and v are related and low if they are mostly independent.

In [3], Pan et al. showed that, according to the maximum mutual information criterion, the transformations g_A and g_V can result in either of the following:

    w = Û^A,  v = O^V        (2)
    w = O^A,  v = Û^V        (3)

where Û^x is an estimate of the optimal state sequence of HMM x for output O^x. Invoking (2) in (1) gives

    p_A(O^A, O^V) = p(O^A) p(O^V) * [ p(Û^A, O^V) / ( p(Û^A) p(O^V) ) ]
                  = p(O^A) p(O^V | Û^A)        (4)

where p(O^A) can be obtained from the regular audio HMM, and p(O^V | Û^A) is the likelihood of the video output sequence given the estimated audio HMM state sequence that produced O^A. This equation represents the audio-biased FHMM, as the main decoding process is the audio HMM. Similarly, invoking (3) to arrive at the video-biased FHMM gives

    p_V(O^A, O^V) = p(O^V) p(O^A | Û^V)        (5)

The choice between the audio- and video-biased FHMM should be based upon which individual HMM can more reliably estimate the hidden state sequence for a particular application. Alternatively, both versions can be used concurrently and combined using decision fusion, as in Pan et al. [3].

2.2. Training

The training of a biased FHMM is a three-step process:

1. The dominant individual HMM is trained independently.
2. The best hidden state sequence of the trained HMM is found for each training observation using the Viterbi process [4].
3. The coupling parameters are determined between the hidden state sequences and the subordinate observations.

[Fig. 1. State diagram representations of (a) a regular HMM and (b) a fused HMM.]

Step 1 establishes the model parameters of the dominant HMM, and step 2 gives the estimated state sequence Û^d of the dominant HMM that produces the dominant training observations O^d. The coupling parameters are then determined as follows:

    B̂^{d,s} = arg max p(O^s | Û^d)        (6)
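As a concrete illustration of the audio-biased score in Eq. (4), the sketch below scores a pair of discrete observation sequences against a toy audio HMM, approximating log p(O^A) by the Viterbi best-path log-likelihood. All function names, the toy model, and the coupling table are illustrative assumptions, not from the paper (which works with continuous audio-visual features).

```python
import numpy as np

def viterbi(log_pi, log_trans, log_emit, obs):
    """Best state sequence and best-path log-likelihood for a discrete HMM."""
    T, N = len(obs), len(log_pi)
    delta = np.empty((T, N))             # best-path scores per (time, state)
    psi = np.zeros((T, N), dtype=int)    # backpointers
    delta[0] = log_pi + log_emit[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # scores[i, j]: end in i, move to j
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = scores[psi[t], np.arange(N)] + log_emit[:, obs[t]]
    states = np.empty(T, dtype=int)
    states[-1] = int(np.argmax(delta[-1]))
    for t in range(T - 2, -1, -1):
        states[t] = psi[t + 1, states[t + 1]]
    return states, float(delta[-1].max())

def audio_biased_score(log_pi, log_trans, log_emit_audio, log_coupling,
                       audio_obs, video_obs):
    """log p_A(O^A, O^V) = log p(O^A) + log p(O^V | U^A), as in Eq. (4).

    log_coupling[s, v] plays the role of log p(o^V = v | audio state s);
    log p(O^A) is approximated by the Viterbi best-path likelihood.
    """
    states, log_p_audio = viterbi(log_pi, log_trans, log_emit_audio, audio_obs)
    log_p_video = sum(log_coupling[s, v] for s, v in zip(states, video_obs))
    return log_p_audio + log_p_video
```

In a recognition setting one such model would be scored per claimed speaker, with the video stream decoded only along the single audio state path; this is where the reduced processing cost relative to decoding two full HMMs comes from.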


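For discrete subordinate observations, the coupling estimate of Eq. (6) reduces to counting how often each subordinate symbol co-occurs with each dominant state along the Viterbi alignments from step 2. The sketch below works under that simplifying assumption; the smoothing term and all names are illustrative additions, and the paper's video features are in reality continuous.

```python
import numpy as np

def estimate_coupling(state_seqs, video_seqs, n_states, n_symbols, smoothing=1.0):
    """ML estimate (with additive smoothing) of p(o^V = v | dominant state s).

    state_seqs: Viterbi state sequences of the trained dominant HMM (step 2);
    video_seqs: time-aligned subordinate (video) observation sequences.
    Returns a row-stochastic (n_states, n_symbols) coupling matrix.
    """
    counts = np.full((n_states, n_symbols), smoothing, dtype=float)
    for states, video in zip(state_seqs, video_seqs):
        for s, v in zip(states, video):
            counts[s, v] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)
```

Each row of the returned matrix is one coupling distribution, the role played by B̂ in Eq. (6); because the dominant HMM and the coupling table are fitted in separate steps, no joint re-estimation over both streams is needed.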