Columbia ELEN E4896 - Machine Recognition of Sounds in Mixtures

Slide titles: Machine Recognition of Sounds in Mixtures; Computational Auditory Scene Analysis (CASA); What is the goal of CASA?; Segregation vs. Inference; Outline; Speech Recognition as Source Formation; How ASR Represents Speech; Sequence Recognition; Approaches to sound mixture recognition; Outline; Sound Fragment Decoding; Missing Data Recognition; Missing Data Results; Comparing different segregations; Calculating fragment matches; Using CASA features; Fragment decoding; Multi-Source Decoding; Outline; Speech fragment decoder results; Alarm sound detection; Alarms: Results; Summary & Conclusions

Ellis & Barker, Machine Recognition of Sounds in Mixtures, 2003-04-29

Machine Recognition of Sounds in Mixtures

Dan Ellis, LabROSA, Columbia University, New York
Jon Barker, SPandH, Sheffield University, UK

Outline
1. Computational Auditory Scene Analysis
2. Speech Recognition as Source Formation
3. Sound Fragment Decoding
4. Results & Conclusions

Computational Auditory Scene Analysis (CASA)

• Human sound organization: Auditory Scene Analysis
  - composite sound signal → separate percepts
  - based on ecological constraints
  - acoustic cues → perceptual grouping
• Computational ASA: Doing the same thing by computer ...?

[Figure: sound mixture → Object 1 percept, Object 2 percept, Object 3 percept]

What is the goal of CASA?

• Separate signals?
  - output is unmixed waveforms
  - underconstrained, very hard ...
  - too hard? not required?
• Source classification?
  - output is set of event-names
  - listeners do more than this...
• Something in-between? Identify independent sources + characteristics
  - standard task, results?

[Figure label: CASA?]

Segregation vs. Inference

• Source separation requires attribute separation
  - sources are characterized by attributes (pitch, loudness, timbre + finer details)
  - need to identify & gather different attributes for different sources ...
• Need representation that segregates attributes
  - spectral decomposition
  - periodicity decomposition
• Sometimes values can’t be separated
  - e.g. unvoiced speech
  - maybe infer factors from probabilistic model?
    $$p(O, x, y) \rightarrow p(x, y \mid O)$$
  - or: just skip those values, infer from higher-level context

Outline
1. Computational Auditory Scene Analysis
2. Speech Recognition as Source Formation
  - Standard speech recognition
  - Handling mixtures
3. Sound Fragment Decoding
4. Results & Conclusions

Speech Recognition as Source Formation

• Automatic Speech Recognition (ASR): the most advanced sound analysis
• ASR extracts abstract information from sound
  - (i.e. words)
  - even in mixtures (noisy backgrounds) .. a bit
• ASR is not signal extraction: only certain signal information is recovered
  - .. just the bits we care about
• Not CASA preprocessing for ASR: Instead, approach ASR as an example of CASA
  - words = description of source properties
  - uses strong prior constraints: signal models
  - but: must handle mixtures!

How ASR Represents Speech

• Markov model structure: states + transitions
• Generative model
  - but not a good speech generator!
  - only meant for inference of p(X | M)

[Figure: Markov models M'1 and M'2 over states K, A, S, T, E, O with state transition probabilities; state models (means), freq / kHz; spectrogram, freq / kHz vs. time / sec]

Sequence Recognition

• Statistical Pattern Recognition (M: models, X: observations):
  $$M^* = \arg\max_M P(M \mid X) = \arg\max_M \frac{P(X \mid M) \cdot P(M)}{P(X)}$$
• Markov assumption decomposes into frames:
  $$P(X \mid M) = \prod_n p(x_n \mid m_n)\, p(m_n \mid m_{n-1})$$
• Solve by searching over all possible state sequences {m_n} .. but with efficient pruning:

[Figure: search tree from root over phone sequences for the words DO, DECOY, DECODE, DECODES, DECODER]

Approaches to sound mixture recognition

• Separate signals, then recognize
  - e.g. (traditional) CASA, ICA
  - nice, if you can do it
• Recognize combined signal
  - ‘multicondition training’
  - combinatorics..
• Recognize with parallel models
  - full joint-state space?
  - divide signal into fragments, then use missing-data recognition

Outline
1. Computational Auditory Scene Analysis
2. Speech Recognition as Source Formation
3. Sound Fragment Decoding
  - Missing Data Recognition
  - Considering alternate segmentations
4. Results & Conclusions

Sound Fragment Decoding

• Signal separation is too hard! Instead:
  - segregate features into partially-observed sources
  - then classify
• Made possible by missing data recognition
  - integrate over uncertainty in observations for true posterior distribution
• Goal: Relate clean speech models P(X | M) to speech-plus-noise mixture observations
  - .. and make it tractable

Missing Data Recognition

• Speech models p(x | m) are multidimensional...
  - i.e. means, variances for every freq. channel
  - need values for all dimensions to get p(•)
• But: can evaluate over a subset of dimensions x_k
  $$p(x_k \mid m) = \int p(x_k, x_u \mid m)\, dx_u$$
• Hence, missing data recognition:
  $$P(x \mid q) = P(x_1 \mid q) \cdot P(x_2 \mid q) \cdot P(x_3 \mid q) \cdot P(x_4 \mid q) \cdot P(x_5 \mid q) \cdot P(x_6 \mid q)$$
  - hard part is finding the mask (segregation)

[Figure: joint density p(x_k, x_u) over axes x_k and x_u with bound y on x_u; marginal p(x_k) and bounded p(x_k | x_u < y)]
[Figure: present data mask; axes time → and dimension →]

Missing Data Results

• Estimate static background noise level N(f)
• Cells with energy close to background are considered “missing”
  - must use spectral features!
• But: nonstationary noise → spurious mask bits
  - can we try removing parts of mask?

[Figure: "1754" + noise → SNR mask → Missing Data Classifier → "1754"]
[Figure: Factory Noise; digit recognition accuracy / % vs. SNR (dB); curves: a priori, missing data, MFCC+CMN]

Comparing different segregations

• Standard classification chooses between models M to match source features X
• Mixtures: observed features Y,
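
The missing-data pipeline described in the Missing Data Recognition and Missing Data Results slides (static noise estimate → SNR mask → classifier over present dimensions) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes diagonal-covariance Gaussian state models, so that integrating out the missing dimensions x_u just drops their factors. The function names, the 3 dB margin, and the toy models "a" and "b" are all hypothetical, and the bounded form p(x_k | x_u < y) shown in the slide figure is not implemented.

```python
import math

def snr_mask(frame_energy_db, noise_floor_db, margin_db=3.0):
    """Mark a channel as present (True) when its energy exceeds the static
    noise estimate N(f) by more than margin_db (an assumed threshold)."""
    return [e - n > margin_db for e, n in zip(frame_energy_db, noise_floor_db)]

def marginal_log_likelihood(x, mask, means, variances):
    """log p(x_k | m) for a diagonal-covariance Gaussian model.

    With independent dimensions, marginalizing over the missing values
    leaves only the factors of the present dimensions."""
    total = 0.0
    for xi, present, mu, var in zip(x, mask, means, variances):
        if present:
            total += -0.5 * (math.log(2.0 * math.pi * var) + (xi - mu) ** 2 / var)
    return total

def classify(x, mask, models):
    """Return the name of the model with the highest marginal log-likelihood.

    models maps a name to a (means, variances) pair (hypothetical layout)."""
    return max(models, key=lambda name: marginal_log_likelihood(x, mask, *models[name]))

# Toy example: four spectral channels, energies in dB.
noise_floor = [10.0, 10.0, 10.0, 10.0]
frame = [30.0, 11.0, 28.0, 9.0]          # channels 1 and 3 are noise-dominated
models = {
    "a": ([30.0, 30.0, 28.0, 28.0], [4.0] * 4),
    "b": ([20.0, 11.0, 20.0, 9.0], [4.0] * 4),
}

mask = snr_mask(frame, noise_floor)
print(mask)
print(classify(frame, mask, models))        # decision using present cells only
print(classify(frame, [True] * 4, models))  # decision using every cell
```

In this toy case the two decisions differ: the noise-dominated channels happen to match model "b", so scoring every cell favors "b", while scoring only the cells marked present favors "a". As the slide notes, the hard part in practice is finding the mask, which this sketch reduces to a fixed threshold.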

