
Slide 1: Why is ASR Hard?
• Natural speech is continuous
• Natural speech has disfluencies
• Natural speech is variable over: global rate, local rate, pronunciation within a speaker, pronunciation across speakers, and phonemes in different contexts

Slide 2: Why is ASR Hard? (continued)
• Large vocabularies are confusable
• Out-of-vocabulary words are inevitable
• Recorded speech is variable over: room acoustics, channel characteristics, background noise
• Large training times are not practical
• User expectations are for performance equal to or greater than human performance

Slide 3: Main Causes of Speech Variability
Environment:
• Speech-correlated noise: reverberation, reflection
• Uncorrelated noise: additive noise (stationary, nonstationary)
Speaker:
• Attributes of speakers: dialect, gender, age
• Manner of speaking: breath and lip noise, stress, Lombard effect, rate, level, pitch, cooperativeness
Input Equipment:
• Microphone (transmitter)
• Distance from microphone
• Filter
• Transmission system: distortion, noise, echo
• Recording equipment

Slide 4: ASR Dimensions
• Speaker dependent vs. independent
• Isolated, continuous, or keyword recognition
• Lexicon size and difficulty
• Task constraints, perplexity
• Adverse or easy conditions
• Natural or read speech

Slide 5: Telephone Speech
• Limited bandwidth (F vs S)
• Large speaker variability
• Large noise variability
• Channel distortion
• Different handset microphones
• Mobile and hands-free acoustics

Slide 6: Automatic Speech Recognition
Pipeline: Data Collection, Pre-processing, Feature Extraction, Hypothesis Generation, Cost Estimator, Decoding

Slide 7: Pre-processing
Speech passes through room acoustics and the microphone, then linear filtering, then sampling and digitization.
Issue: effect on modeling

Slide 8: Feature Extraction
Spectral analysis, then an auditory model and normalizations.
Issue: design for discrimination

Slide 9: Representations are Important
• Network on the speech waveform: 23% frame correct
• Network on PLP features: 70% frame correct

Slide 10: Hypothesis Generation
Issue: models of language and task
Example hypotheses over the words "cat", "dog": "a dog is not a cat" vs. "a cat not is a dog"

Slide 11: Cost Estimation
• Distances
• -Log probabilities, from: discrete distributions, Gaussians, mixtures, neural networks

Slide 12: Decoding

Slide 13: Pronunciation Models

Slide 14: Language Models
Choose the most likely words, i.e. those with the largest product P(acoustics | words) · P(words), where P(words) = ∏ P(word | history)
• bigram: history is the previous word
• trigram: history is the previous 2 words
• n-gram: history is the previous n-1 words

Slide 15: System Architecture
Speech goes through Signal Processing, a Probability Estimator, and a Decoder to produce Recognized Words; a Pronunciation Lexicon feeds the decoder.
Example outputs: "zero", "three", "two"; probabilities: "z" = 0.81, "th" = 0.15, "t" =
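The sampling-and-digitization step on the Pre-processing slide can be sketched as uniform quantization. This is a generic illustration, not code from the course; the bit depth and sample values are invented:

```python
def quantize(samples, bits=16):
    """Uniformly quantize samples in [-1, 1) to signed integers,
    as happens during digitization; values outside the range clip,
    modeling an overdriven input."""
    levels = 2 ** (bits - 1)
    out = []
    for s in samples:
        q = int(s * levels)
        out.append(max(-levels, min(levels - 1, q)))  # clip to the integer range
    return out

print(quantize([0.0, 0.5, -0.25, 1.5], bits=16))  # [0, 16384, -8192, 32767]
```

Note how 1.5 clips to the largest representable value, one source of the recording-equipment distortion listed on the variability slide.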
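The Feature Extraction slide (spectral analysis followed by normalization) can be sketched with a plain log-magnitude short-time spectrum. This stands in for a real PLP or MFCC front end, which would add an auditory-scale filterbank; the frame sizes correspond to the common 25 ms / 10 ms choice at 16 kHz, an assumption not stated on the slides:

```python
import numpy as np

def log_spectral_features(waveform, frame_len=400, hop=160):
    """Frame the signal, window each frame, and take log-magnitude spectra,
    then apply per-utterance mean normalization (a simple channel normalization)."""
    n_frames = 1 + (len(waveform) - frame_len) // hop
    window = np.hamming(frame_len)
    feats = []
    for i in range(n_frames):
        frame = waveform[i * hop : i * hop + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame))
        feats.append(np.log(spectrum + 1e-10))  # log compression, floored to avoid log(0)
    feats = np.array(feats)
    return feats - feats.mean(axis=0)

# 1 second of synthetic noise at 16 kHz -> 25 ms frames every 10 ms
x = np.random.default_rng(0).standard_normal(16000)
F = log_spectral_features(x)
print(F.shape)  # (98, 201)
```

The mean subtraction is one concrete example of the "normalizations" box on the slide: it removes a fixed linear-channel coloration, addressing part of the channel variability listed earlier.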
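The Cost Estimation slide lists -log probabilities from Gaussians as one cost. A minimal sketch with a diagonal-covariance Gaussian, where the feature vectors and parameters are invented for illustration:

```python
import math

def neg_log_gaussian(x, mean, var):
    """Negative log likelihood of vector x under a diagonal Gaussian.

    Lower cost means a better match; summing per-dimension terms assumes
    independent dimensions (the usual diagonal-covariance shortcut)."""
    cost = 0.0
    for xi, mi, vi in zip(x, mean, var):
        cost += 0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return cost

# A frame near the model mean costs less than one far away.
mean, var = [0.0, 1.0], [1.0, 0.5]
print(neg_log_gaussian([0.1, 1.1], mean, var) <
      neg_log_gaussian([3.0, -2.0], mean, var))  # True
```

Mixtures and neural networks, also listed on the slide, replace this single Gaussian with richer estimators but plug into decoding the same way, as costs summed in the log domain.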
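The Language Models slide factors P(words) into a product of P(word | history); for a bigram the history is the previous word. A sketch with counts from a toy corpus, where the corpus, the `<s>` start token, and the add-alpha smoothing constant are all invented for illustration:

```python
from collections import Counter
import math

def train_bigram(sentences):
    """Count unigram (as-history) and bigram occurrences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent.split()
        for prev, cur in zip(words, words[1:]):
            unigrams[prev] += 1
            bigrams[(prev, cur)] += 1
    return unigrams, bigrams

def log_prob(sentence, unigrams, bigrams, vocab_size, alpha=1.0):
    """Sum of log P(word | previous word) with add-alpha smoothing."""
    words = ["<s>"] + sentence.split()
    lp = 0.0
    for prev, cur in zip(words, words[1:]):
        p = (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab_size)
        lp += math.log(p)
    return lp

corpus = ["a dog is not a cat", "a cat is not a dog"]
uni, bi = train_bigram(corpus)
V = len({w for s in corpus for w in s.split()} | {"<s>"})
# The grammatical hypothesis from the Hypothesis Generation slide scores
# higher than the scrambled one.
print(log_prob("a dog is not a cat", uni, bi, V) >
      log_prob("a cat not is a dog", uni, bi, V))  # True
```

This is exactly the role the language model plays in the decoder: ranking the two cat/dog hypotheses from the Hypothesis Generation slide.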
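The decoding criterion on the Language Models slide, picking the words that maximize P(acoustics | words) · P(words), reduces for isolated words to an argmax over a log-domain sum. The probability values below are invented and are not the "z"/"th"/"t" numbers from the architecture slide:

```python
import math

def decode(acoustic_log_probs, prior_log_probs):
    """Pick the word maximizing log P(acoustics | word) + log P(word)."""
    return max(acoustic_log_probs,
               key=lambda w: acoustic_log_probs[w] + prior_log_probs[w])

# Invented scores: the acoustic model slightly prefers "three",
# but the language-model prior tips the decision to "two".
acoustic = {"zero": math.log(0.05), "three": math.log(0.40), "two": math.log(0.35)}
prior    = {"zero": math.log(0.10), "three": math.log(0.20), "two": math.log(0.70)}
print(decode(acoustic, prior))  # two
```

Continuous-speech decoding searches over word sequences rather than single words, but the objective, the largest product of acoustic and language probabilities, is the same.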



Berkeley ELENG 225D - Lecture Notes
