
Issues for Processing Speech
Mary Harper, 12/4/08

Spontaneous Speech Challenges Language Processing Approaches

Example transcript (output of speech recognition):
so we need but how do we get them out I say we have we set a string of charges that will root them out the back so t the charges start at the front and just explode and blow a little something up but are really really loud and and marsupials have really good ears so that'll be real that'll really frighten them

The Challenges of Spontaneous Speech for Language Processing

- Difficult for the recognizer (ASR errors: insertions, deletions, and substitutions)
  - Phenomena atypical of textual sources (e.g., filled pauses, speech repairs)
  - Acoustic challenges (fragments, filled pauses, coarticulation)
  - Language models do not currently model disfluencies adequately
- Recognition output is difficult for humans to read
- Recognition output is difficult for NLP
  - Sentence boundaries are NOT provided, and ASR segments are often inappropriate
  - Utterances are different (planned on the fly) from written text
  - Much of spoken language is used for organizing the communication (e.g., "And so")
  - Speech repairs are challenging

Assorted Spontaneous Speech Phenomena

- Filled pauses: I think it's uh refreshing to see the uh support
- Parentheticals: but you know I was reading the other day
- Speech repairs: why didn't he why didn't she stay at home
- Partial words: cut between these t these trees
- Ungrammatical constructions: my friends is visiting me

Enrich Word Stream with Structural Metadata

[Example: the spontaneous-speech transcript shown above]

Metadata Extraction and Transcription

- Speech-to-Text (STT): essential core capability
  - Goal: reduce STT errors
  - Output: words, times, confidences
- Metadata Extraction (MDE)
  - Goal: clean up and enrich output
  - Output: speakers, boundaries, disfluencies
- Result: Rich Transcript

Structural Metadata Extraction Tasks

- Sentence Unit (SU) detection: find the sentence-like units and their subtypes
- Filler word detection: filled pauses, discourse markers (e.g., "you know"), explicit editing terms
- Interruption point (IP) detection, e.g., we have we set a string of charges
- Edit word detection: the reparandum region of a speech repair, e.g., we have we set a string of charges

Motivation for Rich Transcriptions

Adding additional information to a transcription should:
- Aid downstream language processing (provide sentence boundaries, indicate structure of disfluencies)
- Improve readability to humans (adding punctuation, removing disfluencies), e.g., MITLL readability experiments
- Improve ASR performance (e.g., feed back metadata information to the recognizer to aid language models), e.g., work by Sebastien Coquoz, visiting ICSI from EPFL

Feedback Structural Information to ASR
(Work by Sebastien Coquoz, visiting ICSI from EPFL)

- Motivation
  - Linguistic segments are more appropriate for LMs than acoustically segmented (speech vs. non-speech) chunks
  - Error analysis of BN recognition reveals a higher error rate at ASR segment boundaries
- Approach: use automatically detected sentence boundaries to re-segment speech and then re-recognize
- Results on BN corpus (RT-03 eval set)

  Segmentation                            WER
  Recognizer (automatic ASR segments)     14.0%
  Use reference boundary information      13.0%
  Using system boundary information       13.3% (so far)

- This segmentation helps STT

The Challenge of Parsing Speech

There is a mismatch between ASR systems and statistical parsers: segments processed by an ASR system do not typically correspond to segments that statistical parsers normally work with.

ASR systems:
- Produce long word strings without punctuation
- Word strings often contain errors (insertions, deletions, and substitutions)
- Word strings contain phenomena that do not typically occur in textual sources (e.g., filled pauses, speech repairs)

Traditional parsers are text-based:
- Don't use acoustic cues
- Process sentences, not segments
- Process input without word errors
- Process textual input without spontaneous speech phenomena

How to Enable Effective Downstream Processing of Speech

- Metadata extraction
  - Providing sentence boundaries and disfluency annotations
  - Challenging: speech is difficult
- Parsing
  - Structure enables other downstream processing
  - Challenging: parsing has been traditionally text-centered
  - Need to deal with speech-related phenomena
  - Performance metrics exist for parsing text that need to be adapted to speech

Data Resources

- The RT-04 conversational telephone speech data annotated with structural metadata was used in the RT-04 MDE benchmark tests
- Gold-standard parses from the LDC treebanking team for dev, dev2, and eval sets
- Recognition output from state-of-the-art recognizers for the EARS RT-04 data
- Using this new data allowed us to evaluate the synergy between parsing and MDE system performance

  Set     Conversations   SUs    Words
  dev     72              11K    71K
  dev2    36              5K     35K
  eval    36              5K     34K

Resource: JHU Speech Parsing Corpus (LDC2005E15)

A unified conversational speech resource with consistent metadata markups and parse trees.
- Metadata markup: sentence boundaries, speech disfluencies
- Treebank parse trees: syntactic structure, restarts

An Example

Metadata markup:
FL ST well FL END EDIT ST i EDIT END i know it's cold outside there now huh

Parse tree (node labels and words in preorder):
S1 S INTJ UH well EDITED S NP PRP i DISFL IP NP PRP i VP VBP know SBAR S NP PRP it VP BES 's ADJP JJ cold ADVP RB outside ADVP RB there RB now INTJ UH huh

An Example

Metadata markup:
FL ST well FL END EDIT ST i EDIT END i know it's cold outside there now huh

Parse tree (node labels and words in preorder):
S1 S INTJ UH well EDITED S NP PRP i NP PRP i VP VBP know SBAR S NP PRP it VP BES 's ADJP JJ cold ADVP RB outside ADVP RB there RB now INTJ UH huh

Data

- 144 conversations, 140,000 words, 21,000 SUs (syntactic/semantic units)
- Transcribed English conversational telephone speech, originally developed for the DARPA EARS (Efficient, Affordable, Reusable Speech-To-Text) Program
- Switchboard (LDC97S62) and Fisher Protocol (LDC2004E16, LDC2004E29, LDC2005E73)
- The Fisher data was carefully transcribed at LDC using RT-04 Transcription Specification Version 3.1

Measuring Parse Accuracy on Speech

- Parsing techniques are now being applied to automatic speech recognition (ASR) output with:
  - Automatic transcripts
  - Automatically generated sentence segments (SUs)
  that differ in many cases from the gold words and segments
- This creates the need to develop and evaluate new methods for determining spoken parse accuracy that support evaluation when the yields of gold-standard parse trees may differ from
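
The slides quote recognition results as word error rate (WER) and describe ASR errors as insertions, deletions, and substitutions, but they do not spell out how the number is computed. The following is a minimal sketch, assuming the standard definition WER = (substitutions + deletions + insertions) / reference length over a minimum-edit-distance word alignment. The function names `edit_counts` and `wer` are hypothetical and are not taken from the lecture or from any scoring tool.

```python
def edit_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) for one minimum-cost
    word alignment of the hypothesis against the reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    # table[i][j] = (total_edits, subs, dels, ins) for ref[:i] vs hyp[:j]
    table = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0, i, 0)      # only deletions
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j, 0, 0, j)      # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            total, subs, dels, ins = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (total, subs, dels, ins)          # match
            else:
                diagonal = (total + 1, subs + 1, dels, ins)  # substitution
            total, subs, dels, ins = table[i - 1][j]
            deletion = (total + 1, subs, dels + 1, ins)
            total, subs, dels, ins = table[i][j - 1]
            insertion = (total + 1, subs, dels, ins + 1)
            # Tuples compare on total edits first, so min() keeps a cheapest path.
            table[i][j] = min(diagonal, deletion, insertion)
    return table[len(ref)][len(hyp)][1:]


def wer(reference, hypothesis):
    """Word error rate: (S + D + I) divided by the number of reference words."""
    n_ref = len(reference.split())
    if n_ref == 0:
        raise ValueError("reference must contain at least one word")
    return sum(edit_counts(reference, hypothesis)) / n_ref
```

As an illustration only, scoring the speech-repair example "why didn't he why didn't she stay at home" against the intended sentence "why didn't she stay at home" yields three insertions over six reference words, a WER of 0.5. Treating the repaired utterance as if it were a recognizer hypothesis is an assumption made for the example; it is not how the slides evaluate anything.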



UMD CMSC 723 - Issues for Processing Speech
