ASR Evaluation
Julia Hirschberg
CS 4706

Outline
- Intrinsic Methods
  - Transcription Accuracy
    - Word Error Rate
    - Automatic methods, toolkits
    - Limitations
  - Concept Accuracy
    - Limitations
- Extrinsic Methods

Evaluation
- How to evaluate the "goodness" of a word string output by a speech recognizer?
- Terms:
  - ASR hypothesis: ASR output
  - Reference transcription: ground truth, what was actually said

01/14/19  Speech and Language Processing, Jurafsky and Martin

Transcription Accuracy
- Word Error Rate (WER)
  - Minimum Edit Distance: distance in words between the ASR hypothesis and the reference transcription
  - Edit Distance = (Substitutions + Insertions + Deletions) / N, where N is the number of words in the reference transcription
  - For ASR, all three error types are usually weighted equally, but different weights can be used to minimize different types of errors
  - WER = Edit Distance × 100

WER Calculation
- Word Error Rate = 100 × (Insertions + Substitutions + Deletions) / (Total Words in Correct Transcript)
- Alignment example:

  REF:  portable **** PHONE UPSTAIRS last night so
  HYP:  portable FORM OF    STORES   last night so
  Eval:          I    S     S

  WER = 100 × (1 + 2 + 0) / 6 = 50%

- Alignment example:

  REF:  portable   **** phone upstairs last night so ***
  HYP:  preferable form of    stores   next light so far
  Eval: S          I    S     S        S    S        I

  WER = 100 × (2 + 5 + 0) / 6 = 117%
  (WER can exceed 100% when the hypothesis contains insertions.)

NIST sctk-1.3 scoring software: Computing WER with sclite
- http://www.nist.gov/speech/tools/
- Sclite aligns a hypothesized text (HYP, from the recognizer) with a correct or reference text (REF, human transcribed)

  id: (2347-b-013)
  Scores: (#C #S #D #I) 9 3 1 2
  REF:  was an engineer SO I   i was always with **** **** MEN UM   and they
  HYP:  was an engineer ** AND i was always with THEM THEY ALL THAT and they
  Eval:                 D  S                     I    I    S   S

Sclite output for error analysis

  CONFUSION PAIRS        Total                  (972)
                         With >= 1 occurances   (972)

   1:   6  ->  (%hesitation) ==> on
   2:   6  ->  the ==> that
   3:   5  ->  but ==> that
   4:   4  ->  a ==> the
   5:   4  ->  four ==> for
   6:   4  ->  in ==> and
   7:   4  ->  there ==> that
   8:   3  ->  (%hesitation) ==> and
   9:   3  ->  (%hesitation) ==> the
  10:   3  ->  a ==> i
  11:   3  ->  and ==> i
  12:   3  ->  and ==> in
  13:   3  ->  are ==> there
  14:   3  ->  as ==> is
  15:   3  ->  have ==> that
  16:   3  ->  is ==> this
  17:   3  ->  it ==> that
  18:   3  ->  mouse ==> most
  19:   3  ->  was ==> is
  20:   3  ->  was ==> this
  21:   3  ->  you ==> we
  22:   2  ->  (%hesitation) ==> it
  23:   2  ->  (%hesitation) ==> that
  24:   2  ->  (%hesitation) ==> to
  25:   2  ->  (%hesitation) ==> yeah
  26:   2  ->  a ==> all
  27:   2  ->  a ==> know
  28:   2  ->  a ==> you
  29:   2  ->  along ==> well
  30:   2  ->  and ==> it
  31:   2  ->  and ==> we
  32:   2  ->  and ==> you
  33:   2  ->  are ==> i
  34:   2  ->  are ==> were

Other Types of Error Analysis
- What speakers are most often misrecognized? (Doddington '98)
  - Sheep: speakers who are easily recognized
  - Goats: speakers who are really hard to recognize
  - Lambs: speakers who are easily impersonated
  - Wolves: speakers who are good at impersonating others
- What (context-dependent) phones are least well recognized?
  - Can we predict this?
- What words are most confusable (confusability matrix)?
  - Can we predict this?

Are there better metrics than WER?
- WER is useful for computing transcription accuracy
- But should we be more concerned with meaning ("semantic error rate")?
  - Good idea, but hard to agree on an approach
  - Applied mostly in spoken dialogue systems, where the semantics desired is clear
  - What ASR applications will be different?
    - Speech-to-speech translation?
    - Medical dictation systems?

Concept Accuracy
- Spoken Dialogue Systems are often based on recognition of Domain Concepts
- Input: I want to go to Boston from Baltimore on September 29.
- Goal: Maximize concept accuracy (measured against the total number of domain concepts in the reference transcription of the user input)

  Concept        Value
  Source City    Baltimore
  Target City    Boston
  Travel Date    Sept 29

- CA Score: How many domain concepts were correctly recognized, out of the total N mentioned in the reference transcription
  - Reference: I want to go from Boston to Baltimore on September 29
  - Hypothesis: Go from Boston to Baltimore on December 29
  - 2 concepts correctly recognized / 3 concepts in ref transcription × 100 ≈ 67% Concept Accuracy
- What is the WER?
  - (0 Ins + 2 Subst + 3 Del) / 11 × 100 = 45% WER (55% Word Accuracy)
  - Counting case-sensitively: "I want to" is deleted (3 Del); go/Go and September/December are substituted (2 Subst)

Sentence Error Rate
- Percentage of sentences with at least one error
  - Transcription error
  - Concept error

Which Metric is Better?
- Transcription accuracy?
- Semantic accuracy?

Next Class
- Human speech perception
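
The WER calculation from the Word Error Rate slides can be sketched in Python as a minimum-edit-distance alignment in which substitutions, insertions, and deletions all cost 1. This is a minimal sketch, not sclite's implementation: the names `WerResult` and `score` are illustrative, words are split on whitespace, and matching ignores case unless `case_sensitive=True` is passed.

```python
from dataclasses import dataclass


@dataclass
class WerResult:
    substitutions: int
    insertions: int
    deletions: int
    ref_words: int

    @property
    def errors(self) -> int:
        return self.substitutions + self.insertions + self.deletions

    @property
    def wer(self) -> float:
        # WER = 100 * (I + S + D) / (words in reference transcription)
        return 100.0 * self.errors / self.ref_words


def score(ref: str, hyp: str, case_sensitive: bool = False) -> WerResult:
    """Align hyp against ref with equal-weight edit distance and count S/I/D."""
    if not case_sensitive:
        ref, hyp = ref.lower(), hyp.lower()
    r, h = ref.split(), hyp.split()
    if not r:
        raise ValueError("reference transcription is empty")

    # dist[i][j] = minimum number of edits turning r[:i] into h[:j]
    dist = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        dist[i][0] = i  # delete every reference word
    for j in range(1, len(h) + 1):
        dist[0][j] = j  # insert every hypothesis word
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub_cost = 0 if r[i - 1] == h[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + sub_cost,  # match or substitution
                dist[i - 1][j] + 1,             # deletion
                dist[i][j - 1] + 1,             # insertion
            )

    # Walk back from the bottom-right cell to count each error type.
    i, j = len(r), len(h)
    subs = ins = dels = 0
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub_cost = 0 if r[i - 1] == h[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + sub_cost:
                subs += sub_cost
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return WerResult(subs, ins, dels, len(r))


first = score("portable phone upstairs last night so",
              "portable form of stores last night so")
print(first, f"WER = {first.wer:.1f}%")  # WER = 50.0%
```

Scored this way, the Concept Accuracy example (reference "I want to go from Boston to Baltimore on September 29", hypothesis "Go from Boston to Baltimore on December 29") has 5 errors in 11 words with `case_sensitive=True` and 4 errors when case is ignored, since Go/go then counts as a match.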
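
The Concept Accuracy and Sentence Error Rate definitions can be sketched the same way. This sketch assumes that some upstream parser, not shown here, has already mapped each transcription to slot/value pairs; the slot names (`source_city`, `target_city`, `travel_date`) and both function names are hypothetical.

```python
from typing import Mapping, Sequence, Tuple


def concept_accuracy(ref_concepts: Mapping[str, str],
                     hyp_concepts: Mapping[str, str]) -> float:
    """Percentage of reference concepts whose value was recognized correctly."""
    if not ref_concepts:
        raise ValueError("reference has no domain concepts")
    correct = sum(
        1 for slot, value in ref_concepts.items()
        if hyp_concepts.get(slot) == value
    )
    return 100.0 * correct / len(ref_concepts)


def sentence_error_rate(pairs: Sequence[Tuple[str, str]]) -> float:
    """Percentage of (reference, hypothesis) pairs with at least one word error."""
    if not pairs:
        raise ValueError("no sentences to score")
    wrong = sum(
        1 for ref, hyp in pairs
        if ref.lower().split() != hyp.lower().split()
    )
    return 100.0 * wrong / len(pairs)


# Concepts for the reference/hypothesis pair on the Concept Accuracy slide
# (slot names are assumptions made for this sketch).
ref_concepts = {"source_city": "Boston", "target_city": "Baltimore",
                "travel_date": "September 29"}
hyp_concepts = {"source_city": "Boston", "target_city": "Baltimore",
                "travel_date": "December 29"}
print(f"CA = {concept_accuracy(ref_concepts, hyp_concepts):.1f}%")  # CA = 66.7%
```

Here a sentence counts as wrong on any transcription error; a concept-level variant would compare the concept dictionaries instead of the word sequences.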