Columbia COMS W4706 - Evaluating Spoken Dialogue Systems

Outline
• Evaluating Spoken Dialogue Systems
• Dialogue System Evaluation
• Evaluating Dialogue Systems
• Task Success
• Efficiency Cost
• Quality Cost
• Another Key Quality Cost
• PARADISE: Regress against User Satisfaction
• Regressing against User Satisfaction
• Experimental Procedures
• User Satisfaction: Sum of Many Measures
• Performance Functions from Three Systems
• Performance Model
• Now That We Have a Success Metric
• Recognizing 'Problematic' Dialogues
• Corpus
• DATE Dialogue Act Extraction
• Features Used in Prediction
• Results
• Summary

Evaluating Spoken Dialogue Systems
Julia Hirschberg
CS 4706

Dialogue System Evaluation
• A key point about SLP: whenever we design a new algorithm or build a new application, we need to evaluate it.
• Two kinds of evaluation:
  – Extrinsic: embedded in some external task
  – Intrinsic: some more local evaluation
• How do we evaluate a dialogue system?
• What constitutes success or failure for a dialogue system?

Dialogue System Evaluation (continued)
• We need an evaluation metric because:
  1) We need a metric to help compare different implementations
     • Can't improve a system if we don't know where it fails
     • Can't decide between two algorithms without a goodness metric
  2) We need a metric for "how well a dialogue went" as an input to reinforcement learning
     • Automatically improve conversational agent performance via learning

Evaluating Dialogue Systems
• PARADISE framework (Walker et al. '00)
• The "performance" of a dialogue system is affected both by what gets accomplished by the user and the dialogue agent and by how it gets accomplished:
  – Maximize task success
  – Minimize costs (efficiency measures and qualitative measures)

Task Success
• % of subtasks completed
• Correctness of each question/answer/error message
• Correctness of the total solution
  – Attribute-Value Matrix (AVM)
  – Kappa coefficient
• Users' perception of whether the task was completed

Task Success (continued)
• Task goals seen as an Attribute-Value Matrix
• ELVIS e-mail retrieval task (Walker et al. '97): "Find the time and place of your meeting with Kim."

  Attribute            Value
  Selection Criterion  Kim or Meeting
  Time                 10:30 a.m.
  Place                2D516

• Task success can be defined by the match between the AVM values at the end of the task and the "true" values for the AVM; the kappa sketch below makes this concrete.
(Slide from Julia Hirschberg)
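The kappa coefficient mentioned above corrects the raw attribute-match rate for chance agreement: kappa = (P(A) - P(E)) / (1 - P(E)), where P(A) is the proportion of AVM slots whose final value matches the scenario key and P(E) is the agreement expected by chance. The sketch below is a simplified, pooled version of that computation, not Walker et al.'s exact procedure, and the example AVMs are invented.

```python
# Simplified sketch: task success as kappa agreement between the AVM values
# reached at the end of each dialogue and the scenario's "true" AVM values.
# (Not Walker et al.'s exact computation; the example data are invented.)
from collections import Counter

def avm_kappa(observed_avms, key_avms):
    """observed_avms / key_avms: parallel lists of {attribute: value} dicts,
    one pair per dialogue."""
    pairs = []  # (true value, observed value) for every attribute slot
    for obs, key in zip(observed_avms, key_avms):
        for attr, true_val in key.items():
            pairs.append((true_val, obs.get(attr)))

    total = len(pairs)
    p_agree = sum(1 for t, o in pairs if t == o) / total  # P(A)

    # P(E): chance agreement, from the marginal frequency of each value
    true_counts = Counter(t for t, _ in pairs)
    obs_counts = Counter(o for _, o in pairs)
    p_chance = sum(true_counts[v] * obs_counts[v] for v in true_counts) / total ** 2

    return (p_agree - p_chance) / (1 - p_chance)

# Two ELVIS-style dialogues; the second ends with the wrong Place value.
keys = [{"Selection Criterion": "Kim", "Time": "10:30 a.m.", "Place": "2D516"},
        {"Selection Criterion": "Meeting", "Time": "2:00 p.m.", "Place": "4E210"}]
observed = [{"Selection Criterion": "Kim", "Time": "10:30 a.m.", "Place": "2D516"},
            {"Selection Criterion": "Meeting", "Time": "2:00 p.m.", "Place": "2D516"}]
print(round(avm_kappa(observed, keys), 3))  # 0.8: high, but penalized for the one error
```

A plain percent-correct measure would give 5/6 here; kappa discounts the credit a system would get for agreeing by chance, which matters when some attribute values are far more common than others.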
Efficiency Cost
• Polifroni et al. (1992), Danieli and Gerbino (1995), Hirschman and Pao (1993)
• Total elapsed time in seconds or turns
• Number of queries
• Turn correction ratio: the number of system or user turns used solely to correct errors, divided by the total number of turns

Quality Cost
• # of times the ASR system failed to return any sentence
• # of ASR rejection prompts
• # of times the user had to barge in
• # of time-out prompts
• Inappropriateness (verbose, ambiguous) of the system's questions, answers, and error messages

Another Key Quality Cost
• "Concept accuracy" or "concept error rate": the % of semantic concepts that the NLU component returns correctly
• Example: "I want to arrive in Austin at 5:00"
  – DESTCITY: Boston
  – TIME: 5:00
• One of the two concepts is correct, so concept accuracy = 50%
• Average this across the entire dialogue: "How many of the sentences did the system understand correctly?"

PARADISE: Regress against User Satisfaction
(Figure slide)

Regressing against User Satisfaction
• Questionnaire to assign each dialogue a "user satisfaction rating": the dependent measure
• Cost and success factors: the independent measures
• Use regression to train weights for each factor

Experimental Procedures
• Subjects are given specified tasks
• Spoken dialogues are recorded
• Cost factors, states, and dialogue acts are automatically logged; ASR accuracy and barge-in are hand-labeled
• Users specify the task solution via a web page
• Users complete User Satisfaction surveys
• Use multiple linear regression to model User Satisfaction as a function of Task Success and Costs; test for significant predictive factors (see the sketch below)

User Satisfaction: Sum of Many Measures
• Was the system easy to understand? (TTS Performance)
• Did the system understand what you said? (ASR Performance)
• Was it easy to find the message/plane/train you wanted? (Task Ease)
• Was the pace of interaction with the system appropriate? (Interaction Pace)
• Did you know what you could say at each point of the dialogue? (User Expertise)
• How often was the system sluggish and slow to reply to you? (System Response)
• Did the system work the way you expected it to in this conversation? (Expected Behavior)
• Do you think you'd use the system regularly in the future? (Future Use)
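PARADISE obtains weights like those on the next slide by multiple linear regression: the summed survey score is the dependent variable, and the (z-score normalized) task success and cost measures are the predictors. Below is a minimal sketch with invented per-dialogue numbers, using numpy's least-squares solver in place of a statistics package.

```python
# Minimal sketch of a PARADISE-style regression (not the original code):
# fit User Satisfaction (sum of the survey items) as a linear function of
# task success and cost measures. All per-dialogue numbers are invented.
import numpy as np

# One row per dialogue: [COMP (perceived completion), MRS (mean recognition
# score), ET (elapsed time in seconds)] -- hypothetical values.
X = np.array([
    [1.0, 0.95, 180.0],
    [1.0, 0.80, 240.0],
    [0.0, 0.60, 400.0],
    [1.0, 0.90, 200.0],
    [0.0, 0.55, 350.0],
    [1.0, 0.70, 300.0],
])
# User Satisfaction: summed survey responses for each dialogue (hypothetical).
y = np.array([34.0, 30.0, 15.0, 33.0, 17.0, 25.0])

# z-score normalize each predictor so the fitted weights are comparable,
# as PARADISE does before regressing.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Multiple linear regression with an intercept term, via least squares.
A = np.column_stack([np.ones(len(Z)), Z])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, w_comp, w_mrs, w_et = coef
print(f"UserSat ~ {intercept:.2f} + {w_comp:.2f}*N(COMP) "
      f"+ {w_mrs:.2f}*N(MRS) + {w_et:.2f}*N(ET)")
```

The fitted coefficients play the role of the COMP, MRS, and ET weights reported for ELVIS, TOOT, and ANNIE on the next slide: their signs and relative magnitudes show how much each factor contributes to satisfaction.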
Performance Functions from Three Systems
• ELVIS: User Sat. = .21*COMP + .47*MRS - .15*ET
• TOOT:  User Sat. = .35*COMP + .45*MRS - .14*ET
• ANNIE: User Sat. = .33*COMP + .25*MRS + .33*Help
  – COMP: user's perception of task completion (task success)
  – MRS: mean (concept) recognition accuracy (cost)
  – ET: elapsed time (cost)
  – Help: number of help requests (cost)

Performance Model
• Perceived task completion and mean recognition score (concept accuracy) are consistently significant predictors of User Satisfaction
• The performance model is useful for system development:
  – Making predictions about system modifications
  – Distinguishing 'good' dialogues from 'bad' dialogues
  – Serving as part of a learning model

Now That We Have a Success Metric
• Could we use it to help drive automatic learning?
  – Methods for automatically evaluating system performance
  – A way of obtaining training data for further system development

Recognizing 'Problematic' Dialogues
• Hastie et al., "What's the Trouble?", ACL 2002
• Motivation: build a Problematic Dialogue Identifier (PDI) to classify dialogues (a toy sketch appears below)
• What is a problematic dialogue?
  – The task is not completed
  – User satisfaction is low
• Results:
  – Dialogues in which the task was not completed are identified with 85% accuracy
  – Dialogues with low user satisfaction are identified with 89% accuracy

Corpus
• 1242 recorded dialogues from the DARPA Communicator Corpus
  – Logfiles with events for each user turn
  – ASR and hand transcriptions
  – User information: dialect
  – User Satisfaction survey
  – Task Completion labels
• Goal: predict
  – User Satisfaction (5-25 points)
  – Task Completion (0, 1, 2): none, airline task, airline + ground task

DATE Dialogue Act Extraction

Features Used in Prediction
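The PDI is a classifier over per-dialogue features such as those suggested by the slides above (DATE dialogue act counts plus measures from the Communicator logfiles). The sketch below illustrates the idea with scikit-learn logistic regression on made-up features and labels; it is not the paper's model or feature set, and the 85% and 89% accuracies cited earlier come from the Communicator corpus, not from toy data like this.

```python
# Toy Problematic Dialogue Identifier in the spirit of Hastie et al. (2002).
# Features, labels, and the choice of logistic regression are all
# illustrative assumptions, not the paper's setup.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# One row per dialogue: [number of turns, mean ASR confidence,
# # of reprompts, # of help requests] -- all invented.
X = np.array([
    [12, 0.92, 0, 0],
    [35, 0.61, 5, 2],
    [15, 0.88, 1, 0],
    [40, 0.55, 7, 3],
    [18, 0.90, 1, 1],
    [33, 0.58, 6, 2],
    [14, 0.85, 0, 0],
    [38, 0.60, 6, 3],
])
# 1 = problematic (task not completed or low user satisfaction), 0 = fine.
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

pdi = LogisticRegression()
print("cross-validated accuracy:", cross_val_score(pdi, X, y, cv=4).mean())

# Fit on all dialogues, then flag a new dialogue from its log features.
pdi.fit(X, y)
print("problematic?", bool(pdi.predict([[36, 0.57, 6, 2]])[0]))
```

In practice the labels come from the hand-assigned Task Completion values and the 5-25 point User Satisfaction scores described on the Corpus slide, so the same pipeline can be trained separately for each of the two prediction targets.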

