Spoken Dialog System Architecture
Joshua Gordon
CS4706

Outline
- Examples of deployed and research SDS architectures / conversational speech interfaces
- Discussion of the issues and challenges in SDS design
- A tour of the Olympus SDS architecture, and a flyby of basic design considerations pertinent to:
  - Recognition
  - Spoken language understanding
  - Dialog management / error handling / belief updating
  - Language generation / speech synthesis
  - Interaction management / turn taking

Information Seeking, Transaction Based Spoken Dialog Systems
- Where we are: most of today's production systems are designed for database access and call routing
- Columbia: CheckItOut, virtual librarian
- CMU: Let's Go, Pittsburgh bus schedules
- Google: GOOG-411 directory assistance, Google Voice Search
- MIT: Jupiter, weather information
- Nuance: built-to-order technical support

Speech Aware Kiosks
- Current research at Microsoft
- SDS architectures are beginning to incorporate multimodal input
- "How may I help you? I can provide directory assistance and directions around campus."

Speech Interfaces to Virtual Characters
- SDS architectures are exploring multimodal output (including gesturing and facial expression) to indicate level of understanding
- Negotiate an agreement between soldiers and village elders
- Both auditory and visual cues are used in turn taking
- Prosody and facial expressions convey emotion
- SGT Blackwell (ict.usc.edu, projects / sergeant blackwell)

Speech Interfaces to Robotic Systems
- Next generation systems explore ambitious domains
- www.cellbots.com
- User: "Fly to the red house and photograph the area"

Speech Aware Appliances
- Speech aware appliances are beginning to engage in limited dialogs
- Interactive dialogs / disambiguation are required by multi-field queries and ambiguity in results

| Expected | What user actually said |
|---|---|
| Play artist Glenn Miller | Glenn Miller jazz |
| Play song All Rise | All rise, I guess, from blues |

Human-Human vs Human-Machine Speech
- Is recognition performance the limiting factor?
- Challenges exist in computationally describing conversational
  phenomena, for instance:
  - Evolving discourse structure: consider answering a question with a question
  - Turn taking: auditory cues (let alone gesture) are important; listen to two speakers competing for the conversational floor
  - Grounding: prosody and intonation contours indicate our level of understanding
- Research in SDS architectures addresses frameworks to capture the above; there is a long way to go before we achieve human-like conversational partners
- Other issues: SDS lack the ability to effectively communicate their capabilities and limitations as conversational partners

An Architecture for a Virtual Librarian
- Domain of interest: the Andrew Heiskell Braille and Talking Book Library
- Ability to browse and order books by phone (there are 70,000 of them)
- Callers have relatively disfluent speech
- Anticipate poor recognizer performance

The CMU Olympus Framework
- A freely available, actively developed, open source collection of dialog system components
- Origins in the earlier Communicator project

The Olympus Architecture
- Pipeline format: subsequent layers increase abstraction
- Signals to words, words to concepts, concepts to actions

Detail: Hub Architecture

Deployed (and almost deployed) Olympus Systems

| System | Domain | Users | Interaction | Vocab |
|---|---|---|---|---|
| Let's Go Public | Pittsburgh bus route information | General public | Information access, system initiative, background noise | 2,000 words |
| TeamTalk | Robot coordination and control (treasure hunting) | Grad students / researchers | Multiparticipant command and control | 500 words |
| CheckItOut | Virtual librarian for the Andrew Heiskell Library | Elderly, vision impaired | Information access, mixed initiative | Variable, 10,000 |

Speech recognition

Why ASR is Difficult for SDS
- An SDS must accommodate variability in:
  - Environments: background noise, cell phone interference, VOIP
  - Speech production: disfluency, false starts, filled pauses, repeats, corrections, accent, age, gender, differences between human-human and human-machine speech
  - The caller's technological familiarity with dialog systems in general, and with a particular SDS's capabilities
    and constraints; callers often use OOV / out-of-domain concepts

The Sphinx Open Source Recognition Toolkit
- Pocketsphinx
  - Pocketsphinx is efficient; it runs on embedded devices
- Continuous speech, speaker independent recognition system
- Includes tools for language model compilation, pronunciation, and acoustic model adaptation
- Provides word level confidence annotation and n-best lists
- Olympus supports parallel decoding engines / models
  - Typically runs parallel acoustic models for male and female speech
- http://cmusphinx.sourceforge.net

Language and Acoustic Models
- Sphinx supports statistical, class, and state based language models
  - Statistical language models assign n-gram probabilities to word sequences
  - Class based models assign probabilities to collections of terminals, e.g. "I would like to read [book]"
  - State based LM switching: limit the perplexity of the language model by constraining it to anticipated words (confirmation / rejection, help, address, books)
- Olympus includes permissive-license WSJ acoustic models (read speech) for male and female speech at 8 kHz and 16 kHz bandwidth
- Tools for acoustic adaptation

ASR introduces uncertainty
- SDS architectures always operate on partial information
  - Managing that uncertainty is one of the main challenges
- How you say it often conveys as much information as what is said
  - Prosody, intonation, amplitude, duration
- Moving from an acoustic signal to a lexical representation implies information loss
- Information provided to downstream components:
  - A lexical representation of the speech signal, with acoustic confidence and language model fit scores
  - An N-best list

Spoken Language Understanding
From words to concepts
- SLU: the task of extracting meaning from utterances
  - Dialog acts (the overall intent of an utterance)
  - Domain specific concepts: frame / slots
- Challenge for the library domain: the words in the 70k titles cover a subset of conversational English
  - Vocabulary confusability
  - Very difficult under noisy conditions
- Does the library have Hitchhiker's Guide to the Galaxy by Douglas
  Adams on audio cassette?
  - Dialog act: Book Request
  - Title: The Hitchhiker's Guide to the Galaxy
  - Author: Douglas Adams
  - Media: Audio Cassette

SLU Challenges faced by SDS
- There are many, many possible ways to say the same thing
  - How can SDS designers anticipate all of them?
- SLU can be greatly simplified by constraining what the user can say (and how they can say it)
  - But this results in a less habitable, clunky conversation. Who wants to chat with a system like that?
- Recognizer error and background noise, resulting in indels (insertions and deletions)
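
The frame for the Hitchhiker's Guide request, and the two challenges just listed, can be made concrete with a small sketch. This is only an illustration under stated assumptions, not how Olympus or CheckItOut implements SLU: the names `TITLES`, `MEDIA`, `REQUEST`, `Frame`, and `parse` are hypothetical, the catalog has three made-up entries instead of 70,000, and Python's standard `difflib` stands in for whatever matching a real system would use. A handful of hand-written carrier phrases shows how quickly pattern coverage runs out, and fuzzy title matching shows one way to tolerate small insertions and deletions from the recognizer.

```python
import difflib
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical mini-catalog; the real library holds about 70,000 titles.
TITLES = [
    "The Hitchhiker's Guide to the Galaxy",
    "The Restaurant at the End of the Universe",
    "A Tale of Two Cities",
]
MEDIA = ("audio cassette", "braille")  # assumed media types

# A few hand-written carrier phrases. Real callers use many more, which is
# the coverage problem described above.
REQUEST = re.compile(
    r"^(?:does the library have|do you have|i would like to read)\s+(.+)$"
)


@dataclass
class Frame:
    dialog_act: str
    title: Optional[str] = None
    author: Optional[str] = None
    media: Optional[str] = None


def normalize(text: str) -> str:
    """Lowercase and drop everything except letters, apostrophes, spaces."""
    return re.sub(r"[^a-z' ]", "", text.lower()).strip()


# Map normalized titles back to their display form.
TITLE_INDEX = {normalize(t): t for t in TITLES}


def parse(utterance: str, cutoff: float = 0.6) -> Frame:
    """Fill a dialog-act frame from one recognized utterance."""
    match = REQUEST.match(normalize(utterance))
    if match is None:
        return Frame(dialog_act="Unknown")
    frame = Frame(dialog_act="BookRequest")
    rest = match.group(1)
    # Media slot: a trailing "on <medium>" phrase.
    for medium in MEDIA:
        suffix = " on " + medium
        if rest.endswith(suffix):
            frame.media = medium
            rest = rest[: -len(suffix)]
            break
    # Author slot: whatever follows the last " by " (naive: a title that
    # itself contains " by " would be split in the wrong place).
    if " by " in rest:
        rest, frame.author = rest.rsplit(" by ", 1)
    # Title slot: closest catalog title, tolerating small indels.
    close = difflib.get_close_matches(rest, list(TITLE_INDEX), n=1, cutoff=cutoff)
    if close:
        frame.title = TITLE_INDEX[close[0]]
    return frame


if __name__ == "__main__":
    print(parse("Does the library have Hitchhikers Guide to the Galaxy "
                "by Douglas Adams on audio cassette?"))
```

Lowering `cutoff` makes the title match more forgiving of recognizer errors but raises the risk of confusing similar titles, which is the vocabulary confusability problem at small scale. An utterance that matches none of the carrier phrases falls through to an `Unknown` dialog act, which a dialog manager would have to handle, for example by asking the caller to rephrase.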