BYU CS 656 - UniversalSpeech-rosenfeld-olsen


Universal Speech Interfaces

Ronald Rosenfeld, School of Computer Science, Carnegie Mellon University
Dan Olsen, Brigham Young University
Alex Rudnicky, School of Computer Science, Carnegie Mellon University

interactions, november + december 2001
Image © Brook Trout Studio/Images.com, Inc.

In recent years speech recognition has become commercially viable on off-the-shelf computers, a goal that has long been sought by both the research community and by prospective users. Anyone who has used speech recognition technology understands that it has many flaws and much remains to be done. Uncertainty exists about how speech can and should be used, as well as how recognition algorithms and tools for speech-based applications should be developed. Achieving reliable, accurate speech recognition is similar to building an inexpensive mouse and keyboard. The underlying input technology is available, but the question of how to build the application interface remains. We have been considering these problems for some time [8]. In this paper we present some of our thoughts about the future of speech-based interaction. This is not a report of results we have obtained, but rather a vision of a future to be explored.

Speech Recognition

From the inception of speech recognition research, achieving “natural” interaction with computers has been cited as the primary benefit. The concept of talking with a machine as fluently and comfortably as with another human being has attracted funding and interest. We wanted to create HAL from 2001: A Space Odyssey with a gentler personality. Naturalness of communication, however, is not the only goal of speech recognition, nor is naturalness reserved for speech. Based on the idea that “a picture is worth a thousand words,” graphical user interfaces (GUI) have been developed purposefully for their naturalness. If naturalness is not the key driver for speech, what is? We think there are at least three fundamental advantages for speech.

1. Speech is an ambient medium rather than an attentional one. Visual activity requires our focused attention, whereas speech allows us to interact while using our other faculties (e.g., visual, sensory-motor) to do something else.
2. Speech is descriptive rather than referential. When we speak, we describe objects by their roles and attributes. In visual situations we point to or grasp objects of interest. For this reason, speech and pointing are to a large extent complementary and can often successfully be combined.
3. Speech requires modest physical resources. Speech-based interaction can be scaled down to much smaller and much cheaper form factors than can visual or manual modalities.

We see a future for speech, not so much for its naturalness but for its ubiquity. Natural-language interfaces are still an interesting and important topic, but speech interaction is both more and less than that. For this reason we choose to focus on speech as an input-output modality rather than as a medium for natural language. We see interactive systems as mechanisms for humans to express needs to and obtain services from machines. By machines we mean not only computers in the popular sense, but also any gadget, appliance, or automated service that, in order to be fully used, must be reconfigured, controlled, queried, or otherwise communicated with. We are surrounded by dozens of such machines today.
The exponential drop in the fundamental cost of computing will cause hundreds more to be developed in the near future.

Examples of such interactivity include:

✱ Configuring and using home appliances (videocassette recorders, microwave and convection ovens, radios, alarms);
✱ Configuring and using office machines (fax machines, copiers, telephones);
✱ Retrieving public information (e.g., weather, news, flight schedule, stock quotes);
✱ Retrieving and manipulating private information (e.g., bank or other accounts, personal scheduler, contact manager, other private databases);
✱ Handling asynchronous communication (voice, e-mail, fax); and
✱ Controlling miscellaneous user and consumer applications (map following, form filling, Web navigation).

To clarify our ideas, we distinguish between intelligent machines and simple machines. Suppose that you had a household robot like that envisioned by Isaac Asimov. While seated at the table, you might hand a dirty dish to the robot and say, “Take care of this.” This is a natural statement that most children would understand. The desire is to have the dish taken to the kitchen, rinsed, and placed in the dishwasher. The naturalness of this communication derives from the large amount of world knowledge that the robot must possess to infer the correct response. This is not so much an interaction problem as it is a natural language and inferential reasoning problem. This problem is not necessarily related to speech. A more natural form of communication would be to hand the dirty dish to the robot and then expect the right thing to happen without any speech at all. We would consider this an intelligent machine.

For speech interfaces we focus on simple machines. With a simple machine the user can, at least in principle, get a mental model of the machine’s capabilities and of the machine’s rough state. Moreover, the user is assumed to already know what is desired, although he or she does not have to know how to get it done.
Under this paradigm, high-level intelligent problem solving is performed by the human; the machine is only a tool for getting needed information, modifying it, or issuing instructions to the desired service. We find speech interaction with simple machines to be interesting because it sidesteps the artificial intelligence problems of natural language in favor of the ubiquitous nature of speech interaction. We believe that in the future human beings will be surrounded by hundreds if not thousands of simple machines with which they will want to interact.

In particular, the approach we are proposing is not aimed at applications requiring truly intelligent communication. For example, an air travel reservation system will fall within our focus only if the machine is used to consult flight schedules and fares and to book flights and the user plans and makes decisions. The machine acts as a passive travel agent that does not do much thinking on its own but mostly carries out the explicit requests of the user. The

