UCI ICS 227 - Ten Myths of Multimodal Interaction


Communications of the ACM, November 1999, Vol. 42, No. 11

Moving from traditional interfaces toward interfaces offering users greater expressive power, naturalness, and portability.

Multimodal systems process combined natural input modes (such as speech, pen, touch, hand gestures, eye gaze, and head and body movements) in a coordinated manner with multimedia system output. These systems represent a new direction for computing that draws from novel input and output technologies currently becoming available. Since the appearance of Bolt's [1] "Put That There" demonstration system, which processed speech in parallel with manual pointing, a variety of multimodal systems has emerged. Some rudimentary ones process speech combined with mouse pointing, such as the early CUBRICON system [8]. Others recognize speech while determining the location of pointing from users' manual gestures or gaze [5].

Recent multimodal systems now recognize a broader range of signal integrations, which are no longer limited to the simple point-and-speak combinations handled by earlier systems. For example, the Quickset system integrates speech with pen input that includes drawn graphics, symbols, gestures, and pointing. It uses a semantic unification process to combine the meaningful multimodal information carried by two input signals, both of which are rich and multidimensional. Quickset also uses a multiagent architecture and runs on a handheld PC [3]. Figure 1 illustrates Quickset's response to the multimodal command "Airstrips... facing this way, facing this way, and facing this way," which was spoken while the user drew arrows placing three airstrips in correct orientation on a map.

Multimodal systems represent a research-level paradigm shift away from conventional windows-icons-menus-pointers (WIMP) interfaces toward providing users with greater expressive power, naturalness, flexibility, and portability.
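The semantic unification idea mentioned above can be illustrated as feature-structure merging: each recognizer emits a partial meaning frame, and the frames combine into one command when their shared slots agree. The sketch below is a toy illustration under assumed slot names and values, not Quickset's actual implementation.

```python
def unify(frame_a, frame_b):
    """Merge two partial meaning frames; return None if any shared slot conflicts."""
    merged = dict(frame_a)
    for slot, value in frame_b.items():
        if slot in merged and merged[slot] != value:
            return None  # incompatible values -> unification fails
        merged[slot] = value
    return merged

# Speech supplies the object type and orientation; the pen gesture supplies
# the map location. All slot names and values here are hypothetical.
speech_frame = {"action": "create", "object": "airstrip", "orientation": 45}
pen_frame = {"action": "create", "location": (312, 88)}

command = unify(speech_frame, pen_frame)
# command now holds the fused frame with all four slots filled
```

The key property this captures is that neither signal alone specifies the full command: unification succeeds only when the two partial interpretations are mutually consistent.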
Well-designed multimodal systems integrate complementary modalities to yield a highly synergistic blend in which the strengths of each mode are capitalized upon and used to overcome weaknesses in the other. Such systems potentially can function more robustly than unimodal systems that involve a single recognition-based technology such as speech, pen, or vision.

Sharon Oviatt

Systems that process multimodal input also aim to give users better tools for controlling the sophisticated visualization and multimedia output capabilities that already are embedded in many systems. In contrast, keyboard and mouse input are relatively limited and impoverished, especially when interacting with virtual environments, animated characters, and the like. In the future, more balanced systems will be needed in which powerful input and output capabilities are better matched with one another.

As a new generation of multimodal systems begins to define itself, one dominant theme will be the integration and synchronization requirements for combining different modes strategically into whole systems. The computer science community is just beginning to understand how to design well-integrated and robust multimodal systems. The development of such systems will not be achievable through intuition alone. Rather, it will depend on knowledge of the natural integration patterns that typify people's combined use of different input modes. This means that the successful design of multimodal systems will require guidance from cognitive science on the coordinated human perception and production of natural modalities.
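One concrete form the integration and synchronization requirement takes is deciding which time-stamped events from different modes belong to the same command. A minimal sketch, assuming a simple fixed temporal window; the window size, event format, and sample values are illustrative assumptions, not figures from the article.

```python
from dataclasses import dataclass

@dataclass
class InputEvent:
    mode: str        # e.g., "speech" or "pen" (hypothetical labels)
    payload: str
    t_start: float   # seconds
    t_end: float

def pair_events(speech_events, pen_events, window=1.0):
    """Pair each speech event with pen events whose onset falls within
    `window` seconds of the speech interval."""
    pairs = []
    for s in speech_events:
        for p in pen_events:
            if s.t_start - window <= p.t_start <= s.t_end + window:
                pairs.append((s, p))
    return pairs

speech = [InputEvent("speech", "facing this way", 2.0, 3.1)]
pen = [InputEvent("pen", "arrow@(312,88)", 2.4, 2.9),
       InputEvent("pen", "arrow@(500,40)", 9.5, 9.9)]

# Only the nearby arrow falls inside the window; the late one is excluded.
pairs = pair_events(speech, pen, window=1.0)
```

Real systems need subtler policies than a fixed window, since (as the article's later discussion of temporal synchrony suggests) users' cross-modal timing patterns vary, but the windowing idea shows why accurate time-stamping of every input event is a prerequisite for integration.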
In this respect, multimodal systems can flourish only through multidisciplinary cooperation, as well as through teamwork among those with expertise in individual component technologies.

Multimodal Interaction: Separating Myth from Empirical Reality

In this article, 10 myths about multimodal interaction are identified as currently fashionable among computationalists and are discussed from the perspective of contrary empirical evidence. Current information about multimodal interaction is summarized from research on multimodal human-computer interaction, and from the linguistics literature on natural multimodal communication. In the process of uncovering misconceptions associated with each myth, information is highlighted on multimodal integration patterns and their temporal synchrony, the information carried by different input modes, the processability of users' multimodal language, differences among users in multimodal integration patterns, and the reliability and other general advantages of multimodal system design. This state-of-the-art information is designed to replace popularized myths with a more accurate foundation for guiding the design of next-generation multimodal systems.

Myth #1: If you build a multimodal system, users will interact multimodally. Users have a strong preference to interact multimodally rather than unimodally, although this preference is most pronounced in spatial application domains [10]. For example, 95% to 100% of users preferred to interact multimodally when they were free to use either speech or pen input in a spatial domain [10]. However, just because users prefer to interact multimodally is no guarantee that they will issue every command to a system multimodally.
Instead, they typically intermix unimodal and multimodal expressions. In a recent study, users' commands were expressed multimodally 20% of the time, with the rest just spoken or written [12].

Predicting whether a user will express a command multimodally also depends on the type of action they are performing. In particular, they almost always express commands multimodally when describing spatial information about the location, number, size, orientation, or shape of an object. In the data summarized in Figure 2, users issued multimodal commands 86% of the time when they had to add, move, modify, or calculate the distance between objects on a map in a way that required specifying spatial locations [12]. They also were moderately likely to interact multimodally when selecting an object from a larger array, for example, when deleting a particular object from the map. However, when performing general actions without any spatial component, such as printing a map, users rarely expressed themselves multimodally: less than 1% of the time [12].

To summarize, users like being able to interact multimodally, but they don't always do so. Their natural communication

