Penn CIS 700 - AMPLITUDE CONVERGENCE IN CHILDREN'S CONVERSATIONAL SPEECH WITH ANIMATED PERSONAS

AMPLITUDE CONVERGENCE IN CHILDREN’S CONVERSATIONAL SPEECH WITH ANIMATED PERSONAS

Rachel Coulston, Sharon Oviatt and Courtney Darves
Department of Computer Science and Engineering
Oregon Health & Science University
+1-503-748-1602; {rachel|oviatt|court}@cse.ogi.edu
http://www.cse.ogi.edu/CHCC

ABSTRACT

During interpersonal conversation, both children and adults adapt the basic acoustic-prosodic features of their speech to converge with those of their conversational partner. In this study, 7-to-10-year-old children interacted with a conversational interface in which animated characters used text-to-speech (TTS) output to answer questions about marine biology. Analysis of children’s speech to different animated characters revealed a 29% average change in energy when they spoke to an extroverted, loud software partner (E) compared with an introverted, soft-spoken one (I). The majority of children (77%) adapted their amplitude toward their partner’s TTS voice. These adaptations were bi-directional, with increases in amplitude observed during I-to-E condition shifts and decreases during E-to-I shifts. Finally, these results generalized across different user groups and TTS voices. Implications are discussed for guiding children’s speech to remain within system processing bounds, and for the future development of robust and adaptive conversational interfaces.

1. INTRODUCTION

Communication Accommodation Theory (CAT) describes interpersonal conversation as a dynamic, adaptive exchange in which a person’s spoken language is tailored to their interlocutor in fundamental ways. Both children and adults will accommodate the spoken language of a conversational partner, including its basic acoustic-prosodic features such as amplitude, pitch and duration [1, 6, 14]. However, research on human-computer interaction has yet to investigate whether users likewise adapt their speech during conversational interaction with a computer partner.
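Amplitude convergence of the kind described above is typically quantified from the energy of each utterance. As a minimal sketch (not the authors' actual measurement pipeline), per-utterance energy can be summarized as RMS amplitude and compared across partner conditions:

```python
import numpy as np

def rms_amplitude(samples):
    """Root-mean-square amplitude of one utterance's audio samples."""
    x = np.asarray(samples, dtype=float)
    return float(np.sqrt(np.mean(x ** 2)))

def percent_change(baseline, adapted):
    """Relative change in energy between two conditions, in percent."""
    return 100.0 * (adapted - baseline) / baseline
```

With per-condition mean RMS values in hand, `percent_change` gives a figure directly comparable to the relative energy change reported in the abstract; the function names here are illustrative, not taken from the paper.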
In the present study, we explore whether children adapt their speech amplitude while conversing with animated characters that respond with different types of TTS voices during an educational exchange. In a companion paper, we also present results on children’s adaptation of speech duration--in particular, dialogue response latencies--as they converge with those of a software partner [4]. One goal of this work is the modeling of children’s speech, which is well known to produce substantially higher recognition error rates than adult speech, typically by a factor of two to five [11, 15]. Children’s speech is currently difficult to process because it is more disfluent, more variable in acoustic-prosodic features, and changing developmentally [9, 16]. Children also can be shy and hard to engage in conversation, such that they are reluctant to speak at all and low in volume when they do. As a result, the development of future conversational interfaces for children will require specialized design strategies that engage child users, and that guide their speech to be audible and processable by a speech recognition system. Recent work has made progress in designing educational software that effectively engages child users [2, 5], and even increases the amount of language they direct to a conversational interface [3]. In this research, we explore whether the TTS voice of an animated character can be used to entrain children’s speech to be higher or lower in amplitude, which would be expected to influence the robustness of a conversational system.

2. GOALS OF THE STUDY

Since TTS acoustic parameters can be controlled precisely, their manipulation provides a unique opportunity to study the dynamics of amplitude accommodation. For this research, introvert and extrovert TTS voices were used, in part because their acoustic-prosodic features are well defined and have been used in previous research.
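A contrasting voice pair of this kind amounts to a small bundle of TTS parameters. The sketch below shows one way such profiles could be represented; all numeric values are hypothetical, chosen only to satisfy the louder/faster/higher-pitch contrast between extrovert and introvert speech, and are not the study's actual settings:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VoiceProfile:
    """TTS settings distinguishing the two personas (values illustrative)."""
    volume_db: float           # output amplitude
    rate_wpm: int              # speaking rate, words per minute
    mean_pitch_hz: float       # average F0
    pitch_range_hz: float      # F0 excursion
    response_latency_s: float  # pause before answering

# Extrovert: louder, faster, higher and wider pitch, quicker to respond.
EXTROVERT = VoiceProfile(volume_db=66.0, rate_wpm=180,
                         mean_pitch_hz=220.0, pitch_range_hz=90.0,
                         response_latency_s=0.5)

# Introvert: softer, slower, lower and narrower pitch, slower to respond.
INTROVERT = VoiceProfile(volume_db=56.0, rate_wpm=140,
                         mean_pitch_hz=180.0, pitch_range_hz=45.0,
                         response_latency_s=1.5)
```

Because every parameter is explicit, swapping one profile for the other part way through a session (as in the condition shifts studied here) is a single, precisely controlled change.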
In contrast to an introvert voice profile, extrovert speech typically is louder and faster in rate, exhibits higher pitch and a wider pitch range, and has shorter dialogue response latencies [8, 13]. The specific goals of this study were to:

• Examine whether children’s speech amplitude is influenced by an animated software partner’s TTS output during conversational interaction
• Determine whether their amplitude readapts dynamically if a contrasting computer voice is introduced part way through an interaction
• Assess whether amplitude adaptation is bi-directional, increasing and decreasing in accord with the TTS amplitude heard
• Evaluate the magnitude of amplitude adaptation
• Establish the generality of any adaptation effects across different user groups and TTS voices

3. METHODS

3.1. Participants, Task, and Procedure

Twenty-four elementary-school children participated in this study as paid volunteers. Participants ranged in age from 7 years, 6 months to 10 years, 2 months, and were gender balanced. The study was conducted at a local elementary school.

Children participating in the study were introduced to Immersive Science Education for Elementary kids (I SEE!). Figure 1 illustrates the I SEE! interface. I SEE! is an application in which children could use speech, pen, or multimodal input while conversing with animated software characters to learn about marine biology. The marine animals were available as conversational partners who answered questions about themselves using text-to-speech (TTS) output. A “Spin the dolphin” character (lower right of Figure 1) also was available to answer questions and provide help (e.g., spelling, using the system) and entertainment (e.g., telling jokes). Before starting a session, each child received instructions and practice with a science teacher on how to use the I SEE! interface on a small hand-held computer.
Then the teacher left, and the child spent approximately one hour alone in a quiet classroom playing with the educational software. During this time, he or she conversed with 24 marine animals (e.g., lobster, as shown in Figure 1), which were organized into three task sets of eight animals apiece. During data collection, children’s input was received by an informed assistant who interpreted their queries and provided system responses as part of a simulation method, although children believed they were interacting with a fully-functional system. The simulation environment ran on a PC, and received input from a Fujitsu Stylistic 2300 that was used by the children. Details of the

