Using Visual Speech for Training Chinese Pronunciation: An In-vivo Experiment

Ying Liu¹, Dominic W. Massaro², Trevor H. Chen², Derek Chan¹, and Charles Perfetti¹

¹ Learning Research and Development Center, University of Pittsburgh, Pittsburgh, PA 15260, U.S.A.
² Department of Psychology, Perceptual Science Laboratory, University of California, Santa Cruz, Santa Cruz, CA 95060, U.S.A.

Abstract

Recent research has shown that our perception and understanding of speech are influenced by a speaker's facial expressions and accompanying gestures, as well as by the actual sound of the speech. Perceivers expertly use these multiple sources of information to identify and interpret the language input. Baldi® is a three-dimensional animated talking head appropriately aligned with either synthesized or natural speech. The present in-vivo experiment used Bao, a Chinese version of Baldi, to teach Chinese syllables to adult native speakers of English. The results showed that students trained with Baldi improved more than students trained with ordinary speech. Advantages of the Baldi pedagogy and technology include the popularity and proven effectiveness of computers and embodied conversational agents, the perpetual availability of the program, and individualized instruction. The technological edge of Baldi holds great promise for language learning, dialog, human-machine interaction, education, and edutainment.

1. Introduction

In recent years, studies have demonstrated that proper use of multi-modal input can facilitate learning [1]. Current technology makes it possible for language learning to take advantage of the visual modality. The Perceptual Science Laboratory at the University of California, Santa Cruz (PSL-UCSC) aims to create embodied computer-animated agents that produce accurate auditory and visible speech, as well as realistic facial expressions, emotions, and gestures. Such agents have tremendous potential to benefit virtually all individuals in learning speech and language.

Our talking head, Baldi®*, has been used as a vocabulary tutor for children with language challenges, including hearing-challenged and autistic children. Baldi has also been used for speech training of both hard-of-hearing children and adults learning a second language. The animated characters that we are developing have also been used to train autistic children to "read" visible speech and to recognize emotions in the face and voice [2]. (* Baldi is a trademark of Dominic W. Massaro.)

There have been several approaches to facial animation, including muscle models that simulate the muscles and tissues during talking [3], performance-based synthesis that tracks a live speaker [4], and image synthesis that combines images of a real speaker [5][6]. The facial animation used in the current applications, however, is a descendant of Parke's software and his particular 3-D talking head [7]. Modifications include increased resolution of the underlying wireframe model; additional and modified control parameters that have been tuned to agree with measurements of natural talkers; a realistic tongue trained on electropalatography and ultrasound data; a tested coarticulation model; paralinguistic information and affect in the face; alignment with either natural speech or text-to-speech synthesis; and real-time bimodal (auditory/visual) synthesis on a commodity personal computer.

Most of the parameters move vertices (and the polygons formed from these vertices) on the face by geometric functions such as rotation (e.g., jaw rotation) or translation of the vertices in one or more dimensions (e.g., lower and upper lip height, mouth widening). Other parameters work by scaling and interpolating different facial subareas. Many of the facial shape parameters, such as cheek, neck, or forehead shape, and some affect parameters such as smiling, use interpolation.
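To make this parameter scheme concrete, here is a minimal sketch of parameter-driven mesh deformation. This is not the Baldi implementation; the function names, the dictionary layout, and the three parameter kinds are illustrative assumptions based only on the description above.

```python
import numpy as np

def rotate_vertices(vertices, pivot, axis, angle):
    """Rotate vertices (N x 3) about an arbitrary axis through a pivot
    point, using Rodrigues' rotation formula."""
    axis = axis / np.linalg.norm(axis)
    v = vertices - pivot
    cos_a, sin_a = np.cos(angle), np.sin(angle)
    rotated = (v * cos_a
               + np.cross(axis, v) * sin_a
               + axis * (v @ axis)[:, None] * (1.0 - cos_a))
    return rotated + pivot

def apply_parameter(mesh, param, value):
    """Apply one control parameter (scaled by its current value) to the
    subset of mesh vertices it governs."""
    idx = param["vertex_indices"]           # vertices this parameter moves
    if param["kind"] == "rotation":         # e.g., jaw rotation
        mesh[idx] = rotate_vertices(mesh[idx], param["pivot"],
                                    param["axis"], value)
    elif param["kind"] == "translation":    # e.g., upper/lower lip height
        mesh[idx] = mesh[idx] + value * param["direction"]
    elif param["kind"] == "interpolation":  # e.g., cheek shape, smiling
        # Blend the subarea between its neutral and target shapes.
        mesh[idx] = (1.0 - value) * param["neutral"] + value * param["target"]
    return mesh

# Example: open the jaw by 0.15 rad (hypothetical indices, pivot, and axis).
mesh = np.random.rand(500, 3)
jaw = {"kind": "rotation", "vertex_indices": np.arange(100),
       "pivot": np.array([0.0, 0.0, 0.0]), "axis": np.array([1.0, 0.0, 0.0])}
mesh = apply_parameter(mesh, jaw, 0.15)
```

A real synthesizer would re-evaluate every parameter each frame against a stored neutral face rather than mutating the mesh cumulatively; the sketch only isolates the three geometric operations the text names.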
Phonemes are used as the unit of visible speech synthesis. Any utterance can be represented as a string of successive phonemes, and each phoneme is represented as a set of target values for the control parameters, such as jaw rotation and mouth width. Because speech production is a continuous process involving movements of different articulators (e.g., tongue, lips, jaw) that have both mass and inertia, phoneme utterances are influenced by the context in which they occur. This so-called coarticulation is implemented in the synthesis by dominance functions, which determine, independently for each control parameter over time, how much weight a segment's target value carries against those of neighboring segments [8]. In a test of several coarticulation models, Beskow [9] found that our model gave the best fit to the observed articulatory data.
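The dominance-function idea can be sketched as a time-varying weighted average, computed separately for each control parameter. The negative-exponential form below follows the coarticulation model cited as [8]; the specific constants and the segment layout are illustrative assumptions, not measured values.

```python
import numpy as np

def dominance(t, center, alpha, theta, c=1.0):
    """Negative-exponential dominance of one segment, peaking at the
    segment's temporal center and decaying with distance from it."""
    return alpha * np.exp(-theta * np.abs(t - center) ** c)

def blended_track(t, segments):
    """Dominance-weighted average of segment targets for ONE control
    parameter; each parameter gets its own targets and dominances."""
    num = np.zeros_like(t)
    den = np.zeros_like(t)
    for seg in segments:
        d = dominance(t, seg["center"], seg["alpha"], seg["theta"])
        num += d * seg["target"]
        den += d
    return num / den

# Illustrative /u s u/ sequence for a lip-rounding parameter: the weakly
# dominant /s/ lets the rounding of the surrounding vowels spread into it.
t = np.linspace(0.0, 0.6, 200)                                     # seconds
segments = [
    {"center": 0.10, "target": 0.8, "alpha": 1.0, "theta": 20.0},  # /u/
    {"center": 0.30, "target": 0.1, "alpha": 0.3, "theta": 20.0},  # /s/
    {"center": 0.50, "target": 0.8, "alpha": 1.0, "theta": 20.0},  # /u/
]
track = blended_track(t, segments)  # smooth, context-sensitive trajectory
```

Because the dominance constants are free per segment and per parameter, a segment that is unspecified for some parameter receives low dominance there, so its neighbors' targets carry through it; this is how a dominance scheme yields the context effects described above.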
We evaluate the accuracy and intelligibility of Baldi's synthetic visible speech with perceptual recognition tests given to human observers [10]. These experiments aim at evaluating the speech intelligibility of the visible speech synthesis relative to natural speech. The goal of the evaluation is to learn how the synthetic visual talker falls short of natural talkers and to modify the synthesis accordingly to bring it more in line with natural visible speech. The intelligibility of Baldi's visible speech has been successively improved across a number of studies, although overall it still falls short of that of a good natural talker [11].

The present study is part of a larger foreign-language learning project associated with the Pittsburgh Science of Learning Center (PSLC). In a previous study, we carried out an experiment to test the effect of Baldi in a laboratory environment [12]. The present experiment, including both training and testing, was carried out online. We tested the effectiveness of Baldi vs. a human face vs. voice only in an in-vivo experiment.

2. Method

2.1 Participants

One hundred and one students in an introductory Chinese course at Carnegie Mellon University participated in this study. They logged into the web-based learning program to participate as a course requirement, which accounted for 5 percent of their final grade.

2.2 Procedure and Materials

Each participant was randomly assigned to one of three training conditions. Each training condition consisted of two sessions of training and two sessions of testing. In the first session, all participants received audio-only training by hearing a native Mandarin speaker (a female speaking a Beijing dialect) pronounce 23 Mandarin syllables with the Pinyin spelling presented on the screen.