Trainable Videorealistic Speech Animation

Tony Ezzat∗   Gadi Geiger†   Tomaso Poggio‡
Center for Biological and Computational Learning
Massachusetts Institute of Technology
∗e-mail: [email protected]   †e-mail: [email protected]   ‡e-mail: [email protected]

Abstract

We describe how to create with machine learning techniques a generative, videorealistic, speech animation module. A human subject is first recorded using a videocamera as he/she utters a predetermined speech corpus. After processing the corpus automatically, a visual speech module is learned from the data that is capable of synthesizing the human subject's mouth uttering entirely novel utterances that were not recorded in the original video. The synthesized utterance is re-composited onto a background sequence which contains natural head and eye movement. The final output is videorealistic in the sense that it looks like a video camera recording of the subject. At run time, the input to the system can be either real audio sequences or synthetic audio produced by a text-to-speech system, as long as they have been phonetically aligned.

The two key contributions of this paper are 1) a variant of the multidimensional morphable model (MMM) to synthesize new, previously unseen mouth configurations from a small set of mouth image prototypes; and 2) a trajectory synthesis technique based on regularization, which is automatically trained from the recorded video corpus, and which is capable of synthesizing trajectories in MMM space corresponding to any desired utterance.

CR Categories: I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism—Animation; I.2.10 [Artificial Intelligence]: Vision and Scene Understanding—Video Analysis; I.2.10 [Artificial Intelligence]: Vision and Scene Understanding—Motion

Keywords: facial modeling, facial animation, morphing, optical flow, speech synthesis, lip synchronization

1 Overview

Is it possible to record a human subject with a video camera, process the recorded data automatically, and then re-animate that subject uttering entirely novel utterances which were not included in the original corpus? In this work, we present such a technique for achieving videorealistic speech animation.

We choose to focus our efforts in this work on the issues related to the synthesis of novel video, and not on novel audio synthesis. Thus, novel audio needs to be provided as input to our system. This audio can be either real human audio (from the same subject or a different subject) or synthetic audio produced by a text-to-speech system. All that is required by our system is that the audio be phonetically transcribed and aligned. In the case of synthetic audio from TTS systems, this phonetic alignment is readily available from the TTS system itself [Black and Taylor 1997]. In the case of real audio, publicly available phonetic alignment systems [Huang et al. 1993] may be used.

[Figure 1: Some of the synthetic facial configurations output by our system.]

Our visual speech processing system is composed of two modules. The first module is the multidimensional morphable model (MMM), which is capable of morphing between a small set of prototype mouth images to synthesize new, previously unseen mouth configurations. The second component is a trajectory synthesis module, which uses regularization [Girosi et al. 1993; Wahba 1990] to synthesize smooth trajectories in MMM space for any specified utterance.
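To make the morphing operation concrete, the following is a minimal Python sketch of morphable-model-style synthesis. It assumes grayscale prototype images together with precomputed dense correspondence (optical flow) fields from a common reference frame; the function name, the array layout, and the simple backward-warping scheme are illustrative assumptions, not the paper's exact MMM formulation.

```python
# Minimal sketch of morphable-model-style image synthesis (a hypothetical
# simplification of the MMM; names and warping details are assumptions).
import numpy as np
from scipy.ndimage import map_coordinates

def synthesize_mouth(prototypes, flows, alpha, beta):
    """Blend prototype mouth images along blended correspondence fields.

    prototypes: (N, H, W) grayscale prototype mouth images
    flows:      (N, 2, H, W) dense flow from a reference frame to each prototype
    alpha:      (N,) shape (flow) blending weights, summing to 1
    beta:       (N,) appearance (texture) blending weights, summing to 1
    """
    n, h, w = prototypes.shape
    # Shape: linear combination of the prototype flow fields.
    flow = np.tensordot(alpha, flows, axes=1)          # (2, H, W)
    # Appearance: linear combination of the prototype textures.
    texture = np.tensordot(beta, prototypes, axes=1)   # (H, W)
    # Warp the blended texture along the blended flow (backward warping).
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    coords = np.stack([ys + flow[0], xs + flow[1]])
    return map_coordinates(texture, coords, order=1, mode='nearest')
```

In this simplified view, the alpha weights act as shape parameters and the beta weights as appearance parameters, which mirrors the shape/appearance split the MMM is built around.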
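Likewise, here is a minimal sketch of regularization-based trajectory synthesis, under the assumption that each frame has a target point in MMM space (for example, a phoneme-dependent target) and that smoothness is enforced with a second-difference penalty. The quadratic objective and the fixed weight lam below are illustrative; they stand in for the learned formulation described next.

```python
# Minimal sketch of regularized trajectory synthesis (assumed objective:
# per-frame targets plus a curvature penalty; the paper's trained
# formulation may differ).
import numpy as np

def smooth_trajectory(targets, lam=10.0):
    """Solve min_x ||x - targets||^2 + lam * ||D2 x||^2 per MMM dimension.

    targets: (T, K) per-frame target points in MMM parameter space
    lam:     smoothness weight (illustrative; not a value from the paper)
    Returns the (T, K) smoothed trajectory.
    """
    t = len(targets)
    # Second-difference operator penalizing curvature of the trajectory.
    d2 = np.diff(np.eye(t), n=2, axis=0)   # (T-2, T)
    # Normal equations of the regularized least-squares problem.
    a = np.eye(t) + lam * d2.T @ d2
    return np.linalg.solve(a, targets)     # solves all K dimensions at once
```

Solving the normal equations yields a trajectory that passes near each target while remaining smooth; larger values of lam trade target fidelity for smoothness.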
The parameters of the trajectory synthesis module are trained automatically from the recorded corpus using gradient descent learning.

Recording the video corpus takes on the order of 15 minutes. Processing of the corpus takes on the order of several days but, apart from the specification of the head and eye masks shown in Figure 3, is fully automatic, requiring no intervention on the part of the user. The final visual speech synthesis module consists of a small set of prototype images (46 images in the case presented here) extracted from the recorded corpus and used to synthesize all novel sequences.

Application scenarios for videorealistic speech animation include: user-interface agents for desktops, TVs, or cell phones; digital actors in movies; virtual avatars in chatrooms; very low bitrate coding schemes (such as MPEG-4); and studies of visual speech production and perception. The recorded subjects can be regular people, celebrities, ex-presidents, or infamous terrorists.

In the following section, we review the relevant prior work and motivate our approach.

2 Background

2.1 Facial Modeling

One approach to modeling facial geometry is to use 3D methods. Parke [1974] was one of the earliest to adopt such an approach, creating a polygonal facial model. To increase the visual realism of the underlying facial model, the facial geometry is frequently scanned in using Cyberware laser scanners. Additionally, a texture map of the face extracted by the Cyberware scanner may be mapped onto the three-dimensional geometry [Lee et al. 1995b]. Guenter [1998] demonstrated how 3D face geometry may be obtained from multiple photographs using photogrammetric techniques. Pighin et al. [1998] captured face geometry and textures by fitting a generic face model to a number of photographs. Blanz and Vetter [1999] demonstrated how a large database of Cyberware scans may be morphed to obtain face geometry from a single photograph.

An alternative to the 3D modeling approach is to model the talking face using image-based techniques, where the talking facial model is constructed from a collection of example images captured of the human subject. These methods have the potential of achieving very high levels of videorealism, and are inspired by the recent success of similar sample-based methods for audio speech synthesis [Moulines

