MIT 9 520 - Driving Visual Speech with Audio

ACM SIGGRAPH 97

ABSTRACT

Video Rewrite uses existing footage to automatically create new video of a person mouthing words that she did not speak in the original footage. This technique is useful in movie dubbing, for example, where the movie sequence can be modified to sync the actors' lip motions to the new soundtrack.

Video Rewrite automatically labels the phonemes in the training data and in the new audio track. Video Rewrite reorders the mouth images in the training footage to match the phoneme sequence of the new audio track. When particular phonemes are unavailable in the training footage, Video Rewrite selects the closest approximations. The resulting sequence of mouth images is stitched into the background footage. This stitching process automatically corrects for differences in head position and orientation between the mouth images and the background footage.

Video Rewrite uses computer-vision techniques to track points on the speaker's mouth in the training footage, and morphing techniques to combine these mouth gestures into the final video sequence. The new video combines the dynamics of the original actor's articulations with the mannerisms and setting dictated by the background footage. Video Rewrite is the first facial-animation system to automate all the labeling and assembly tasks required to resync existing footage to a new soundtrack.

CR Categories: I.3.3 [Computer Graphics]: Picture/Image Generation—Morphing; I.4.6 [Image Processing]: Segmentation—Feature Detection; I.3.8 [Computer Graphics]: Applications—Facial Synthesis; I.4.10 [Image Processing]: Applications—Feature Transformations.

Additional Keywords: Facial Animation, Lip Sync.

1 WHY AND HOW WE REWRITE VIDEO

We are very sensitive to the synchronization between speech and lip motions. For example, the special effects in Forrest Gump are compelling because the Kennedy and Nixon footage is lip-synched to the movie's new soundtrack.
In contrast, close-ups in dubbed movies are often disturbing due to the lack of lip sync. Video Rewrite is a system for automatically synthesizing faces with proper lip sync. It can be used for dubbing movies, teleconferencing, and special effects.

Video Rewrite automatically pieces together from old footage a new video that shows an actor mouthing a new utterance. The results are similar to the labor-intensive special effects in Forrest Gump. Those effects are successful because they start from actual film footage and modify it to match the new speech. Modifying and reassembling such footage in a smart way and synchronizing it to the new soundtrack leads to final footage of realistic quality. Video Rewrite uses a similar approach but does not require labor-intensive interaction.

Our approach allows Video Rewrite to learn from example footage how a person's face changes during speech. We learn what a person's mouth looks like from a video of that person speaking normally. We capture the dynamics and idiosyncrasies of her articulation by creating a database of video clips. For example, if a woman speaks out of one side of her mouth, this detail is recreated accurately. In contrast, most current facial-animation systems rely on generic head models that do not capture the idiosyncrasies of an individual speaker.

To model a new person, Video Rewrite requires a small number (26 in this work) of hand-labeled images. This is the only human intervention that is required in the whole process. Even this level of human interaction is not a fundamental requirement: we could use face-independent models instead [Kirby90, Covell96].

Video Rewrite shares its philosophy with concatenative speech synthesis [Moulines90]. Instead of modeling the vocal tract, concatenative speech synthesis analyzes a corpus of speech, selects examples of phonemes, and normalizes those examples.
Phonemes are the distinct sounds within a language, such as the /IY/ and /P/ in "teapot." Concatenative speech synthesis creates new sounds by concatenating the proper sequence of phonemes. After the appropriate warping of pitch and duration, the resulting speech is natural sounding. This approach to synthesis is data driven: the algorithms analyze and resynthesize sounds using little hand-coded knowledge of speech. Yet they are effective at implicitly capturing the nuances of human speech.

Video Rewrite uses a similar approach to create new sequences of visemes. Visemes are the visual counterpart to phonemes: the visually distinct mouth, teeth, and tongue articulations for a language. For example, the phonemes /B/ and /P/ are visually indistinguishable and are grouped into a single viseme class.

Video Rewrite: Driving Visual Speech with Audio
Christoph Bregler, Michele Covell, Malcolm Slaney
Interval Research Corporation, 1801 Page Mill Road, Building C, Palo Alto, CA 94304.
E-mail: [email protected], [email protected], [email protected]
See the SIGGRAPH Video Proceedings or http://www.interval.com/papers/1997-012/ for the latest animations.

Figure 1: Overview of analysis stage. Video Rewrite uses the audio track to segment the video into triphones. Vision techniques find the orientation of the head, and the shape and position of the mouth and chin in each image. In the synthesis stage, Video Rewrite selects from this video model to synchronize new lip videos to any given audio. (Figure labels: video model entries /EH-B-AA/, /IY-B-AA/, /OW-B-AA/, /AA-B-AA/; phoneme labeling; visual labeling.)

Permission to make digital/hard copy of all or part of this material for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc.
To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. © 1997 ACM.

Video Rewrite creates new videos using two steps: analysis of a training database and synthesis of new footage. In the analysis stage, Video Rewrite automatically segments the audio track of the training database into phonemes. We use these labels to segment the video track as well. We automatically track facial features in this segmented footage. The phoneme and facial labels together completely describe the visemes in the training database. In the synthesis stage, our system uses this video database, along with a new utterance. It automatically retrieves the appropriate viseme sequences, and blends …
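The phoneme-to-viseme grouping and the triphone lookup shown in Figure 1 can be sketched in a few lines. This is a toy illustration only: the viseme class names, the tiny video model, and the mismatch-count distance below are assumptions made for the example, not the paper's actual tables or matching criteria.

```python
# Toy sketch of Video Rewrite's viseme grouping and triphone lookup.
# Class names, the tiny video model, and the distance metric are
# illustrative assumptions, not the paper's actual implementation.

# Visually indistinguishable phonemes share one viseme class,
# e.g. /B/ and /P/ (and /M/) all close the lips.
VISEME_CLASS = {
    "B": "lips-closed", "P": "lips-closed", "M": "lips-closed",
    "IY": "spread", "EH": "mid-open", "AA": "open", "OW": "rounded",
}

def to_visemes(phonemes):
    """Map a phoneme sequence to its viseme-class sequence."""
    return [VISEME_CLASS[p] for p in phonemes]

def triphones(seq):
    """Overlapping windows of three units, as used to label the video model."""
    return [tuple(seq[i:i + 3]) for i in range(len(seq) - 2)]

def closest_clip(target, video_model):
    """Pick the stored triphone whose viseme labels best match the target:
    an exact match scores 0; each mismatched position adds 1."""
    def distance(candidate):
        return sum(a != b for a, b in zip(target, candidate))
    return min(video_model, key=distance)

# A tiny "video model": viseme labels standing in for stored mouth clips.
model = [
    ("mid-open", "lips-closed", "open"),   # from /EH-B-AA/
    ("spread", "lips-closed", "open"),     # from /IY-B-AA/
    ("rounded", "lips-closed", "open"),    # from /OW-B-AA/
]

# A new utterance containing /P/ still finds a /B/ clip, because
# /B/ and /P/ fall in the same viseme class.
target = tuple(to_visemes(["IY", "P", "AA"]))
print(closest_clip(target, model))  # → ('spread', 'lips-closed', 'open')
```

Grouping phonemes into viseme classes before matching is what lets the system reuse footage of one phoneme to mouth another, which shrinks the training footage a speaker must provide.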

