Trainable Videorealistic Speech Animation
Tony Ezzat, Gadi Geiger, Tomaso Poggio @ MIT
Presented by: Yinan Fan
04-26-2007

News Coverage
- (July 23, 2002) REPORTS FROM SIGGRAPH-2002 - Wendy Ju: Character Animation
- (July 2, 2002) CNN: Video Research at MIT Puts Words into Mouths
- (June 30, 2002) ASSOCIATED PRESS - Theo Emery: Video Research at MIT Puts Words into Mouths, with Startling Results
- (June 17, 2002) THE DISCOVERY CHANNEL [Toronto, Canada] - Jennifer Scott: Video: Science, Lies & Videotape
- (May 28, 2002) DER SPIEGEL [Germany] - Marco Evers: Videomanipulation: Wie Bilder Luegen Lernen
- (May 20, 2002) NBC TODAY SHOW - Katie Couric: Video (100 Kbps) (300 Kbps)
- (May 20, 2002) MIT NEWS OFFICE: TECH TALK - Deborah Halber: Realistic Animation of Human Face Makes Simulated Talking Look Real
- (May 16, 2002) NPR - "All Things Considered" - Robert Siegel: Audio: MIT Video Lipsync
- (May 16, 2002) TORONTO GLOBE & MAIL - Graeme Smith: Computers Fake Moving Mouths
- (May 15, 2002) BOSTON GLOBE - Gareth Cook: At MIT, They Can Put Words in Our Mouths

Background
- Facial Modeling
  - 3D methods
  - Image-based methods: photorealistic? videorealistic? parsimonious?
  - Video Rewrite
- Speech Animation
  - Keyframe
  - Physics-based
  - Machine learning methods
- Problem: motion, smoothness, dynamics, coarticulation effects…

MMM?
- Well, in some sense… yes…
- Corpus… preprocessed and sorted…
- Principal components selected…
- Relationship? Graph?
- MM space? Data analysis… some new stuff!

MMM… seriously
- Morphable Model Representation
- A low-dimensional space, parameterized by shape parameters α and appearance parameters β
- A "black box" capable of performing
  - Synthesis
  - Analysis

System Overview

Corpus
- A human subject uttering various utterances with a neutral expression
- 640×480 video at 29.97 fps (NTSC), audio at 44.1 kHz
- 15 minutes, 30000 frames
- 152 one-syllable words
- 156 two-syllable words
- 105 short sentences

Pre-processing
- Audio phonetically aligned (using the CMU Sphinx system)
- Each image normalized: head mask
- Planar perspective deformation
- Eye mask

Masks
- The only manual work

MMM: Definition
- A set of prototype images $\{I_i\}_{i=1}^{N}$
- A set of prototype flows $\{C_i\}_{i=1}^{N}$, with $C_i(p) = \{d_x^i(p), d_y^i(p)\}$
- Flows computed using a coarse-to-fine, gradient-based optical flow algorithm

Building MMM
- Task: choose image prototypes and compute correspondence

Building MMM (continued)
- EM-PCA
  - 15 PCA dimensions
  - $I_j$ => $p_j$
- K-Means Clustering
  - Mahalanobis distance metric
  - N = 46: no explicit relationship to visemes
- Dijkstra
  - Corpus graph
  - K-nearest-neighbor frames (k = 20), weighted by Mahalanobis distance (MD)
  - Dijkstra shortest path => 46 correspondences

Synthesis
- Goal: map (α, β) to an image in the MMM
- α: 46-dimensional -> mouth shape
- β: 46-dimensional -> mouth texture

Synthesis
- Steps:
  - Synthesize a new correspondence:
  - Forward Warp a

Analysis
- Goal: project the entire recorded corpus onto the constructed MMM and produce a time series of parameters (α, β) that represent trajectories of the original mouth motion
- Each utterance is analyzed with respect to the 92-dimensional MMM

Analysis
- Estimate parameter α:
- N image warps are synthesized
- Estimate β:

Analysis Result

Trajectory Synthesis
- Goal: map from an input phone stream $\{p_t\}$ to a trajectory of parameters $y_t = (\alpha_t, \beta_t)$ in MMM space
- Phone stream? $\{p_t\}_{t=1}^{15}$ = {\w\, \w\, \w\, \w\, \uh\, \uh\, \uh\, \uh\, \uh\, \uh\, \n\, \n\, \n\, \n\, \n\} => word 'one'

Histogram

Trajectory Synthesis
- Mathematically a regularization problem:
- Minimization:

Training
- Adjust the means and variances to better reflect the training data

Training Result

Post-Processing

Result
- Demos
- Interviews: Discovery, NBC
- Another Example

Evaluations

Discussions
- Viewing conditions?
- 2D -> 3D
- Emotion
- Better video-realism
- Geodesic trajectory
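
Sketch: trajectory synthesis as regularization

The Trajectory Synthesis slide frames the mapping from a phone stream to a parameter trajectory as a regularization problem, but its equations did not survive in these notes. The Python sketch below is therefore only an illustration under stated assumptions, not the authors' formulation: it treats a single scalar parameter, assumes each phone has a target mean and variance, and minimizes $\sum_t (y_t - \mu_{p_t})^2 / \sigma_{p_t}^2 + \lambda \sum_t (y_{t+1} - y_t)^2$. The function name `synthesize_trajectory`, the weight `lam`, and the numeric means and variances are all hypothetical.

```python
def synthesize_trajectory(phones, means, variances, lam=1.0):
    """Smooth 1-D trajectory that stays close to per-phone target means.

    Assumed objective (illustrative only):
        sum_t (y_t - mean[p_t])^2 / var[p_t] + lam * sum_t (y_{t+1} - y_t)^2
    Setting the gradient to zero gives a tridiagonal linear system,
    solved here with the Thomas algorithm.
    """
    n = len(phones)
    target = [means[p] for p in phones]
    weight = [1.0 / variances[p] for p in phones]

    # System matrix: diag(weight) + lam * W^T W, with W the first-difference operator.
    diag = [weight[t] + lam * ((t > 0) + (t < n - 1)) for t in range(n)]
    off = [-lam] * (n - 1)
    rhs = [w * m for w, m in zip(weight, target)]

    # Forward elimination.
    for t in range(1, n):
        factor = off[t - 1] / diag[t - 1]
        diag[t] -= factor * off[t - 1]
        rhs[t] -= factor * rhs[t - 1]

    # Back substitution.
    y = [0.0] * n
    y[-1] = rhs[-1] / diag[-1]
    for t in range(n - 2, -1, -1):
        y[t] = (rhs[t] - off[t] * y[t + 1]) / diag[t]
    return y


if __name__ == "__main__":
    # Phone stream for the word 'one', as on the slide; the targets are made up.
    phones = ["w"] * 4 + ["uh"] * 6 + ["n"] * 5
    means = {"w": 0.0, "uh": 1.0, "n": 0.5}
    variances = {"w": 1.0, "uh": 1.0, "n": 1.0}
    for t, value in enumerate(synthesize_trajectory(phones, means, variances, lam=2.0)):
        print(t, phones[t], round(value, 3))
```

With `lam=0` the result is just the step function of phone means. Larger values trade fidelity to those targets for smoothness, which is one simple way to see how a regularizer can produce coarticulation-like blending between neighboring phones. A smaller variance pulls the trajectory more tightly toward that phone's mean.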
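
Sketch: corpus graph and shortest paths

The Building MMM slide mentions a corpus graph over K-nearest-neighbor frames and a Dijkstra shortest path to each prototype. The sketch below shows that general idea on toy data and should be read as an assumption-laden illustration rather than the authors' code. Frames are represented by small coordinate tuples standing in for PCA parameters, plain Euclidean distance stands in for the Mahalanobis distance named on the slide, and the helper names `knn_graph` and `shortest_path` are hypothetical.

```python
import heapq
import math


def knn_graph(points, k):
    """Link every frame to its k nearest neighbours (Euclidean stand-in metric)."""
    graph = {i: [] for i in range(len(points))}
    for i, p in enumerate(points):
        nearest = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for d, j in nearest[:k]:
            graph[i].append((j, d))
            graph[j].append((i, d))  # keep the graph undirected
    return graph


def shortest_path(graph, source, target):
    """Dijkstra's algorithm; returns the list of frame indices, or None if unreachable."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    # Toy "frames" in a 2-D parameter space; a real corpus would use PCA vectors.
    frames = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
    graph = knn_graph(frames, k=1)
    print(shortest_path(graph, 0, 3))
```

The point of routing through the graph is that each hop connects two similar frames, where optical flow is more likely to be reliable. Chaining the flows along the path then gives a correspondence between frames that are too dissimilar to match directly.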