
Audio-driven facial animation by joint end-to-end learning of pose and emotion

Published: 20 July 2017

Abstract

We present a machine learning technique for driving 3D facial animation by audio input in real time and with low latency. Our deep neural network learns a mapping from input waveforms to the 3D vertex coordinates of a face model, and simultaneously discovers a compact, latent code that disambiguates the variations in facial expression that cannot be explained by the audio alone. During inference, the latent code can be used as an intuitive control for the emotional state of the face puppet.
We train our network with 3--5 minutes of high-quality animation data obtained using traditional, vision-based performance capture methods. Even though our primary goal is to model the speaking style of a single actor, our model yields reasonable results even when driven with audio from other speakers with different gender, accent, or language, as we demonstrate with a user study. The results are applicable to in-game dialogue, low-cost localization, virtual reality avatars, and telepresence.
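The abstract describes a network that maps a window of input audio to the 3D vertex coordinates of a face mesh while conditioning on a latent emotion code that is learned jointly with the rest of the model and exposed as a control at inference time. The sketch below illustrates that general idea only; it is not the authors' architecture. The class name AudioToVertices, the layer sizes, the choice of audio features, and the 16-dimensional emotion code are all illustrative assumptions.

```python
# Minimal sketch (assumptions only, not the paper's network): predict per-vertex
# positions from a window of audio features plus a latent emotion code.
import torch
import torch.nn as nn

class AudioToVertices(nn.Module):
    def __init__(self, n_vertices=5000, audio_channels=32, emotion_dim=16):
        super().__init__()
        self.n_vertices = n_vertices
        # Temporal convolutions summarize the window of per-frame audio features.
        self.audio_net = nn.Sequential(
            nn.Conv1d(audio_channels, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # collapse the time axis to one summary vector
        )
        # Dense layers map the audio summary + emotion code to vertex positions.
        self.decoder = nn.Sequential(
            nn.Linear(128 + emotion_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_vertices * 3),
        )

    def forward(self, audio_window, emotion_code):
        # audio_window: (batch, audio_channels, frames)
        # emotion_code: (batch, emotion_dim); learned during training, chosen
        # freely at inference time to control the expression.
        h = self.audio_net(audio_window).squeeze(-1)          # (batch, 128)
        x = torch.cat([h, emotion_code], dim=1)               # (batch, 128 + emotion_dim)
        return self.decoder(x).view(-1, self.n_vertices, 3)   # (batch, n_vertices, 3)


# Usage with dummy data (shapes are hypothetical):
net = AudioToVertices()
audio = torch.randn(2, 32, 64)   # batch of audio feature windows
emotion = torch.randn(2, 16)     # latent emotion codes
vertices = net(audio, emotion)   # (2, 5000, 3) predicted vertex positions
```

The key design point the sketch tries to capture is that the emotion code enters the decoder alongside the audio summary, so the same audio can drive different facial expressions depending on the code supplied.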

Supplementary Material

ZIP File (a94-karras.zip)
Supplemental files.
MP4 File (papers-0329.mp4)




Published In

ACM Transactions on Graphics, Volume 36, Issue 4
August 2017
2155 pages
ISSN: 0730-0301
EISSN: 1557-7368
DOI: 10.1145/3072959
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. audio
  2. deep learning
  3. facial animation

Qualifiers

  • Research-article

