Visual Speech Synthesis by Morphing Visemes

Tony Ezzat¹ & Tomaso Poggio¹

Abstract

We present MikeTalk, a text-to-audiovisual speech synthesizer which converts input text into an audiovisual speech stream. MikeTalk is built using visemes, which are a small set of images spanning a large range of mouth shapes. The visemes are acquired from a recorded visual corpus of a human subject which is specifically designed to elicit one instantiation of each viseme. Using optical flow methods, correspondence from every viseme to every other viseme is computed automatically. By morphing along this correspondence, a smooth transition between viseme images may be generated. A complete visual utterance is constructed by concatenating viseme transitions. Finally, phoneme and timing information extracted from a text-to-speech synthesizer is exploited to determine which viseme transitions to use, and the rate at which the morphing process should occur. In this manner, we are able to synchronize the visual speech stream with the audio speech stream, and hence give the impression of a photorealistic talking face.
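The pipeline described above (morph along a viseme-to-viseme correspondence, set the morph rate from phoneme timing, concatenate the transitions) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not MikeTalk's implementation: the function names, the 30 fps frame rate, backward warping with nearest-neighbour sampling, and approximating the reverse flow by negating the forward flow. The optical flow fields and the timed viseme sequence are taken as given inputs.

```python
import numpy as np


def warp(image, flow, scale):
    """Backward-warp `image` by `scale * flow` using nearest-neighbour sampling.

    `flow[..., 0]` is the x displacement and `flow[..., 1]` the y displacement.
    This is a simplification chosen for brevity (an assumption of this sketch).
    """
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs - scale * flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys - scale * flow[..., 1]).astype(int), 0, h - 1)
    return image[src_y, src_x]


def morph(img_a, img_b, flow_ab, alpha):
    """Return the intermediate image at position `alpha` in [0, 1] between A and B."""
    flow_ba = -flow_ab  # assumption: reverse correspondence approximated by negation
    warped_a = warp(img_a, flow_ab, alpha)
    warped_b = warp(img_b, flow_ba, 1.0 - alpha)
    # Cross-dissolve the two warped images.
    return (1.0 - alpha) * warped_a + alpha * warped_b


def transition_frames(img_a, img_b, flow_ab, duration_s, fps=30):
    """Generate the frames of one viseme transition lasting `duration_s` seconds."""
    n = max(int(round(duration_s * fps)), 1)
    alphas = np.linspace(0.0, 1.0, n, endpoint=False)
    return [morph(img_a, img_b, flow_ab, a) for a in alphas]


def synthesize(timed_visemes, images, flows, fps=30):
    """Concatenate viseme transitions into one visual utterance.

    `timed_visemes` is a list of (viseme_name, duration_in_seconds) pairs, such
    as might be derived from a text-to-speech front end. `images` maps a viseme
    name to its image; `flows` maps (from_name, to_name) to a flow field.
    The final viseme is held for a single frame.
    """
    frames = []
    for (a, dur), (b, _) in zip(timed_visemes, timed_visemes[1:]):
        frames.extend(transition_frames(images[a], images[b], flows[(a, b)], dur, fps))
    frames.append(images[timed_visemes[-1][0]].astype(float))
    return frames
```

In this sketch the duration attached to each viseme controls how many intermediate frames are generated for the transition that leaves it, which is one simple way of letting the audio timing set the morphing rate.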

Author information

Authors and Affiliations

Center for Biological and Computational Learning, Artificial Intelligence Laboratory, MIT, Cambridge, MA, USA
Tony Ezzat & Tomaso Poggio

About this article

Cite this article

Ezzat, T., Poggio, T. Visual Speech Synthesis by Morphing Visemes. International Journal of Computer Vision 38, 45–57 (2000). https://doi.org/10.1023/A:1008166717597

Issue Date: June 2000
DOI: https://doi.org/10.1023/A:1008166717597
