
Video Rewrite: Driving Visual Speech with Audio

Published: 02 August 2023
DOI: 10.1145/3596711.3596787

Editorial Notes

This paper was originally published as https://doi.org/10.1145/258734.258880.

Abstract

Video Rewrite uses existing footage to automatically create new video of a person mouthing words that she did not speak in the original footage. This technique is useful in movie dubbing, for example, where the movie sequence can be modified to sync the actors' lip motions to the new soundtrack.
Video Rewrite automatically labels the phonemes in the training data and in the new audio track. Video Rewrite reorders the mouth images in the training footage to match the phoneme sequence of the new audio track. When particular phonemes are unavailable in the training footage, Video Rewrite selects the closest approximations. The resulting sequence of mouth images is stitched into the background footage. This stitching process automatically corrects for differences in head position and orientation between the mouth images and the background footage.
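To make the matching step concrete, here is a minimal, hypothetical sketch of phoneme-driven mouth-image selection in Python. It is not the paper's implementation (which matches triphone contexts against the training footage); the `viseme_library` mapping and the `FALLBACK` similarity table are illustrative names introduced here.

```python
# Toy sketch of phoneme-to-mouth-image selection (illustration only,
# not the paper's algorithm). `viseme_library` is a hypothetical dict
# mapping each phoneme label to mouth-image IDs harvested from the
# training footage; FALLBACK lists visually similar phonemes to try
# when a phoneme never occurred in training.

FALLBACK = {
    "B": ["P", "M"],  # bilabials are nearly indistinguishable on the lips
    "F": ["V"],
    "S": ["Z"],
}

def select_mouth_images(phonemes, viseme_library):
    """Pick one mouth-image ID per phoneme of the new audio track."""
    chosen = []
    for ph in phonemes:
        candidates = viseme_library.get(ph, [])
        if not candidates:
            # Phoneme absent from the training footage: fall back to the
            # closest visual approximation, as the paper describes.
            for alt in FALLBACK.get(ph, []):
                candidates = viseme_library.get(alt, [])
                if candidates:
                    break
        chosen.append(candidates[0] if candidates else None)
    return chosen

# Example: a hypothetical library with no "B" footage still yields a
# usable bilabial image via the "P" fallback.
library = {"P": ["img_017"], "AA": ["img_042"], "S": ["img_101"]}
print(select_mouth_images(["B", "AA", "S"], library))  # ['img_017', 'img_042', 'img_101']
```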
Video Rewrite uses computer-vision techniques to track points on the speaker's mouth in the training footage, and morphing techniques to combine these mouth gestures into the final video sequence. The new video combines the dynamics of the original actor's articulations with the mannerisms and setting dictated by the background footage. Video Rewrite is the first facial-animation system to automate all the labeling and assembly tasks required to resync existing footage to a new soundtrack.
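The stitching step can likewise be sketched: given fiducial points tracked on the mouth image and on the background face, estimate a similarity transform and composite the warped mouth region. The following is a simplified stand-in using OpenCV, not the paper's pipeline (which uses eigenpoints tracking and morphing rather than a single affine warp); the function and parameter names are hypothetical.

```python
# Simplified stitching sketch (OpenCV), not the paper's morphing
# pipeline: align the mouth image to the background frame with a
# similarity transform estimated from tracked points, then feather it in.

import cv2
import numpy as np

def stitch_mouth(background, mouth_img, mouth_pts, bg_pts, mouth_mask):
    """Warp mouth_img so mouth_pts land on bg_pts, then alpha-blend.

    background, mouth_img: uint8 BGR images; mouth_pts, bg_pts: Nx2
    float32 arrays of corresponding fiducials; mouth_mask: float32
    matte in [0, 1] marking the mouth region of mouth_img.
    """
    # Rotation + uniform scale + translation corrects for differences in
    # head position and orientation between the two shots.
    M, _ = cv2.estimateAffinePartial2D(mouth_pts, bg_pts)
    h, w = background.shape[:2]
    warped = cv2.warpAffine(mouth_img, M, (w, h)).astype(np.float32)
    alpha = cv2.warpAffine(mouth_mask, M, (w, h))[..., None]
    # Feathered composite: soft mask edges hide the seam.
    out = alpha * warped + (1.0 - alpha) * background.astype(np.float32)
    return out.astype(np.uint8)
```

A soft (feathered) matte is what hides the seam between the transplanted mouth and the background face; a hard binary mask would leave a visible boundary.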



Information

Published In

Seminal Graphics Papers: Pushing the Boundaries, Volume 2
August 2023, 893 pages
ISBN: 9798400708978
DOI: 10.1145/3596711
Editor: Mary C. Whitton

SIGGRAPH '97: Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques
August 1997, 512 pages
ISBN: 0897918967
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Badges

  • Seminal Paper

Author Tags

  1. facial animation
  2. lip sync

Qualifiers

  • Research-article


Article Metrics

  • Downloads (last 12 months): 447
  • Downloads (last 6 weeks): 62
Reflects downloads up to 10 Dec 2024

Cited By

  • (2024) GaussianTalker: Speaker-specific Talking Head Synthesis via 3D Gaussian Splatting. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 3548-3557. DOI: 10.1145/3664647.3681675. Online publication date: 28-Oct-2024.
  • (2024) Style Transfer for 2D Talking Head Generation. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 7500-7509. DOI: 10.1109/CVPRW63382.2024.00745. Online publication date: 17-Jun-2024.
  • (2024) EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis. Computer Vision – ECCV 2024, pp. 398-416. DOI: 10.1007/978-3-031-72658-3_23. Online publication date: 2-Oct-2024.
  • (2024) Introduction to Deepfake Technology and Its Early Foundations. Deepfakes and Their Impact on Business, pp. 1-18. DOI: 10.4018/979-8-3693-6890-9.ch001. Online publication date: 13-Dec-2024.
  • (2024) Multimodal Fusion for Talking Face Generation Utilizing Speech-Related Facial Action Units. ACM Transactions on Multimedia Computing, Communications, and Applications, 20(9), pp. 1-24. DOI: 10.1145/3672565. Online publication date: 17-Jun-2024.
  • (2024) Diff2Lip: Audio Conditioned Diffusion Models for Lip-Synchronization. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 5280-5290. DOI: 10.1109/WACV57701.2024.00521. Online publication date: 3-Jan-2024.
  • (2024) Deep Learning for Visual Speech Analysis: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(9), pp. 6001-6022. DOI: 10.1109/TPAMI.2024.3376710. Online publication date: Sep-2024.
  • (2024) 3-D Facial Priors Guided Local–Global Motion Collaboration Transforms for One-Shot Talking-Head Video Synthesis. IEEE Transactions on Consumer Electronics, 70(1), pp. 132-143. DOI: 10.1109/TCE.2023.3323684. Online publication date: Feb-2024.
  • (2024) A Multidisciplinary Look at History and Future of Deepfake With Gartner Hype Cycle. IEEE Security and Privacy, 22(3), pp. 50-61. DOI: 10.1109/MSEC.2024.3380324. Online publication date: 8-Apr-2024.
  • (2024) ER-NeRF++: Efficient region-aware Neural Radiance Fields for high-fidelity talking portrait synthesis. Information Fusion, 110, article 102456. DOI: 10.1016/j.inffus.2024.102456. Online publication date: Oct-2024.
