
Video Rewrite: Driving Visual Speech with Audio

Published: 02 August 2023
DOI: 10.1145/3596711.3596787

Editorial Notes

This paper was originally published as https://doi.org/10.1145/258734.258880.

Abstract

Video Rewrite uses existing footage to automatically create new video of a person mouthing words that she did not speak in the original footage. This technique is useful in movie dubbing, for example, where the movie sequence can be modified to sync the actors' lip motions to the new soundtrack.
Video Rewrite automatically labels the phonemes in the training data and in the new audio track. Video Rewrite reorders the mouth images in the training footage to match the phoneme sequence of the new audio track. When particular phonemes are unavailable in the training footage, Video Rewrite selects the closest approximations. The resulting sequence of mouth images is stitched into the background footage. This stitching process automatically corrects for differences in head position and orientation between the mouth images and the background footage.
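To make the matching step concrete, here is a minimal, hypothetical sketch of phoneme-driven mouth-image selection in Python. It is not the paper's implementation (which matches triphone contexts against the training footage); the `viseme_library` mapping and the `FALLBACK` similarity table are illustrative names introduced here.

```python
# Toy sketch of phoneme-to-mouth-image selection (illustration only,
# not the paper's algorithm). `viseme_library` is a hypothetical dict
# mapping each phoneme label to mouth-image IDs harvested from the
# training footage; FALLBACK lists visually similar phonemes to try
# when a phoneme never occurred in training.

FALLBACK = {
    "B": ["P", "M"],  # bilabials are nearly indistinguishable on the lips
    "F": ["V"],
    "S": ["Z"],
}

def select_mouth_images(phonemes, viseme_library):
    """Pick one mouth-image ID per phoneme of the new audio track."""
    chosen = []
    for ph in phonemes:
        candidates = viseme_library.get(ph, [])
        if not candidates:
            # Phoneme absent from the training footage: fall back to the
            # closest visual approximation, as the paper describes.
            for alt in FALLBACK.get(ph, []):
                candidates = viseme_library.get(alt, [])
                if candidates:
                    break
        chosen.append(candidates[0] if candidates else None)
    return chosen

# Example: a hypothetical library with no "B" footage still yields a
# usable bilabial image via the "P" fallback.
library = {"P": ["img_017"], "AA": ["img_042"], "S": ["img_101"]}
print(select_mouth_images(["B", "AA", "S"], library))  # ['img_017', 'img_042', 'img_101']
```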
Video Rewrite uses computer-vision techniques to track points on the speaker's mouth in the training footage, and morphing techniques to combine these mouth gestures into the final video sequence. The new video combines the dynamics of the original actor's articulations with the mannerisms and setting dictated by the background footage. Video Rewrite is the first facial-animation system to automate all the labeling and assembly tasks required to resync existing footage to a new soundtrack.
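The stitching step can likewise be sketched: given fiducial points tracked on the mouth image and on the background face, estimate a similarity transform and composite the warped mouth region. The following is a simplified stand-in using OpenCV, not the paper's pipeline (which uses eigenpoints tracking and morphing rather than a single affine warp); the function and parameter names are hypothetical.

```python
# Simplified stitching sketch (OpenCV), not the paper's morphing
# pipeline: align the mouth image to the background frame with a
# similarity transform estimated from tracked points, then feather it in.

import cv2
import numpy as np

def stitch_mouth(background, mouth_img, mouth_pts, bg_pts, mouth_mask):
    """Warp mouth_img so mouth_pts land on bg_pts, then alpha-blend.

    background, mouth_img: uint8 BGR images; mouth_pts, bg_pts: Nx2
    float32 arrays of corresponding fiducials; mouth_mask: float32
    matte in [0, 1] marking the mouth region of mouth_img.
    """
    # Rotation + uniform scale + translation corrects for differences in
    # head position and orientation between the two shots.
    M, _ = cv2.estimateAffinePartial2D(mouth_pts, bg_pts)
    h, w = background.shape[:2]
    warped = cv2.warpAffine(mouth_img, M, (w, h)).astype(np.float32)
    alpha = cv2.warpAffine(mouth_mask, M, (w, h))[..., None]
    # Feathered composite: soft mask edges hide the seam.
    out = alpha * warped + (1.0 - alpha) * background.astype(np.float32)
    return out.astype(np.uint8)
```

A soft (feathered) matte is what hides the seam between the transplanted mouth and the background face; a hard binary mask would leave a visible boundary.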



Information

Published In

Seminal Graphics Papers: Pushing the Boundaries, Volume 2
August 2023, 893 pages
ISBN: 9798400708978
DOI: 10.1145/3596711
Editor: Mary C. Whitton

SIGGRAPH '97: Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques
August 1997, 512 pages
ISBN: 0897918967
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Badges

  • Seminal Paper

Author Tags

  1. facial animation
  2. lip sync

Qualifiers

  • Research-article


Article Metrics

  • Downloads (last 12 months): 447
  • Downloads (last 6 weeks): 62
Reflects downloads up to 10 Dec 2024

Cited By

  • (2024) GaussianTalker: Speaker-specific Talking Head Synthesis via 3D Gaussian Splatting. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 3548-3557. DOI: 10.1145/3664647.3681675. Online publication date: 28-Oct-2024.
  • (2024) Style Transfer for 2D Talking Head Generation. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 7500-7509. DOI: 10.1109/CVPRW63382.2024.00745. Online publication date: 17-Jun-2024.
  • (2024) EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis. Computer Vision – ECCV 2024, pp. 398-416. DOI: 10.1007/978-3-031-72658-3_23. Online publication date: 2-Oct-2024.
  • (2024) Introduction to Deepfake Technology and Its Early Foundations. Deepfakes and Their Impact on Business, pp. 1-18. DOI: 10.4018/979-8-3693-6890-9.ch001. Online publication date: 13-Dec-2024.
  • (2024) Multimodal Fusion for Talking Face Generation Utilizing Speech-Related Facial Action Units. ACM Transactions on Multimedia Computing, Communications, and Applications, 20(9), pp. 1-24. DOI: 10.1145/3672565. Online publication date: 17-Jun-2024.
  • (2024) Diff2Lip: Audio Conditioned Diffusion Models for Lip-Synchronization. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 5280-5290. DOI: 10.1109/WACV57701.2024.00521. Online publication date: 3-Jan-2024.
  • (2024) Deep Learning for Visual Speech Analysis: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(9), pp. 6001-6022. DOI: 10.1109/TPAMI.2024.3376710. Online publication date: Sep-2024.
  • (2024) 3-D Facial Priors Guided Local–Global Motion Collaboration Transforms for One-Shot Talking-Head Video Synthesis. IEEE Transactions on Consumer Electronics, 70(1), pp. 132-143. DOI: 10.1109/TCE.2023.3323684. Online publication date: Feb-2024.
  • (2024) A Multidisciplinary Look at History and Future of Deepfake With Gartner Hype Cycle. IEEE Security and Privacy, 22(3), pp. 50-61. DOI: 10.1109/MSEC.2024.3380324. Online publication date: 8-Apr-2024.
  • (2024) ER-NeRF++: Efficient region-aware Neural Radiance Fields for high-fidelity talking portrait synthesis. Information Fusion, 110, article 102456. DOI: 10.1016/j.inffus.2024.102456. Online publication date: Oct-2024.
