research-article

A Deep Dive Into Neural Synchrony Evaluation for Audio-visual Translation

Authors:

Christian Schuler,

Timo Baumann

ICMI '22: Proceedings of the 2022 International Conference on Multimodal Interaction

Pages 642 - 647

https://doi.org/10.1145/3536221.3556621

Published: 07 November 2022

Abstract

We present a comprehensive analysis of SyncNet, a neural tool for evaluating audio-visual synchrony. We assess how well SyncNet scores agree with human perception and whether they can serve as a reliable metric for audio-visual lip synchrony in generation tasks that have no ground-truth reference audio-video pair. We further examine which underlying elements of the audio and video most strongly affect synchrony, using interpretable explanations of SyncNet predictions, and we probe its susceptibility by introducing adversarial noise. SyncNet has been used in numerous papers on visually grounded text-to-speech for scenarios such as dubbing. We focus on this scenario, which features many local asynchronies, something that SyncNet was not designed for.
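As an illustration of what "agreement with human perception" can mean in practice, the following minimal sketch computes a Spearman rank correlation between automatic synchrony scores and human ratings for the same clips. It is not the procedure used in the paper, which this page does not describe: the correlation measure, the function names, and the numbers are all assumptions chosen for illustration.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-clip values, invented for this example only.
metric_scores = [7.1, 3.4, 5.0, 1.2, 6.3]  # automatic synchrony scores
human_ratings = [4.5, 2.0, 3.5, 1.0, 4.0]  # mean listener ratings

rho = spearman(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.3f}")
```

A value of rho near 1 would indicate that the metric orders clips the way listeners do; a value near 0 would indicate that the metric is uninformative about perceived synchrony for that set of clips.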

Cited By

Yaman D, Eyiokur F, Bärmann L, Ekenel H, Waibel A (2024). Audio-Driven Talking Face Generation with Stabilized Synchronization Loss. Computer Vision – ECCV 2024, 417–435. Online publication date: 6-Dec-2024
https://doi.org/10.1007/978-3-031-72655-2_24

Index Terms

A Deep Dive Into Neural Synchrony Evaluation for Audio-visual Translation
1. Applied computing
  1. Arts and humanities
    1. Language translation
2. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing

Information & Contributors

Information

Published In

ICMI '22: Proceedings of the 2022 International Conference on Multimodal Interaction

November 2022

830 pages

ISBN:9781450393904

DOI:10.1145/3536221

Editors:
Raj Tumuluri (Openstream),
Nicu Sebe (University of Trento),
Gopal Pingali (Accenture),
Dinesh Babu Jayagopi (IIIT Bangalore),
Abhinav Dhall (IIT Ropar),
Richa Singh (IIT Jodhpur),
Lisa Anthony (University of Florida),
Albert Ali Salah (Utrecht University and Boğaziçi University)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGCHI: ACM Special Interest Group on Computer-Human Interaction

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

ICMI '22

Sponsor:

SIGCHI

ICMI '22: INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION

November 7 - 11, 2022

Bengaluru, India

Acceptance Rates

Overall Acceptance Rate 453 of 1,080 submissions, 42%
