DOI: 10.1145/3536221.3558068
Exemplar-based Stylized Gesture Generation from Speech: An Entry to the GENEA Challenge 2022

Published: 07 November 2022 Publication History

Abstract

We present our entry to the GENEA Challenge 2022 on data-driven co-speech gesture generation. Our system is a neural network that generates gesture animation from an input audio file. The style of the generated motion is extracted from an exemplar motion clip and embedded in a latent space using a variational framework, which allows the model to generate in styles unseen during training. Moreover, the probabilistic nature of the variational framework enables the generation of a variety of outputs for the same input, reflecting the stochastic nature of gesture motion. The GENEA Challenge evaluation showed that our model produces full-body motion with highly competitive levels of human-likeness.
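The variational style embedding described above can be illustrated with the standard reparameterization trick: a style encoder maps an exemplar clip to a mean and log-variance, and style vectors are sampled from that distribution, so the same exemplar can drive a variety of gesture outputs. A minimal NumPy sketch (the names, dimensions, and values are hypothetical, not taken from the paper):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z ~ N(mu, diag(exp(log_var))) via the reparameterization trick."""
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

# Hypothetical style statistics; in the paper these would be produced by a
# style encoder applied to an exemplar motion clip.
mu = np.zeros(8)        # latent style mean
log_var = np.zeros(8)   # log-variance (unit variance here)

rng = np.random.default_rng(0)
z1 = reparameterize(mu, log_var, rng)
z2 = reparameterize(mu, log_var, rng)
# z1 and z2 differ: the same exemplar yields varied style samples,
# and hence varied gestures for the same input speech.
```

Because sampling is differentiable in `mu` and `log_var`, the style encoder can be trained end-to-end with the gesture generator, which is what makes the variational framework compatible with a single neural network from audio to motion.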


Cited By

  • (2024) Evaluating Gesture Generation in a Large-scale Open Challenge: The GENEA Challenge 2022. ACM Transactions on Graphics 43(3), 1–28. https://doi.org/10.1145/3656374
  • (2023) Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models. ACM Transactions on Graphics 42(4), 1–20. https://doi.org/10.1145/3592458
  • (2023) The FineMotion entry to the GENEA Challenge 2023: DeepPhase for conversational gestures generation. Proceedings of the 25th International Conference on Multimodal Interaction, 786–791. https://doi.org/10.1145/3577190.3616119
  • (2023) AQ-GT: a Temporally Aligned and Quantized GRU-Transformer for Co-Speech Gesture Synthesis. Proceedings of the 25th International Conference on Multimodal Interaction, 60–69. https://doi.org/10.1145/3577190.3614135
  • (2023) Augmented Co-Speech Gesture Generation. Proceedings of the 23rd ACM International Conference on Intelligent Virtual Agents, 1–8. https://doi.org/10.1145/3570945.3607337
  • (2023) How Far ahead Can Model Predict Gesture Pose from Speech and Spoken Text? Proceedings of the 23rd ACM International Conference on Intelligent Virtual Agents, 1–3. https://doi.org/10.1145/3570945.3607336
  • (2023) A Comprehensive Review of Data-Driven Co-Speech Gesture Generation. Computer Graphics Forum 42(2), 569–596. https://doi.org/10.1111/cgf.14776
  • (2022) The GENEA Challenge 2022: A large evaluation of data-driven co-speech gesture generation. Proceedings of the 2022 International Conference on Multimodal Interaction, 736–747. https://doi.org/10.1145/3536221.3558058


Published In

ICMI '22: Proceedings of the 2022 International Conference on Multimodal Interaction
November 2022, 830 pages
ISBN: 9781450393904
DOI: 10.1145/3536221

Publisher

Association for Computing Machinery, New York, NY, United States
      Author Tags

      1. computer animation
      2. gesture generation
      3. machine learning
      4. style transfer

Conference

ICMI '22
Overall Acceptance Rate: 453 of 1,080 submissions, 42%
