DOI: 10.1145/3308532.3329472
research-article

Analyzing Input and Output Representations for Speech-Driven Gesture Generation

Published: 01 July 2019

Abstract

This paper presents a novel framework for automatic speech-driven gesture generation, applicable to human-agent interaction including both virtual agents and robots. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech as input and produces gestures as output, in the form of a sequence of 3D coordinates.
Our approach consists of two steps. First, we learn a lower-dimensional representation of human motion using a denoising autoencoder neural network, consisting of a motion encoder MotionE and a motion decoder MotionD. The learned representation preserves the most important aspects of the human pose variation while removing less relevant variation. Second, we train a novel encoder network SpeechE to map from speech to a corresponding motion representation with reduced dimensionality. At test time, the speech encoder and the motion decoder networks are combined: SpeechE predicts motion representations based on a given speech signal and MotionD then decodes these representations to produce motion sequences.
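The two-step design described above can be illustrated with a minimal sketch in PyTorch (this is not the authors' released code). The module names MotionE, MotionD and SpeechE follow the paper; the layer sizes, the noise level and the use of per-frame feed-forward layers are illustrative assumptions, since the full system operates on sequences.

    # Minimal sketch of the two-step architecture; all hyperparameters are assumptions.
    import torch
    import torch.nn as nn

    POSE_DIM = 192     # assumed: flattened 3D joint coordinates for one frame
    SPEECH_DIM = 26    # assumed: speech feature vector (e.g. MFCCs) for one frame
    REPR_DIM = 45      # assumed: size of the learned motion representation

    class MotionE(nn.Module):
        """Encodes a pose vector into a lower-dimensional representation."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(POSE_DIM, 128), nn.ReLU(),
                                     nn.Linear(128, REPR_DIM))
        def forward(self, pose):
            return self.net(pose)

    class MotionD(nn.Module):
        """Decodes a motion representation back into a pose vector."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(REPR_DIM, 128), nn.ReLU(),
                                     nn.Linear(128, POSE_DIM))
        def forward(self, z):
            return self.net(z)

    class SpeechE(nn.Module):
        """Maps speech features into the motion representation space."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(SPEECH_DIM, 128), nn.ReLU(),
                                     nn.Linear(128, REPR_DIM))
        def forward(self, speech):
            return self.net(speech)

    motion_e, motion_d, speech_e = MotionE(), MotionD(), SpeechE()

    # Step 1: train MotionE/MotionD as a denoising autoencoder on pose data.
    pose = torch.randn(8, POSE_DIM)               # stand-in for real motion-capture frames
    noisy = pose + 0.1 * torch.randn_like(pose)   # corrupt the input, reconstruct the clean pose
    recon_loss = nn.functional.mse_loss(motion_d(motion_e(noisy)), pose)

    # Step 2: train SpeechE to predict the (frozen) motion representation from speech.
    speech = torch.randn(8, SPEECH_DIM)           # stand-in for time-aligned speech features
    target_repr = motion_e(pose).detach()
    repr_loss = nn.functional.mse_loss(speech_e(speech), target_repr)

    # Test time: chain SpeechE and MotionD to map speech directly to poses.
    with torch.no_grad():
        predicted_pose = motion_d(speech_e(speech))

A full implementation would optimize these losses over whole speech and motion sequences from a gesture dataset rather than random tensors.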
We evaluate different representation sizes to find the most effective dimensionality for the representation. We also evaluate the effect of using different speech features as input to the model, and find that mel-frequency cepstral coefficients (MFCCs), alone or combined with prosodic features, perform best. The results of a subsequent user study confirm the benefits of the representation learning.
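For context on the speech features mentioned above, the short sketch below shows one common way to compute MFCCs from a waveform using the librosa library. The file name, sampling rate, number of coefficients and frame rate are illustrative assumptions, not the paper's exact feature pipeline; prosodic features such as pitch and energy can be appended per frame to form a combined input.

    # Minimal MFCC extraction sketch with librosa; parameter values are assumptions.
    import librosa

    audio, sr = librosa.load("speech.wav", sr=16000)        # hypothetical input recording
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=26,
                                hop_length=sr // 20)        # ~20 feature frames per second
    print(mfcc.shape)   # (n_mfcc, n_frames): one coefficient vector per analysis frame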

Published In

IVA '19: Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents
July 2019
282 pages
ISBN: 9781450366724
DOI: 10.1145/3308532
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 July 2019

Author Tags

  1. deep learning
  2. gesture generation
  3. gesture synthesis
  4. neural network
  5. representation learning
  6. social robotics
  7. virtual agents

Qualifiers

  • Research-article

Funding Sources

  • Swedish Foundation for Strategic Research
  • JSPS Grant-in-Aid for Young Scientists (B)

Conference

IVA '19

Acceptance Rates

IVA '19 paper acceptance rate: 15 of 63 submissions (24%).
Overall acceptance rate: 53 of 196 submissions (27%).

Article Metrics

  • Downloads (last 12 months): 110
  • Downloads (last 6 weeks): 17
Reflects downloads up to 19 Dec 2024

Cited By

  • (2024) Creating Expressive Social Robots That Convey Symbolic and Spontaneous Communication. Sensors 24(11), 3671. DOI: 10.3390/s24113671. Online publication date: 5-Jun-2024.
  • (2024) Exploring the Effectiveness of Evaluation Practices for Computer-Generated Nonverbal Behaviour. Applied Sciences 14(4), 1460. DOI: 10.3390/app14041460. Online publication date: 10-Feb-2024.
  • (2024) A speech-based convolutional neural network for human body posture classification. Journal of Intelligent Systems 33(1). DOI: 10.1515/jisys-2022-0326. Online publication date: 15-Oct-2024.
  • (2024) Multi-Resolution Generative Modeling of Human Motion from Limited Data. Proceedings of the 21st ACM SIGGRAPH Conference on Visual Media Production, 1-10. DOI: 10.1145/3697294.3697309. Online publication date: 18-Nov-2024.
  • (2024) A Learning-based Co-Speech Gesture Generation System for Social Robots. Proceedings of the 12th International Conference on Human-Agent Interaction, 453-455. DOI: 10.1145/3687272.3690915. Online publication date: 24-Nov-2024.
  • (2024) Towards interpretable co-speech gestures synthesis using STARGATE. Companion Proceedings of the 26th International Conference on Multimodal Interaction, 138-146. DOI: 10.1145/3686215.3688819. Online publication date: 4-Nov-2024.
  • (2024) Body Gesture Generation for Multimodal Conversational Agents. SIGGRAPH Asia 2024 Conference Papers, 1-11. DOI: 10.1145/3680528.3687648. Online publication date: 3-Dec-2024.
  • (2024) Learning Co-Speech Gesture Representations in Dialogue through Contrastive Learning: An Intrinsic Evaluation. Proceedings of the 26th International Conference on Multimodal Interaction, 274-283. DOI: 10.1145/3678957.3685707. Online publication date: 4-Nov-2024.
  • (2024) Evaluating Gesture Generation in a Large-scale Open Challenge: The GENEA Challenge 2022. ACM Transactions on Graphics 43(3), 1-28. DOI: 10.1145/3656374. Online publication date: 27-Apr-2024.
  • (2024) A Transfer Learning Approach for Music-driven 3D Conducting Motion Generation with Limited Data. Proceedings of the 30th ACM Symposium on Virtual Reality Software and Technology, 1-2. DOI: 10.1145/3641825.3689531. Online publication date: 9-Oct-2024.
