DOI: 10.1145/3308532.3329472
research-article

Analyzing Input and Output Representations for Speech-Driven Gesture Generation

Published: 01 July 2019

Abstract

This paper presents a novel framework for automatic speech-driven gesture generation, applicable to human-agent interaction including both virtual agents and robots. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech as input and produces gestures as output, in the form of a sequence of 3D coordinates.
Our approach consists of two steps. First, we learn a lower-dimensional representation of human motion using a denoising autoencoder neural network, consisting of a motion encoder MotionE and a motion decoder MotionD. The learned representation preserves the most important aspects of the human pose variation while removing less relevant variation. Second, we train a novel encoder network SpeechE to map from speech to a corresponding motion representation with reduced dimensionality. At test time, the speech encoder and the motion decoder networks are combined: SpeechE predicts motion representations based on a given speech signal and MotionD then decodes these representations to produce motion sequences.
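The two-step design described above can be illustrated with a minimal sketch in PyTorch (this is not the authors' released code). The module names MotionE, MotionD and SpeechE follow the paper; the layer sizes, the noise level and the use of per-frame feed-forward layers are illustrative assumptions, since the full system operates on sequences.

    # Minimal sketch of the two-step architecture; all hyperparameters are assumptions.
    import torch
    import torch.nn as nn

    POSE_DIM = 192     # assumed: flattened 3D joint coordinates for one frame
    SPEECH_DIM = 26    # assumed: speech feature vector (e.g. MFCCs) for one frame
    REPR_DIM = 45      # assumed: size of the learned motion representation

    class MotionE(nn.Module):
        """Encodes a pose vector into a lower-dimensional representation."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(POSE_DIM, 128), nn.ReLU(),
                                     nn.Linear(128, REPR_DIM))
        def forward(self, pose):
            return self.net(pose)

    class MotionD(nn.Module):
        """Decodes a motion representation back into a pose vector."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(REPR_DIM, 128), nn.ReLU(),
                                     nn.Linear(128, POSE_DIM))
        def forward(self, z):
            return self.net(z)

    class SpeechE(nn.Module):
        """Maps speech features into the motion representation space."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(SPEECH_DIM, 128), nn.ReLU(),
                                     nn.Linear(128, REPR_DIM))
        def forward(self, speech):
            return self.net(speech)

    motion_e, motion_d, speech_e = MotionE(), MotionD(), SpeechE()

    # Step 1: train MotionE/MotionD as a denoising autoencoder on pose data.
    pose = torch.randn(8, POSE_DIM)               # stand-in for real motion-capture frames
    noisy = pose + 0.1 * torch.randn_like(pose)   # corrupt the input, reconstruct the clean pose
    recon_loss = nn.functional.mse_loss(motion_d(motion_e(noisy)), pose)

    # Step 2: train SpeechE to predict the (frozen) motion representation from speech.
    speech = torch.randn(8, SPEECH_DIM)           # stand-in for time-aligned speech features
    target_repr = motion_e(pose).detach()
    repr_loss = nn.functional.mse_loss(speech_e(speech), target_repr)

    # Test time: chain SpeechE and MotionD to map speech directly to poses.
    with torch.no_grad():
        predicted_pose = motion_d(speech_e(speech))

A full implementation would optimize these losses over whole speech and motion sequences from a gesture dataset rather than random tensors.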
We evaluate different representation sizes to find the most effective dimensionality for the representation. We also evaluate the effect of using different speech features as input to the model, and find that mel-frequency cepstral coefficients (MFCCs), alone or combined with prosodic features, perform best. The results of a subsequent user study confirm the benefits of the representation learning.
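For context on the speech features mentioned above, the short sketch below shows one common way to compute MFCCs from a waveform using the librosa library. The file name, sampling rate, number of coefficients and frame rate are illustrative assumptions, not the paper's exact feature pipeline; prosodic features such as pitch and energy can be appended per frame to form a combined input.

    # Minimal MFCC extraction sketch with librosa; parameter values are assumptions.
    import librosa

    audio, sr = librosa.load("speech.wav", sr=16000)        # hypothetical input recording
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=26,
                                hop_length=sr // 20)        # ~20 feature frames per second
    print(mfcc.shape)   # (n_mfcc, n_frames): one coefficient vector per analysis frame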

Published In

IVA '19: Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents
July 2019
282 pages
ISBN: 9781450366724
DOI: 10.1145/3308532
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 July 2019

Author Tags

  1. deep learning
  2. gesture generation
  3. gesture synthesis
  4. neural network
  5. representation learning
  6. social robotics
  7. virtual agents

Qualifiers

  • Research-article

Funding Sources

  • Swedish Foundation for Strategic Research
  • JSPS Grant-in-Aid for Young Scientists (B)

Conference

IVA '19

Acceptance Rates

IVA '19 paper acceptance rate: 15 of 63 submissions (24%).
Overall acceptance rate: 53 of 196 submissions (27%).

Article Metrics

  • Downloads (last 12 months): 110
  • Downloads (last 6 weeks): 17
Reflects downloads up to 19 Dec 2024

Cited By

  • (2024) Creating Expressive Social Robots That Convey Symbolic and Spontaneous Communication. Sensors 24(11), 3671. DOI: 10.3390/s24113671. Online publication date: 5-Jun-2024.
  • (2024) Exploring the Effectiveness of Evaluation Practices for Computer-Generated Nonverbal Behaviour. Applied Sciences 14(4), 1460. DOI: 10.3390/app14041460. Online publication date: 10-Feb-2024.
  • (2024) A speech-based convolutional neural network for human body posture classification. Journal of Intelligent Systems 33(1). DOI: 10.1515/jisys-2022-0326. Online publication date: 15-Oct-2024.
  • (2024) Multi-Resolution Generative Modeling of Human Motion from Limited Data. Proceedings of the 21st ACM SIGGRAPH Conference on Visual Media Production, 1-10. DOI: 10.1145/3697294.3697309. Online publication date: 18-Nov-2024.
  • (2024) A Learning-based Co-Speech Gesture Generation System for Social Robots. Proceedings of the 12th International Conference on Human-Agent Interaction, 453-455. DOI: 10.1145/3687272.3690915. Online publication date: 24-Nov-2024.
  • (2024) Towards interpretable co-speech gestures synthesis using STARGATE. Companion Proceedings of the 26th International Conference on Multimodal Interaction, 138-146. DOI: 10.1145/3686215.3688819. Online publication date: 4-Nov-2024.
  • (2024) Body Gesture Generation for Multimodal Conversational Agents. SIGGRAPH Asia 2024 Conference Papers, 1-11. DOI: 10.1145/3680528.3687648. Online publication date: 3-Dec-2024.
  • (2024) Learning Co-Speech Gesture Representations in Dialogue through Contrastive Learning: An Intrinsic Evaluation. Proceedings of the 26th International Conference on Multimodal Interaction, 274-283. DOI: 10.1145/3678957.3685707. Online publication date: 4-Nov-2024.
  • (2024) Evaluating Gesture Generation in a Large-scale Open Challenge: The GENEA Challenge 2022. ACM Transactions on Graphics 43(3), 1-28. DOI: 10.1145/3656374. Online publication date: 27-Apr-2024.
  • (2024) A Transfer Learning Approach for Music-driven 3D Conducting Motion Generation with Limited Data. Proceedings of the 30th ACM Symposium on Virtual Reality Software and Technology, 1-2. DOI: 10.1145/3641825.3689531. Online publication date: 9-Oct-2024.
