DOI: 10.1145/3472307.3484167

Speech-based Gesture Generation for Robots and Embodied Agents: A Scoping Review

Published: 09 November 2021

Abstract

Humans use gestures as a means of non-verbal communication. Often accompanying speech, these gestures serve several purposes but, in general, aim to convey an intended message to the receiver. Researchers have tried to develop systems that allow embodied agents to be better communicators when interacting with humans through gesture. In this article, we present a scoping literature review of the methods and metrics used to generate and evaluate co-speech gestures. After collecting a set of papers via a term search on the Scopus database, we analysed their content in terms of methodology (i.e., the model and the dataset used), evaluation measures (i.e., objective and subjective), and limitations. The results indicate that data-driven approaches are used more frequently. In terms of evaluation measures, we found a trend of combining objective and subjective metrics, although no standards exist for either. This literature review provides an overview of research in the area and, more specifically, insight into the trends and the challenges to be met in building a system that automatically generates gestures for embodied agents.
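As a concrete illustration of the data-driven approaches discussed above, the sketch below shows a minimal speech-to-gesture model of the kind surveyed in the review: a bidirectional LSTM maps a sequence of speech features (e.g., MFCCs) to a sequence of joint rotations, and a simple objective metric (average jerk, a commonly used smoothness measure) scores the generated motion. All names, feature sizes, and hyperparameters here are illustrative assumptions, not the setup of any particular reviewed system.

# A minimal sketch of a speech-driven gesture generator (PyTorch).
# Architecture and dimensions are assumptions for illustration only.
import torch
import torch.nn as nn

class SpeechToGesture(nn.Module):
    def __init__(self, n_audio_feats=26, n_joints=15, hidden=256):
        super().__init__()
        # A bidirectional LSTM encodes the audio feature sequence.
        self.rnn = nn.LSTM(n_audio_feats, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        # A linear head regresses 3 rotation values per joint per frame.
        self.head = nn.Linear(2 * hidden, n_joints * 3)

    def forward(self, audio_feats):  # (batch, frames, n_audio_feats)
        h, _ = self.rnn(audio_feats)
        return self.head(h)          # (batch, frames, n_joints * 3)

def average_jerk(poses, fps=20.0):
    # Objective smoothness metric: mean magnitude of the third time
    # derivative of the pose sequence (lower values = smoother motion).
    dt = 1.0 / fps
    jerk = torch.diff(poses, n=3, dim=1) / dt ** 3
    return jerk.norm(dim=-1).mean()

# Toy usage: a 2-second clip at 20 fps with random "audio" features.
model = SpeechToGesture()
audio = torch.randn(1, 40, 26)
motion = model(audio)
print(motion.shape, average_jerk(motion).item())

Objective scores such as the one above capture only part of gesture quality; subjective evaluation typically relies on human ratings of naturalness or appropriateness, and the review finds that the two are often combined.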




          Published In

          HAI '21: Proceedings of the 9th International Conference on Human-Agent Interaction
          November 2021
          447 pages
ISBN: 978-1-4503-8620-3
DOI: 10.1145/3472307
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


          Publisher

          Association for Computing Machinery

          New York, NY, United States



          Author Tags

          1. co-speech gestures
          2. gesture generation
          3. literature review
          4. robot
          5. survey

          Qualifiers

          • Research-article
          • Research
          • Refereed limited

          Conference

HAI '21: International Conference on Human-Agent Interaction
November 9 - 11, 2021
Virtual Event, Japan

          Acceptance Rates

Overall acceptance rate: 121 of 404 submissions (30%)


          Cited By

• (2024) Development of a Personal Guide Robot That Leads a Guest Hand-in-Hand While Keeping a Distance. Sensors 24(7), 2345. https://doi.org/10.3390/s24072345. Online publication date: 7-Apr-2024.
• (2024) Exploring the Impact of Non-Verbal Virtual Agent Behavior on User Engagement in Argumentative Dialogues. Proceedings of the 12th International Conference on Human-Agent Interaction, 224-232. https://doi.org/10.1145/3687272.3688315. Online publication date: 24-Nov-2024.
• (2024) Evaluating Gesture Generation in a Large-scale Open Challenge: The GENEA Challenge 2022. ACM Transactions on Graphics 43(3), 1-28. https://doi.org/10.1145/3656374. Online publication date: 27-Apr-2024.
• (2024) A Study on Integrating Representational Gestures into Automatically Generated Embodied Explanations. Proceedings of the 24th ACM International Conference on Intelligent Virtual Agents, 1-5. https://doi.org/10.1145/3652988.3673919. Online publication date: 16-Sep-2024.
• (2024) Initial Study on Robot Emotional Expression Using Manpu. Companion of the 2024 ACM/IEEE International Conference on Human-Robot Interaction, 463-467. https://doi.org/10.1145/3610978.3640652. Online publication date: 11-Mar-2024.
• (2024) Data-driven Communicative Behaviour Generation: A Survey. ACM Transactions on Human-Robot Interaction 13(1), 1-39. https://doi.org/10.1145/3609235. Online publication date: 30-Jan-2024.
• (2024) Fake it to make it: Using synthetic data to remedy the data shortage in joint multi-modal speech-and-gesture synthesis. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 1952-1964. https://doi.org/10.1109/CVPRW63382.2024.00201. Online publication date: 17-Jun-2024.
• (2024) Design and Implementation of a Storytelling Robot: Preliminary Evaluation of a GAN-Based Model for Co-Speech Gesture Generation. Ambient Assisted Living, 373-385. https://doi.org/10.1007/978-3-031-77318-1_25. Online publication date: 20-Dec-2024.
• (2023) Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models. ACM Transactions on Graphics 42(4), 1-20. https://doi.org/10.1145/3592458. Online publication date: 26-Jul-2023.
• (2023) A Comprehensive Review of Data-Driven Co-Speech Gesture Generation. Computer Graphics Forum 42(2), 569-596. https://doi.org/10.1111/cgf.14776. Online publication date: 23-May-2023.
