DOI: 10.1145/3472307.3484167

Speech-based Gesture Generation for Robots and Embodied Agents: A Scoping Review

Published: 09 November 2021

Abstract

Humans use gestures as a means of non-verbal communication. Often accompanying speech, these gestures serve several purposes but, in general, aim to convey an intended message to the receiver. Researchers have tried to develop systems that allow embodied agents to be better communicators when interacting with humans through gesture. In this article, we present a scoping literature review of the methods and metrics used to generate and evaluate co-speech gestures. After collecting a set of papers via a term search on the Scopus database, we analysed their content in terms of methodology (i.e., the model and the dataset used), evaluation measures (i.e., objective and subjective), and limitations. The results indicate that data-driven approaches are used more frequently. In terms of evaluation measures, we found a trend of combining objective and subjective metrics, although no standards exist for either. This literature review provides an overview of research in the area and, more specifically, insight into the trends and the challenges to be met in building a system that automatically generates gestures for embodied agents.
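As a concrete illustration of the data-driven approaches discussed above, the sketch below shows a minimal speech-to-gesture model of the kind surveyed in the review: a bidirectional LSTM maps a sequence of speech features (e.g., MFCCs) to a sequence of joint rotations, and a simple objective metric (average jerk, a commonly used smoothness measure) scores the generated motion. All names, feature sizes, and hyperparameters here are illustrative assumptions, not the setup of any particular reviewed system.

# A minimal sketch of a speech-driven gesture generator (PyTorch).
# Architecture and dimensions are assumptions for illustration only.
import torch
import torch.nn as nn

class SpeechToGesture(nn.Module):
    def __init__(self, n_audio_feats=26, n_joints=15, hidden=256):
        super().__init__()
        # A bidirectional LSTM encodes the audio feature sequence.
        self.rnn = nn.LSTM(n_audio_feats, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        # A linear head regresses 3 rotation values per joint per frame.
        self.head = nn.Linear(2 * hidden, n_joints * 3)

    def forward(self, audio_feats):  # (batch, frames, n_audio_feats)
        h, _ = self.rnn(audio_feats)
        return self.head(h)          # (batch, frames, n_joints * 3)

def average_jerk(poses, fps=20.0):
    # Objective smoothness metric: mean magnitude of the third time
    # derivative of the pose sequence (lower values = smoother motion).
    dt = 1.0 / fps
    jerk = torch.diff(poses, n=3, dim=1) / dt ** 3
    return jerk.norm(dim=-1).mean()

# Toy usage: a 2-second clip at 20 fps with random "audio" features.
model = SpeechToGesture()
audio = torch.randn(1, 40, 26)
motion = model(audio)
print(motion.shape, average_jerk(motion).item())

Objective scores such as the one above capture only part of gesture quality; subjective evaluation typically relies on human ratings of naturalness or appropriateness, and the review finds that the two are often combined.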




          Published In

          HAI '21: Proceedings of the 9th International Conference on Human-Agent Interaction
          November 2021
          447 pages
ISBN: 978-1-4503-8620-3
DOI: 10.1145/3472307
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


          Publisher

          Association for Computing Machinery

          New York, NY, United States



          Author Tags

          1. co-speech gestures
          2. gesture generation
          3. literature review
          4. robot
          5. survey

          Qualifiers

          • Research-article
          • Research
          • Refereed limited

          Conference

HAI '21: International Conference on Human-Agent Interaction
November 9 - 11, 2021
Virtual Event, Japan

          Acceptance Rates

Overall acceptance rate: 121 of 404 submissions (30%)


          Cited By

• (2024) Development of a Personal Guide Robot That Leads a Guest Hand-in-Hand While Keeping a Distance. Sensors 24(7), 2345. https://doi.org/10.3390/s24072345. Online publication date: 7-Apr-2024.
• (2024) Exploring the Impact of Non-Verbal Virtual Agent Behavior on User Engagement in Argumentative Dialogues. Proceedings of the 12th International Conference on Human-Agent Interaction, 224-232. https://doi.org/10.1145/3687272.3688315. Online publication date: 24-Nov-2024.
• (2024) Evaluating Gesture Generation in a Large-scale Open Challenge: The GENEA Challenge 2022. ACM Transactions on Graphics 43(3), 1-28. https://doi.org/10.1145/3656374. Online publication date: 27-Apr-2024.
• (2024) A Study on Integrating Representational Gestures into Automatically Generated Embodied Explanations. Proceedings of the 24th ACM International Conference on Intelligent Virtual Agents, 1-5. https://doi.org/10.1145/3652988.3673919. Online publication date: 16-Sep-2024.
• (2024) Initial Study on Robot Emotional Expression Using Manpu. Companion of the 2024 ACM/IEEE International Conference on Human-Robot Interaction, 463-467. https://doi.org/10.1145/3610978.3640652. Online publication date: 11-Mar-2024.
• (2024) Data-driven Communicative Behaviour Generation: A Survey. ACM Transactions on Human-Robot Interaction 13(1), 1-39. https://doi.org/10.1145/3609235. Online publication date: 30-Jan-2024.
• (2024) Fake it to make it: Using synthetic data to remedy the data shortage in joint multi-modal speech-and-gesture synthesis. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 1952-1964. https://doi.org/10.1109/CVPRW63382.2024.00201. Online publication date: 17-Jun-2024.
• (2024) Design and Implementation of a Storytelling Robot: Preliminary Evaluation of a GAN-Based Model for Co-Speech Gesture Generation. Ambient Assisted Living, 373-385. https://doi.org/10.1007/978-3-031-77318-1_25. Online publication date: 20-Dec-2024.
• (2023) Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models. ACM Transactions on Graphics 42(4), 1-20. https://doi.org/10.1145/3592458. Online publication date: 26-Jul-2023.
• (2023) A Comprehensive Review of Data-Driven Co-Speech Gesture Generation. Computer Graphics Forum 42(2), 569-596. https://doi.org/10.1111/cgf.14776. Online publication date: 23-May-2023.
