DOI: 10.1109/ICRA.2018.8460608
Research article

Text2Action: Generative Adversarial Synthesis from Language to Action

Published: 21 May 2018

Abstract

We propose a generative model that learns the relationship between language and human action in order to generate a human action sequence from a sentence describing human behavior. The model is a generative adversarial network (GAN) based on the sequence-to-sequence (SEQ2SEQ) model. Using a text encoder recurrent neural network (RNN) and an action decoder RNN, it can synthesize various actions for a robot or a virtual agent. The network is trained on 29,770 pairs of actions and sentence annotations extracted from MSR-Video-to-Text (MSR-VTT), a large-scale video dataset. We demonstrate that the network generates human-like actions that can be transferred to a Baxter robot, so that the robot performs an action based on a provided sentence. Results show that the proposed network correctly models the relationship between language and action and can generate a diverse set of actions from the same sentence.
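The encoder-decoder structure described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a vanilla tanh RNN in place of whatever recurrent cell the paper uses, random untrained weights, toy word vectors, and a noise vector `z` fed to the decoder at every step. The class name `ToyText2ActionGenerator` and all dimensions are hypothetical. It only shows how one sentence encoding, combined with different noise samples, can yield different action sequences; the discriminator and adversarial training are omitted.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Random weights; a stand-in for parameters that would be learned."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rnn_step(w_in, w_h, x, h):
    """One vanilla RNN step: h' = tanh(W_in x + W_h h)."""
    a = matvec(w_in, x)
    b = matvec(w_h, h)
    return [math.tanh(p + q) for p, q in zip(a, b)]


class ToyText2ActionGenerator:
    """Hypothetical generator: text encoder RNN followed by action decoder RNN."""

    def __init__(self, embed_dim=4, hidden_dim=8, noise_dim=3, pose_dim=6, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        self.pose_dim = pose_dim
        self.enc_in = make_matrix(hidden_dim, embed_dim, rng)
        self.enc_h = make_matrix(hidden_dim, hidden_dim, rng)
        # Assumption: the decoder sees the previous pose concatenated with z.
        self.dec_in = make_matrix(hidden_dim, pose_dim + noise_dim, rng)
        self.dec_h = make_matrix(hidden_dim, hidden_dim, rng)
        self.out = make_matrix(pose_dim, hidden_dim, rng)

    def encode(self, word_vectors):
        """Run the text encoder over the sentence; return the final hidden state."""
        h = [0.0] * self.hidden_dim
        for x in word_vectors:
            h = rnn_step(self.enc_in, self.enc_h, x, h)
        return h

    def generate(self, word_vectors, z, num_frames):
        """Decode num_frames pose vectors from the sentence encoding and noise z."""
        h = self.encode(word_vectors)
        pose = [0.0] * self.pose_dim
        frames = []
        for _ in range(num_frames):
            h = rnn_step(self.dec_in, self.dec_h, pose + z, h)
            pose = matvec(self.out, h)
            frames.append(pose)
        return frames


# Usage: one (toy) sentence, two noise samples, two different action sequences.
gen = ToyText2ActionGenerator()
sentence = [[0.1, 0.2, 0.0, -0.3], [0.5, -0.1, 0.4, 0.0]]  # made-up word vectors
noise_rng = random.Random(1)
z1 = [noise_rng.gauss(0, 1) for _ in range(3)]
z2 = [noise_rng.gauss(0, 1) for _ in range(3)]
action_1 = gen.generate(sentence, z1, num_frames=5)
action_2 = gen.generate(sentence, z2, num_frames=5)
```

In a GAN setting, a discriminator would score (sentence, action) pairs and the generator weights would be trained against it; here the weights stay random, so the outputs are not meaningful poses.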



Published In

2018 IEEE International Conference on Robotics and Automation (ICRA)
May 2018
5954 pages

Publisher

IEEE Press

Cited By

  • (2024) AdaptControl: Adaptive Human Motion Control and Generation via User Prompt and Spatial Trajectory Guidance. Proceedings of the 5th International Workshop on Human-centric Multimedia Analysis, pp. 13-22. DOI: 10.1145/3688865.3689476. Online publication date: 28-Oct-2024
  • (2024) CPoser: An Optimization-after-Parsing Approach for Text-to-Pose Generation Using Large Language Models. ACM Transactions on Graphics 43:6, pp. 1-13. DOI: 10.1145/3687932. Online publication date: 19-Dec-2024
  • (2024) Decoupling Contact for Fine-Grained Motion Style Transfer. SIGGRAPH Asia 2024 Conference Papers, pp. 1-11. DOI: 10.1145/3680528.3687609. Online publication date: 3-Dec-2024
  • (2024) MoConVQ: Unified Physics-Based Motion Control via Scalable Discrete Representations. ACM Transactions on Graphics 43:4, pp. 1-21. DOI: 10.1145/3658137. Online publication date: 19-Jul-2024
  • (2024) Flexible Motion In-betweening with Diffusion Models. ACM SIGGRAPH 2024 Conference Papers, pp. 1-9. DOI: 10.1145/3641519.3657414. Online publication date: 13-Jul-2024
  • (2023) Act as you wish. Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 15497-15518. DOI: 10.5555/3666122.3666803. Online publication date: 10-Dec-2023
  • (2023) The KCL-SAIR team's entry to the GENEA Challenge 2023 Exploring Role-based Gesture Generation in Dyadic Interactions: Listener vs. Speaker. Companion Publication of the 25th International Conference on Multimodal Interaction, pp. 214-219. DOI: 10.1145/3610661.3616555. Online publication date: 9-Oct-2023
  • (2023) Frame-Level Event Representation Learning for Semantic-Level Generation and Editing of Avatar Motion. Proceedings of the 25th International Conference on Multimodal Interaction, pp. 292-300. DOI: 10.1145/3577190.3614175. Online publication date: 9-Oct-2023
  • (2022) AvatarCLIP. ACM Transactions on Graphics 41:4, pp. 1-19. DOI: 10.1145/3528223.3530094. Online publication date: 22-Jul-2022
  • (2021) Generation of Variable-Length Time Series from Text using Dynamic Time Warping-Based Method. Proceedings of the 3rd ACM International Conference on Multimedia in Asia, pp. 1-7. DOI: 10.1145/3469877.3495644. Online publication date: 1-Dec-2021
