DOI: 10.1145/3577190.3614175
research-article
Open access

Frame-Level Event Representation Learning for Semantic-Level Generation and Editing of Avatar Motion

Published: 09 October 2023

Abstract

Understanding an avatar’s motion and controlling its content is important for content creation and has been actively studied in computer vision and graphics. An avatar’s motion consists of frames, each representing a pose at a point in time, and subsequences of frames can be grouped into segments according to their semantic meaning. Semantic-level control of motion therefore requires understanding this semantic division of the avatar’s motion. We define a semantic division of an avatar’s motion as an “event”, which switches only when a frame of the motion cannot be predicted from the previous frames and the information of the last event, and we tackle editing motion and inferring motion from text on the basis of events. This is challenging because we must both obtain the event information and control the content of the motion based on it. To overcome this challenge, we propose obtaining a frame-level event representation from a pair of motion and text and using it to edit events in the motion and to predict motion from the text. Specifically, we learn a frame-level event representation by reconstructing the avatar’s motion from the corresponding sequence of frame-level event representations while inferring that sequence from the text. This allows motion to be predicted from text. Moreover, since the event at each motion frame is captured by the corresponding event representation, events in the motion can be edited by editing the corresponding event representation sequence. We evaluated our method on the HumanML3D dataset and demonstrated that our model can generate motion from text while editing motion flexibly (e.g., changing an event’s duration, modifying an event’s characteristics, and adding new events).
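
To make the pipeline described above concrete, here is a minimal sketch, assuming a PyTorch VAE-style setup with GRU encoders and decoders; all class names, layer choices, and dimensions (including the 263-dimensional HumanML3D-style pose features) are illustrative assumptions rather than the authors’ implementation. A motion encoder infers one event latent per frame, a decoder reconstructs the motion from that latent sequence, and a text encoder predicts the same sequence so that motion can be generated from text alone.

```python
# Minimal sketch, assuming a PyTorch VAE-style setup; every class name,
# layer choice, and dimension below is an illustrative assumption, not the
# authors' actual architecture.
import torch
import torch.nn as nn


class FrameEventVAE(nn.Module):
    def __init__(self, pose_dim=263, event_dim=64, hidden=256):
        # pose_dim=263 assumes the common HumanML3D pose feature size.
        super().__init__()
        # Motion -> per-frame event posterior (mean and log-variance).
        self.motion_enc = nn.GRU(pose_dim, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, event_dim)
        self.to_logvar = nn.Linear(hidden, event_dim)
        # Per-frame event latent sequence -> reconstructed motion.
        self.motion_dec = nn.GRU(event_dim, hidden, batch_first=True)
        self.to_pose = nn.Linear(hidden, pose_dim)

    def encode(self, motion):                      # motion: (B, T, pose_dim)
        h, _ = self.motion_enc(motion)             # (B, T, hidden)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return z, mu, logvar                       # z: one event latent per frame

    def decode(self, z):                           # z: (B, T, event_dim)
        h, _ = self.motion_dec(z)
        return self.to_pose(h)                     # reconstructed poses (B, T, pose_dim)


class TextToEvents(nn.Module):
    """Maps tokenized text to a per-frame event latent sequence (simplified)."""
    def __init__(self, vocab_size=10000, event_dim=64, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.text_enc = nn.GRU(hidden, hidden, batch_first=True)
        self.to_event = nn.Linear(hidden, event_dim)

    def forward(self, tokens, num_frames):         # tokens: (B, L) int64
        _, h = self.text_enc(self.embed(tokens))   # h: (1, B, hidden)
        # Broadcasting the sentence state over all frames is a simplification;
        # the paper infers a genuinely frame-varying sequence.
        h = h[-1].unsqueeze(1).expand(-1, num_frames, -1)
        return self.to_event(h)                    # (B, T, event_dim)
```

Under these assumptions, editing an event corresponds to editing the matching slice of the per-frame latent sequence before decoding, e.g., repeating latents to lengthen an event or splicing in new latents to add one.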

Supplemental Material

• MP4 File: presentation video
• PDF File: supplementary file of the paper
• ZIP File: videos in the main paper



Published In

ICMI '23: Proceedings of the 25th International Conference on Multimodal Interaction
October 2023
858 pages
ISBN: 9798400700552
DOI: 10.1145/3577190
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

Published: 09 October 2023

Author Tags

1. motion content control
2. natural language
3. variational autoencoder

Qualifiers

• Research-article
• Research
• Refereed limited

Funding Sources

• JST Moonshot R&D
• CREST

Conference

ICMI '23

Acceptance Rates

Overall Acceptance Rate: 453 of 1,080 submissions, 42%
