DOI: 10.1145/3394171.3413548
research-article

MS2L: Multi-Task Self-Supervised Learning for Skeleton Based Action Recognition

Published: 12 October 2020

Abstract

In this paper, we address self-supervised representation learning from human skeletons for action recognition. Previous methods, which usually learn feature representations from a single reconstruction task, are prone to overfitting, and the learned features do not generalize well to action recognition. Instead, we propose to integrate multiple tasks to learn more general representations in a self-supervised manner. To realize this goal, we integrate motion prediction, jigsaw puzzle recognition, and contrastive learning to learn skeleton features from different aspects. Skeleton dynamics are modeled through motion prediction, which predicts the future sequence. Temporal patterns, which are critical for action recognition, are learned by solving jigsaw puzzles. We further regularize the feature space by contrastive learning. In addition, we explore different training strategies to utilize the knowledge from self-supervised tasks for action recognition. We evaluate our multi-task self-supervised learning approach with action classifiers trained under different configurations, including unsupervised, semi-supervised, and fully supervised settings. Our experiments on the NW-UCLA, NTU RGB+D, and PKU-MMD datasets show remarkable performance for action recognition, demonstrating the superiority of our method in learning more discriminative and general features. Our project website is available at https://langlandslin.github.io/projects/MSL/.
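
The jigsaw puzzle task mentioned in the abstract can be illustrated with a minimal sketch. It assumes a skeleton sequence is a list of frames, that the sequence is cut into a small number of contiguous temporal segments, and that the network would be trained to classify which permutation was applied. The function name `make_jigsaw_sample`, the default of three segments, and the toy data are hypothetical choices for illustration, not the configuration used in the paper.

```python
import itertools
import random


def make_jigsaw_sample(sequence, num_segments=3, rng=None):
    """Shuffle temporal segments of a sequence and return (shuffled, label).

    The label is the index of the applied permutation, so the pretext task
    becomes a classification problem with num_segments! classes.
    """
    rng = rng or random.Random()
    if len(sequence) < num_segments:
        raise ValueError("sequence is shorter than the number of segments")
    # Integer cut points that split the frames into contiguous segments.
    bounds = [i * len(sequence) // num_segments for i in range(num_segments + 1)]
    segments = [sequence[bounds[i]:bounds[i + 1]] for i in range(num_segments)]
    # Each ordering of the segments is one class of the pretext task.
    permutations = list(itertools.permutations(range(num_segments)))
    label = rng.randrange(len(permutations))
    shuffled = [frame for idx in permutations[label] for frame in segments[idx]]
    return shuffled, label


# Toy sequence (assumed layout): 6 frames, each with 2 joints given as (x, y, z).
frames = [[(float(t), 0.0, 0.0), (float(t), 1.0, 0.0)] for t in range(6)]
shuffled, label = make_jigsaw_sample(frames, num_segments=3, rng=random.Random(0))
print(label, [frame[0][0] for frame in shuffled])
```

A classifier trained to recover `label` from `shuffled` has to attend to temporal order, which is the kind of temporal pattern the abstract describes as critical for action recognition.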

Supplementary Material

MP4 File (3394171.3413548.mp4)
The paper is titled MS2L: Multi-Task Self-Supervised Learning for Skeleton Based Action Recognition. In this paper, we address self-supervised learning that uses multiple tasks to learn more general features from human skeletons for action recognition. Previous methods, which usually learn features from a single reconstruction task, are prone to overfitting. Instead, we propose to use multiple tasks to learn more general features in a self-supervised manner. For the self-supervised tasks, we use motion prediction, jigsaw puzzle recognition, and contrastive learning to learn skeleton features. In addition, we explore different training strategies for transferring the knowledge from self-supervised tasks to supervised action recognition, including a pretraining strategy and a joint training strategy. We evaluate our multi-task self-supervised learning approach in unsupervised, semi-supervised, supervised, and transfer learning settings. Our experiments show remarkable performance for action recognition.
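
To make the contrastive learning component more concrete, here is a minimal sketch of a generic InfoNCE-style contrastive loss on feature vectors. It is not the exact loss from the paper: the function names, the temperature value, and the use of cosine similarity are assumptions made for illustration, and feature vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce_loss(anchor, positive, negatives, temperature=0.5):
    """Generic InfoNCE-style loss (assumed form, not the paper's definition).

    The loss is small when the anchor is more similar to the positive
    (e.g. another view of the same sequence) than to the negatives.
    """
    pos = math.exp(cosine_similarity(anchor, positive) / temperature)
    neg = sum(
        math.exp(cosine_similarity(anchor, n) / temperature) for n in negatives
    )
    return -math.log(pos / (pos + neg))


# Hypothetical 2-D features: the first positive is close to the anchor,
# the second points in the opposite direction.
anchor = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
loss_close = info_nce_loss(anchor, [1.0, 0.1], negatives)
loss_far = info_nce_loss(anchor, [-1.0, 0.0], negatives)
print(loss_close < loss_far)
```

Minimizing such a loss pulls features of related sequences together and pushes unrelated ones apart, which is one way to regularize a feature space.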




    Published In

    MM '20: Proceedings of the 28th ACM International Conference on Multimedia
    October 2020
    4889 pages
    ISBN:9781450379885
    DOI:10.1145/3394171

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. action recognition
    2. multi-task
    3. self-supervised learning

    Qualifiers

    • Research-article

    Conference

    MM '20

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

    Cited By

    • (2025) A unified framework for unsupervised action learning via global-to-local motion transformer. Pattern Recognition 159 (111118). DOI: 10.1016/j.patcog.2024.111118. Online publication date: Mar-2025
    • (2024) OTM-HC: Enhanced Skeleton-Based Action Representation via One-to-Many Hierarchical Contrastive Learning. AI 5:4 (2170-2186). DOI: 10.3390/ai5040106. Online publication date: 1-Nov-2024
    • (2024) Unsupervised Temporal Adaptation in Skeleton-Based Human Action Recognition. Algorithms 17:12 (581). DOI: 10.3390/a17120581. Online publication date: 16-Dec-2024
    • (2024) Multiple Distilling-based spatial-temporal attention networks for unsupervised human action recognition. Intelligent Data Analysis 28:4 (921-941). DOI: 10.3233/IDA-230399. Online publication date: 17-Jul-2024
    • (2024) PSp-Transformer: A Transformer with Data-level Probabilistic Sparsity for Action Representation Learning. ITE Transactions on Media Technology and Applications 12:1 (123-132). DOI: 10.3169/mta.12.123. Online publication date: 2024
    • (2024) Enhancing human behavior recognition with spatiotemporal graph convolutional neural networks and skeleton sequences. EURASIP Journal on Advances in Signal Processing 2024:1. DOI: 10.1186/s13634-024-01156-w. Online publication date: 7-May-2024
    • (2024) How to Improve Video Analytics with Action Recognition: A Survey. ACM Computing Surveys 57:1 (1-36). DOI: 10.1145/3679011. Online publication date: 7-Oct-2024
    • (2024) Multi-Task Spatial-Temporal Graph Auto-Encoder for Hand Motion Denoising. IEEE Transactions on Visualization and Computer Graphics 30:10 (6754-6769). DOI: 10.1109/TVCG.2023.3337868. Online publication date: Oct-2024
    • (2024) Self-Supervised 3D Action Representation Learning With Skeleton Cloud Colorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 46:1 (509-524). DOI: 10.1109/TPAMI.2023.3325463. Online publication date: Jan-2024
    • (2024) GRA: Graph Representation Alignment for Semi-Supervised Action Recognition. IEEE Transactions on Neural Networks and Learning Systems 35:9 (11896-11905). DOI: 10.1109/TNNLS.2023.3347593. Online publication date: Sep-2024
