MS2L: Multi-Task Self-Supervised Learning for Skeleton Based Action Recognition

Authors: Lilang Lin, Sijie Song, Wenhan Yang, Jiaying Liu

MM '20: Proceedings of the 28th ACM International Conference on Multimedia

Pages 2490 - 2498

https://doi.org/10.1145/3394171.3413548

Published: 12 October 2020

Abstract

In this paper, we address self-supervised representation learning from human skeletons for action recognition. Previous methods usually learn feature representations from a single reconstruction task, so they are prone to overfitting and their features do not generalize well to action recognition. Instead, we propose to integrate multiple tasks to learn more general representations in a self-supervised manner. To realize this goal, we integrate motion prediction, jigsaw puzzle recognition, and contrastive learning to learn skeleton features from different aspects. Motion prediction models skeleton dynamics by predicting the future sequence. Solving jigsaw puzzles teaches the model temporal patterns, which are critical for action recognition. Contrastive learning further regularizes the feature space. Besides, we explore different training strategies to utilize the knowledge from self-supervised tasks for action recognition. We evaluate our multi-task self-supervised learning approach with action classifiers trained under different configurations, including unsupervised, semi-supervised, and fully-supervised settings. Our experiments on the NW-UCLA, NTU RGB+D, and PKUMMD datasets show strong action recognition performance, indicating that our method learns more discriminative and general features. Our project website is available at https://langlandslin.github.io/projects/MSL/.
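As a rough illustration of the jigsaw puzzle task mentioned in the abstract, the sketch below cuts a skeleton sequence into temporal segments, shuffles them, and uses the index of the applied permutation as the classification label. It is a minimal sketch, not the authors' implementation: the function name `make_jigsaw_sample`, the choice of three segments, and the use of every possible permutation as a class are assumptions made for illustration.

```python
import itertools
import random


def make_jigsaw_sample(sequence, num_segments=3, rng=None):
    """Shuffle the temporal segments of a sequence of frames.

    sequence: list of frames (each frame would hold the joint coordinates).
    Returns (shuffled_sequence, label), where label is the index of the
    permutation that was applied to the segments.
    Hypothetical helper; names and defaults are illustrative assumptions.
    """
    rng = rng or random.Random()
    if len(sequence) < num_segments:
        raise ValueError("sequence is shorter than the number of segments")

    # Every ordering of the segments is one class for the puzzle classifier.
    permutations = list(itertools.permutations(range(num_segments)))
    label = rng.randrange(len(permutations))

    # Split along the time axis into near-equal contiguous segments.
    bounds = [i * len(sequence) // num_segments for i in range(num_segments + 1)]
    segments = [sequence[bounds[i]:bounds[i + 1]] for i in range(num_segments)]

    # Reassemble the segments in the sampled order.
    shuffled = [frame for i in permutations[label] for frame in segments[i]]
    return shuffled, label


# Toy usage: integers stand in for frames so the reordering is easy to see.
frames = list(range(12))
shuffled, label = make_jigsaw_sample(frames, num_segments=3, rng=random.Random(0))
print(label, shuffled)
```

A network trained to predict `label` from `shuffled` has to reason about the temporal order of the segments, which is the kind of temporal pattern the abstract says this task is meant to capture.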

Supplementary Material

MP4 File (3394171.3413548.mp4)

The paper is titled MS2L: Multi-Task Self-Supervised Learning for Skeleton Based Action Recognition. In this paper, we use self-supervised learning to learn more general features from human skeletons for action recognition. Previous methods usually learn features from a single reconstruction task and may therefore overfit. Instead, we propose to use multiple tasks to learn more general features in a self-supervised manner. Specifically, we use motion prediction, jigsaw puzzle recognition, and contrastive learning as the self-supervised tasks for learning skeleton features. We also explore different training strategies for transferring the knowledge from self-supervised tasks to supervised action recognition, including a pretraining strategy and a joint training strategy. We evaluate our multi-task self-supervised learning approach in unsupervised, semi-supervised, supervised, and transfer learning settings. Our experiments show strong performance for action recognition.

