Abstract
Temporal action detection (TAD), which aims to determine the temporal extent and category of a human action simultaneously from a continuous data stream, remains a challenging problem in human–robot interaction, somatosensory games, and security monitoring. In this paper, we present a novel one-stage skeleton-based TAD method, Action-CenterNet (ACNet), with a simple anchor-free, fully convolutional encoder–decoder pipeline. Our approach encodes skeleton position and motion sequences from multiple persons into multi-channel skeleton images, which are then preprocessed with view-invariant and translation-scale-invariant transforms. ACNet models each action fragment as a center point along the time dimension and generates a keypoint heatmap to locate and classify action fragments. To ensure accurate temporal coordinates, the discretization error introduced by the network's output stride is also learned. Compared with two-stage methods, ACNet is end-to-end differentiable and flexible. ACNet is also anchor-free, avoiding the drawbacks of the anchor boxes used in anchor-based TAD methods. Experimental results on the PKU-MMD, NTU RGB+D, and HITvs datasets demonstrate the excellent performance of our approach.
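The abstract describes a CenterNet-style decoding step: peaks in a per-class temporal heatmap mark action centers, and a learned offset corrects the discretization error introduced by the network's output stride. The sketch below illustrates that decoding idea only; it is not the paper's implementation, and the function name, array shapes, and threshold are illustrative assumptions.

```python
import numpy as np

def decode_actions(heatmap, offset, length, stride=4, threshold=0.3):
    """Decode center-point predictions into action fragments (illustrative sketch).

    heatmap: (C, T) per-class center scores in [0, 1] on the strided time axis
    offset:  (T,)  sub-cell correction for the output-stride discretization
    length:  (T,)  predicted fragment duration, in input frames
    Returns a list of (class_id, start_frame, end_frame, score) tuples.
    """
    C, T = heatmap.shape
    detections = []
    for c in range(C):
        for t in range(T):
            score = heatmap[c, t]
            # keep only local peaks above the threshold (a simple 1-D NMS)
            left = heatmap[c, t - 1] if t > 0 else -np.inf
            right = heatmap[c, t + 1] if t < T - 1 else -np.inf
            if score < threshold or score < left or score <= right:
                continue
            # the learned offset recovers the true center lost to the stride
            center = (t + offset[t]) * stride
            detections.append((c, center - length[t] / 2.0,
                               center + length[t] / 2.0, float(score)))
    return detections
```

With a stride of 4, a peak at cell 3 with offset 0.5 and predicted length 16 decodes to the interval centered at frame 14, i.e. frames 6 to 22.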
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (62176072), the National Key Research and Development Program of China (No. 2019YFB1310004), and Self-Planned Task No. SKLRS202111B of the State Key Laboratory of Robotics and System (HIT).
Cite this article
Yang, J., Wang, K. & Li, R. Actions as points: a simple and efficient detector for skeleton-based temporal action detection. Machine Vision and Applications 34, 35 (2023). https://doi.org/10.1007/s00138-023-01377-3