Research article · DOI: 10.1145/3474085.3475438

TSA-Net: Tube Self-Attention Network for Action Quality Assessment

Published: 17 October 2021

Abstract

In recent years, assessing action quality from videos has attracted growing attention in the computer vision and human-computer interaction communities. Most existing approaches tackle this problem by directly migrating models from action recognition tasks, which ignores intrinsic differences within the feature map, such as foreground and background information. To address this issue, we propose a Tube Self-Attention Network (TSA-Net) for action quality assessment (AQA). Specifically, we introduce a single-object tracker into AQA and propose the Tube Self-Attention (TSA) module, which efficiently generates rich spatio-temporal contextual information by adopting sparse feature interactions. The TSA module can be embedded in existing video networks to form TSA-Net. Overall, TSA-Net has three merits: 1) high computational efficiency, 2) high flexibility, and 3) state-of-the-art performance. Extensive experiments are conducted on the popular action quality assessment datasets AQA-7 and MTL-AQA. In addition, a dataset named Fall Recognition in Figure Skating (FR-FS) is proposed to explore basic action assessment in the figure-skating scene. TSA-Net achieves Spearman's rank correlations of 0.8476 on AQA-7 and 0.9393 on MTL-AQA, both new state-of-the-art results. The results on FR-FS also verify the effectiveness of TSA-Net. The code and the FR-FS dataset are publicly available at https://github.com/Shunli-Wang/TSA-Net.
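The core idea in the abstract, self-attention restricted to a tracker-derived "tube" of foreground positions rather than the whole T×H×W feature map, can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation: the shapes, the toy box mask, and the residual non-local formulation are all illustrative assumptions.

```python
import numpy as np

# Sketch of tube self-attention: non-local (dot-product) attention
# computed only among feature positions inside a tube mask produced
# by a single-object tracker. Background positions pass through.
rng = np.random.default_rng(0)
T, H, W, C = 4, 8, 8, 16                    # frames, height, width, channels
feats = rng.standard_normal((T, H, W, C))

# Binary tube mask: True = foreground. Here a toy box covering the athlete.
tube = np.zeros((T, H, W), dtype=bool)
tube[:, 2:6, 3:7] = True

def tube_self_attention(x, mask):
    """Self-attention among masked positions only; others are untouched."""
    out = x.copy()
    sel = x[mask]                            # (N, C): only in-tube features interact
    attn = sel @ sel.T / np.sqrt(x.shape[-1])
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)  # row-wise softmax
    out[mask] = x[mask] + attn @ sel         # residual, as in non-local blocks
    return out

y = tube_self_attention(feats, tube)
print(np.allclose(y[~tube], feats[~tube]))  # True: background unchanged
print(np.allclose(y[tube], feats[tube]))    # False: tube positions updated
```

Because attention is computed over only the N in-tube positions instead of all T·H·W positions, the pairwise interaction cost drops from O((THW)²) to O(N²), which is the computational-efficiency claim the abstract makes.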

Supplementary Material

ZIP File (mfp1510aux.zip)
In this supplementary material, we analyze the computational complexity of TSA-Net in detail and present additional visualization cases.
MP4 File (Presentation video of TSA-Net.mp4)
Presentation video of the Tube Self-Attention Network (TSA-Net).
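The evaluation metric quoted in the abstract, Spearman's rank correlation between predicted and judge-assigned scores, depends only on ranks. A minimal pure-Python sketch, with illustrative score values (not from the paper):

```python
# Spearman's rank correlation: Pearson correlation of the rank vectors.
# Score values below are made up for illustration.

def ranks(xs):
    # Average 1-based ranks; ties receive the mean of their positions.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(pred, true):
    rp, rt = ranks(pred), ranks(true)
    n = len(rp)
    mp, mt = sum(rp) / n, sum(rt) / n
    cov = sum((a - mp) * (b - mt) for a, b in zip(rp, rt))
    sp = sum((a - mp) ** 2 for a in rp) ** 0.5
    st = sum((b - mt) ** 2 for b in rt) ** 0.5
    return cov / (sp * st)

preds = [71.2, 85.5, 60.3, 92.0, 78.8]   # model scores (illustrative)
truth = [70.0, 88.0, 62.5, 95.0, 90.0]   # judge scores (illustrative)
print(round(spearman(preds, truth), 4))  # prints 0.9
```

A correlation of 1.0 means the model orders every performance exactly as the judges do; the 0.8476 and 0.9393 reported in the abstract indicate near-judge-level ranking on AQA-7 and MTL-AQA, respectively.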





Published In

cover image ACM Conferences
MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN:9781450386517
DOI:10.1145/3474085
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. action quality assessment
  2. self-attention mechanism
  3. video action analysis

Qualifiers

  • Research-article


Conference

MM '21
Sponsor:
MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)74
  • Downloads (Last 6 weeks)10
Reflects downloads up to 10 Dec 2024

Cited By
  • (2025) Dual-referenced assistive network for action quality assessment. Neurocomputing, vol. 614, article 128786. DOI: 10.1016/j.neucom.2024.128786. Online publication date: Jan-2025.
  • (2025) Vision-based human action quality assessment: A systematic review. Expert Systems with Applications, vol. 263, article 125642. DOI: 10.1016/j.eswa.2024.125642. Online publication date: Mar-2025.
  • (2024) PECoP: Parameter Efficient Continual Pretraining for Action Quality Assessment. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 42-52. DOI: 10.1109/WACV57701.2024.00012. Online publication date: 3-Jan-2024.
  • (2024) Self-Supervised Sub-Action Parsing Network for Semi-Supervised Action Quality Assessment. IEEE Transactions on Image Processing, vol. 33, 6057-6070. DOI: 10.1109/TIP.2024.3468870. Online publication date: 2024.
  • (2024) Multimodal Action Quality Assessment. IEEE Transactions on Image Processing, vol. 33, 1600-1613. DOI: 10.1109/TIP.2024.3362135. Online publication date: 1-Jan-2024.
  • (2024) Learning Sparse Temporal Video Mapping for Action Quality Assessment in Floor Gymnastics. IEEE Transactions on Instrumentation and Measurement, vol. 73, 1-11. DOI: 10.1109/TIM.2024.3398072. Online publication date: 2024.
  • (2024) Continual Action Assessment via Task-Consistent Score-Discriminative Feature Distribution Modeling. IEEE Transactions on Circuits and Systems for Video Technology, 34(10), 9112-9124. DOI: 10.1109/TCSVT.2024.3396692. Online publication date: Oct-2024.
  • (2024) Spectral-Wise Implicit Neural Representation for Hyperspectral Image Reconstruction. IEEE Transactions on Circuits and Systems for Video Technology, 34(5), 3714-3727. DOI: 10.1109/TCSVT.2023.3318366. Online publication date: May-2024.
  • (2024) CPR-CLIP: Multimodal Pre-Training for Composite Error Recognition in CPR Training. IEEE Signal Processing Letters, vol. 31, 211-215. DOI: 10.1109/LSP.2023.3346207. Online publication date: 2024.
  • (2024) CPR-Coach: Recognizing Composite Error Actions Based on Single-Class Training. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 18782-18792. DOI: 10.1109/CVPR52733.2024.01777. Online publication date: 16-Jun-2024.
