
Shuffle-invariant Network for Action Recognition in Videos

Published: 04 March 2022

Abstract

Local key features in videos are important for improving the accuracy of human action recognition. However, most end-to-end methods focus on learning global features from videos, and few works consider enhancing the local information within a feature. In this article, we discuss how to automatically strengthen the discriminative power of the local information in an action feature and thereby improve the accuracy of action recognition. To this end, we assume that each region of a video has a different level of importance for the recognition task, and that this importance does not change when the region locations are shuffled. We therefore propose a novel action recognition method called the shuffle-invariant network. In the proposed method, a shuffled video is generated by cutting each frame into regular regions and randomly permuting them, which augments the input data. The network adopts a multitask framework consisting of one feature backbone and three task branches: local critical feature shuffle-invariant learning, adversarial learning, and action classification. To enhance the local features, the feature response of each region is predicted by a local critical feature learning network. To train this network, an L1-based critical feature shuffle-invariant loss is defined to ensure that the ordered list of region feature responses remains unchanged after the region locations are shuffled. Adversarial learning is then applied to eliminate the noise introduced by the region shuffle. Finally, the action classification branch combines these two tasks to jointly guide the training of the feature backbone and obtain more effective action features. In the testing phase, only the action classification network is applied to identify the action category of the input video. We evaluate the proposed method on the HMDB51 and UCF101 action datasets, and we conduct several ablation experiments to verify the effectiveness of each module. The experimental results show that our approach achieves state-of-the-art performance.
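
To make the augmentation concrete, the following is a minimal PyTorch sketch of the regular region cutting and random permutation described above; it is an illustration under assumptions, not the authors' code. The grid size n, the (T, C, H, W) clip layout, and the name shuffle_regions are all hypothetical.

    import torch

    def shuffle_regions(clip: torch.Tensor, n: int = 2):
        """clip: (T, C, H, W) video tensor; returns (shuffled_clip, perm)."""
        t, c, h, w = clip.shape
        rh, rw = h // n, w // n
        # Cut every frame into an n-by-n grid: (T, C, H, W) -> (n*n, T, C, rh, rw).
        regions = (clip.reshape(t, c, n, rh, n, rw)
                       .permute(2, 4, 0, 1, 3, 5)
                       .reshape(n * n, t, c, rh, rw))
        # One random relocation shared by all frames, so each region's motion
        # stays temporally coherent after the shuffle.
        perm = torch.randperm(n * n)
        regions = regions[perm]
        # Reassemble the relocated regions into full frames.
        shuffled = (regions.reshape(n, n, t, c, rh, rw)
                           .permute(2, 3, 0, 4, 1, 5)
                           .reshape(t, c, h, w))
        return shuffled, perm

Here perm[k] records which original region now occupies grid slot k, which the loss sketch below relies on.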
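
The L1-based critical feature shuffle-invariant loss can then be sketched as follows. This is a hedged reading of the abstract rather than the paper's exact formulation: a hypothetical head critic predicts one criticality response per region, and the loss asks each region to keep its response after relocation, which in particular leaves the ordered response list unchanged (value equality is a stronger condition than the order preservation the abstract describes).

    import torch
    import torch.nn.functional as F

    def shuffle_invariant_l1(critic, feats_orig, feats_shuf, perm):
        """feats_*: (n*n, D) per-region features from the backbone;
        perm[k] = original index of the region now at slot k."""
        r_orig = critic(feats_orig).squeeze(-1)  # (n*n,) responses, original layout
        r_shuf = critic(feats_shuf).squeeze(-1)  # (n*n,) responses after shuffle
        # The response at slot k of the shuffled clip should match the response
        # the same content produced at its original location perm[k].
        return F.l1_loss(r_shuf, r_orig[perm])

With a head such as torch.nn.Linear(D, 1) as the critic, this term can be combined with the classification and adversarial losses to train the backbone jointly.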



Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 3
August 2022, 478 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3505208

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 March 2022
Accepted: 01 September 2021
Revised: 01 July 2021
Received: 01 January 2021
Published in TOMM Volume 18, Issue 3

Author Tags

  1. Action recognition
  2. key region detection
  3. shuffle-invariant network
  4. adversarial learning
  5. critical feature sort loss

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • Natural Science Foundation of China
  • National Key Research and Development Program of China
  • Natural Science Foundation of Fujian Province of China
  • Promotion Program for Young and Middle-aged Teacher in Science and Technology Research of Huaqiao University


Article Metrics

  • Downloads (last 12 months): 132
  • Downloads (last 6 weeks): 12
Reflects downloads up to 12 Dec 2024


Cited By

  • (2024) Transductive classification via patch alignment. AI Communications 37, 1, 37–51. DOI: 10.3233/AIC-220179. Online publication date: 21-Mar-2024.
  • (2024) Towards Long Form Audio-visual Video Understanding. ACM Transactions on Multimedia Computing, Communications, and Applications. DOI: 10.1145/3672079. Online publication date: 7-Jun-2024.
  • (2024) SNIPPET: A Framework for Subjective Evaluation of Visual Explanations Applied to DeepFake Detection. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 8, 1–29. DOI: 10.1145/3665248. Online publication date: 13-Jun-2024.
  • (2024) Multimodal Score Fusion with Sparse Low-rank Bilinear Pooling for Egocentric Hand Action Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 7, 1–22. DOI: 10.1145/3656044. Online publication date: 16-May-2024.
  • (2024) Action Segmentation through Self-Supervised Video Features and Positional-Encoded Embeddings. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 9, 1–23. DOI: 10.1145/3649465. Online publication date: 24-Feb-2024.
  • (2024) Distributed Learning Mechanisms for Anomaly Detection in Privacy-Aware Energy Grid Management Systems. ACM Transactions on Sensor Networks. DOI: 10.1145/3640341. Online publication date: 17-Jan-2024.
  • (2024) Two-stream Multi-level Dynamic Point Transformer for Two-person Interaction Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 5, 1–22. DOI: 10.1145/3639470. Online publication date: 7-Feb-2024.
  • (2024) Pedestrian Attribute Recognition via Spatio-temporal Relationship Learning for Visual Surveillance. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 6, 1–15. DOI: 10.1145/3632624. Online publication date: 8-Mar-2024.
  • (2024) Label-Aware Calibration and Relation-Preserving in Visual Intention Understanding. IEEE Transactions on Image Processing 33, 2627–2638. DOI: 10.1109/TIP.2024.3380250. Online publication date: 2024.
  • (2023) Relation with Free Objects for Action Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 2, 1–19. DOI: 10.1145/3617596. Online publication date: 18-Oct-2023.
