
Shuffle-invariant Network for Action Recognition in Videos

Published: 04 March 2022

Abstract

Local key features in videos are important for improving the accuracy of human action recognition. However, most end-to-end methods focus on learning global features from videos, and few works consider enhancing the local information within a feature. In this article, we discuss how to automatically strengthen the discriminative power of the local information in an action feature and thereby improve the accuracy of action recognition. To this end, we assume that each region of a video has a different level of importance for the recognition task, and that this importance does not change when the region locations are shuffled. We therefore propose a novel action recognition method called the shuffle-invariant network. In the proposed method, a shuffled video is generated by cutting each frame into regular regions and randomly permuting them, which augments the input data. The network adopts a multitask framework consisting of one feature backbone and three task branches: local critical feature shuffle-invariant learning, adversarial learning, and action classification. To enhance the local features, the feature response of each region is predicted by a local critical feature learning network. To train this network, an L1-based critical feature shuffle-invariant loss is defined to ensure that the ordered list of region feature responses remains unchanged after the region locations are shuffled. Adversarial learning is then applied to eliminate the noise introduced by the region shuffle. Finally, the action classification branch combines these two tasks to jointly guide the training of the feature backbone and obtain more effective action features. In the testing phase, only the action classification network is applied to identify the action category of the input video. We evaluate the proposed method on the HMDB51 and UCF101 action datasets, and we conduct several ablation experiments to verify the effectiveness of each module. The experimental results show that our approach achieves state-of-the-art performance.
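
To make the augmentation concrete, the following is a minimal PyTorch sketch of the regular region cutting and random permutation described above; it is an illustration under assumptions, not the authors' code. The grid size n, the (T, C, H, W) clip layout, and the name shuffle_regions are all hypothetical.

    import torch

    def shuffle_regions(clip: torch.Tensor, n: int = 2):
        """clip: (T, C, H, W) video tensor; returns (shuffled_clip, perm)."""
        t, c, h, w = clip.shape
        rh, rw = h // n, w // n
        # Cut every frame into an n-by-n grid: (T, C, H, W) -> (n*n, T, C, rh, rw).
        regions = (clip.reshape(t, c, n, rh, n, rw)
                       .permute(2, 4, 0, 1, 3, 5)
                       .reshape(n * n, t, c, rh, rw))
        # One random relocation shared by all frames, so each region's motion
        # stays temporally coherent after the shuffle.
        perm = torch.randperm(n * n)
        regions = regions[perm]
        # Reassemble the relocated regions into full frames.
        shuffled = (regions.reshape(n, n, t, c, rh, rw)
                           .permute(2, 3, 0, 4, 1, 5)
                           .reshape(t, c, h, w))
        return shuffled, perm

Here perm[k] records which original region now occupies grid slot k, which the loss sketch below relies on.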
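
The L1-based critical feature shuffle-invariant loss can then be sketched as follows. This is a hedged reading of the abstract rather than the paper's exact formulation: a hypothetical head critic predicts one criticality response per region, and the loss asks each region to keep its response after relocation, which in particular leaves the ordered response list unchanged (value equality is a stronger condition than the order preservation the abstract describes).

    import torch
    import torch.nn.functional as F

    def shuffle_invariant_l1(critic, feats_orig, feats_shuf, perm):
        """feats_*: (n*n, D) per-region features from the backbone;
        perm[k] = original index of the region now at slot k."""
        r_orig = critic(feats_orig).squeeze(-1)  # (n*n,) responses, original layout
        r_shuf = critic(feats_shuf).squeeze(-1)  # (n*n,) responses after shuffle
        # The response at slot k of the shuffled clip should match the response
        # the same content produced at its original location perm[k].
        return F.l1_loss(r_shuf, r_orig[perm])

With a head such as torch.nn.Linear(D, 1) as the critic, this term can be combined with the classification and adversarial losses to train the backbone jointly.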



Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 3
August 2022, 478 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3505208

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 March 2022
Accepted: 01 September 2021
Revised: 01 July 2021
Received: 01 January 2021
Published in TOMM Volume 18, Issue 3

Author Tags

  1. Action recognition
  2. key region detection
  3. shuffle-invariant network
  4. adversarial learning
  5. critical feature sort loss

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • Natural Science Foundation of China
  • National Key Research and Development Program of China
  • Natural Science Foundation of Fujian Province of China
  • Promotion Program for Young and Middle-aged Teacher in Science and Technology Research of Huaqiao University


Article Metrics

  • Downloads (last 12 months): 132
  • Downloads (last 6 weeks): 12
Reflects downloads up to 12 Dec 2024


Cited By

  • (2024) Transductive classification via patch alignment. AI Communications 37, 1, 37–51. DOI: 10.3233/AIC-220179. Online publication date: 21-Mar-2024.
  • (2024) Towards Long Form Audio-visual Video Understanding. ACM Transactions on Multimedia Computing, Communications, and Applications. DOI: 10.1145/3672079. Online publication date: 7-Jun-2024.
  • (2024) SNIPPET: A Framework for Subjective Evaluation of Visual Explanations Applied to DeepFake Detection. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 8, 1–29. DOI: 10.1145/3665248. Online publication date: 13-Jun-2024.
  • (2024) Multimodal Score Fusion with Sparse Low-rank Bilinear Pooling for Egocentric Hand Action Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 7, 1–22. DOI: 10.1145/3656044. Online publication date: 16-May-2024.
  • (2024) Action Segmentation through Self-Supervised Video Features and Positional-Encoded Embeddings. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 9, 1–23. DOI: 10.1145/3649465. Online publication date: 24-Feb-2024.
  • (2024) Distributed Learning Mechanisms for Anomaly Detection in Privacy-Aware Energy Grid Management Systems. ACM Transactions on Sensor Networks. DOI: 10.1145/3640341. Online publication date: 17-Jan-2024.
  • (2024) Two-stream Multi-level Dynamic Point Transformer for Two-person Interaction Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 5, 1–22. DOI: 10.1145/3639470. Online publication date: 7-Feb-2024.
  • (2024) Pedestrian Attribute Recognition via Spatio-temporal Relationship Learning for Visual Surveillance. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 6, 1–15. DOI: 10.1145/3632624. Online publication date: 8-Mar-2024.
  • (2024) Label-Aware Calibration and Relation-Preserving in Visual Intention Understanding. IEEE Transactions on Image Processing 33, 2627–2638. DOI: 10.1109/TIP.2024.3380250. Online publication date: 2024.
  • (2023) Relation with Free Objects for Action Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 2, 1–19. DOI: 10.1145/3617596. Online publication date: 18-Oct-2023.
