Abstract
Action recognition is a fundamental and challenging task in computer vision. In recent years, optical flow, as auxiliary information to the frames of a video, has been widely applied to action recognition because it captures the motion information in video data. However, existing methods fuse only the classification scores of the two streams; they do not consider the interaction between the image frames and the optical flows. Another important challenge lies in capturing the motion information that is significant for recognizing the action. To overcome these problems, this paper proposes an action recognition model based on a multi-view temporal attention mechanism. Specifically, global temporal attention pooling is first designed to fuse the features of multiple image frames, giving more attention to discriminative frames. Second, considering the complementarity of image frames and optical flow, feature-level multi-view fusion methods are proposed. Experiments on three widely used action recognition benchmark datasets show that our method outperforms existing state-of-the-art methods. In addition, the effectiveness of the proposed method is extensively demonstrated under different factors, such as the temporal attention pooling strategy, multi-view feature fusion and network architecture. The promising experimental results demonstrate that introducing the temporal attention layer and feature-level multi-view fusion methods is effective and overcomes the shortcomings of classical two-stream networks to some extent. The proposed method has the following advantages. First, the temporal attention layer can accurately capture key frames that are more conducive to recognizing actions. Second, two kinds of features, from image frames and optical flows, are combined to make full use of their complementarity. Finally, a variety of fusion methods are employed for feature-level fusion instead of straightforward score fusion.
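The two ideas in the abstract, attention-weighted pooling over frames and fusion at the feature level rather than the score level, can be illustrated with a minimal sketch. It is not the authors' network. The scoring vector `w`, the function names, and the three fusion modes (`concat`, `sum`, `max`) are assumptions chosen for illustration, since the abstract does not specify the attention layer or the fusion operators. Plain Python lists stand in for per-frame CNN features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention_pool(frame_feats, w):
    """Pool T per-frame feature vectors into a single vector.

    Each frame gets a scalar score (dot product with a hypothetical
    learned vector w). Softmax turns the scores into weights summing
    to 1, so higher-scoring frames contribute more to the result.
    """
    scores = [sum(wi * xi for wi, xi in zip(w, f)) for f in frame_feats]
    weights = softmax(scores)
    dim = len(frame_feats[0])
    pooled = [
        sum(a * f[d] for a, f in zip(weights, frame_feats))
        for d in range(dim)
    ]
    return pooled, weights


def fuse_features(rgb_feat, flow_feat, mode="concat"):
    """Feature-level fusion: combine the two views before classification."""
    if mode == "concat":
        return list(rgb_feat) + list(flow_feat)
    if mode == "sum":
        return [r + f for r, f in zip(rgb_feat, flow_feat)]
    if mode == "max":
        return [max(r, f) for r, f in zip(rgb_feat, flow_feat)]
    raise ValueError(f"unknown fusion mode: {mode}")


def fuse_scores(rgb_probs, flow_probs):
    """Classical two-stream late fusion: average the class probabilities."""
    return [(r + f) / 2.0 for r, f in zip(rgb_probs, flow_probs)]


# Toy data: three frames per view, two feature dimensions per frame.
rgb_frames = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1]]
flow_frames = [[0.0, 0.3], [0.7, 0.6], [0.1, 0.0]]
w = [2.0, 2.0]  # hypothetical scoring vector; learned in a real model

rgb_vec, rgb_weights = temporal_attention_pool(rgb_frames, w)
flow_vec, flow_weights = temporal_attention_pool(flow_frames, w)
fused = fuse_features(rgb_vec, flow_vec, mode="concat")
print("attention weights (RGB):", rgb_weights)
print("fused feature length:", len(fused))
```

In this sketch the fused vector would be passed to a single classifier, so the two views interact before the decision is made. `fuse_scores` is included only for contrast: it combines the outputs of two classifiers that never see each other's features.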
Funding
This study is funded in part by the Guangdong Province Science and Technology Plan Projects (2017B010110011), the National Natural Science Foundation of China (No. 62076005, 61906002), the National Natural Science Foundation of Anhui Province (2008085MF191, 2008085QF306, 1908085MF185), the Anhui Key Research and Development Plan (1804a09020101), and the University Synergy Innovation Program of Anhui Province (GXXT-2021-002).
Ethics declarations
Ethical Approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed Consent
Informed consent was not required as no humans or animals were involved.
Conflict of Interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Sun, D., Su, Z., Ding, Z. et al. Action Recognition with a Multi-View Temporal Attention Network. Cogn Comput 14, 1082–1095 (2022). https://doi.org/10.1007/s12559-021-09951-5