Abstract
Action recognition has been studied for many years. Recent methods based on 3D CNNs (C3D, I3D, R(2+1)D) achieve high accuracy, but they are hard to train and time-consuming, both because the network architecture must extract spatial–temporal features and because action datasets are very large. Because 2D CNNs have pre-trained models that are accurate and fast at object recognition, another approach fine-tunes a 2D CNN together with a recurrent neural network (RNN), a Long Short-Term Memory (LSTM) network, or another network that can extract temporal features. This approach is faster, but fine-tuning performs poorly and accuracy drops significantly. This research therefore distills a high-accuracy 3D CNN into a 2D CNN to produce a strong pre-trained model for action recognition, and combines it with an attention-mechanism LSTM, so that fine-tuning on other action datasets is faster and reaches accuracy close to that of the 3D CNN.
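The two ingredients named in the abstract, distillation from a teacher to a student and temporal attention over per-frame LSTM outputs, can be illustrated with a minimal sketch. The abstract does not state which distillation loss or attention scoring the authors use, so the code below assumes the common soft-target formulation (KL divergence between temperature-softened class distributions) and a plain softmax-weighted sum over time steps. The function names, the temperature value, and the input shapes are all hypothetical and chosen only for illustration.

```python
import math


def log_softmax(scores, temperature=1.0):
    """Numerically stable log-softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    log_sum = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_sum for s in scaled]


def distillation_loss(teacher_logits, student_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened class distributions.

    Assumption: soft-target distillation; the source does not state its loss.
    The teacher would be the 3D CNN and the student the 2D CNN.
    """
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl


def temporal_attention_pool(hidden_states, scores):
    """Collapse per-frame hidden states into one vector.

    hidden_states: one feature vector per time step (e.g. LSTM outputs).
    scores: one unnormalized attention score per time step (assumed given).
    Returns the attention-weighted sum and the attention weights.
    """
    weights = [math.exp(lw) for lw in log_softmax(scores)]
    dim = len(hidden_states[0])
    pooled = [
        sum(w * h[d] for w, h in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return pooled, weights
```

In a full model the attention scores would be produced by a small learned layer over the hidden states, and the distillation term would typically be combined with an ordinary classification loss; both choices are left out here because the abstract does not describe them.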
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhu, SJ., Lin, CR., Lin, WT., Chen, JC. (2023). A Distilled 2D CNN-LSTM Framework with Temporal Attention Mechanism for Action Recognition. In: Nguyen, N.T., et al. Recent Challenges in Intelligent Information and Database Systems. ACIIDS 2023. Communications in Computer and Information Science, vol 1863. Springer, Cham. https://doi.org/10.1007/978-3-031-42430-4_26
DOI: https://doi.org/10.1007/978-3-031-42430-4_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-42429-8
Online ISBN: 978-3-031-42430-4
eBook Packages: Computer Science, Computer Science (R0)