Abstract
Human activity recognition (HAR) is important in numerous fields, including medicine, sports, and security. Traditional HAR methods often rely on complex feature extraction from raw input data, while convolutional neural networks (CNNs) are primarily designed for 2D data. This research addresses these challenges by introducing a new model that combines a 3D CNN architecture with an attention layer. The proposed approach leverages both spatial and temporal attributes to improve action detection and to better capture human movement across adjacent frames. A 3D convolution transformer is employed to capture intricate spatial and temporal features, generate multiple data channels from input frames, and optimize performance through regularization and model ensemble techniques. The model achieves accuracies of 98.09% and 99.09% on the Weizmann and UCF101 benchmark datasets, respectively. These results underscore the model's effectiveness in accurately identifying human activities in video-based natural environments.
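As a rough illustration of the idea summarized above, the following NumPy sketch applies a single 3D convolution to a short clip and then lets the resulting time steps attend to one another. It is not the authors' model: the clip size, the random untrained filter, the single attention head without learned projections, and all function names are assumptions made for this example.

```python
import numpy as np


def conv3d_valid(clip, kernel):
    """Naive 'valid' 3D convolution (cross-correlation) over a (T, H, W) clip."""
    t, h, w = clip.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((t - kt + 1, h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                # Each output value mixes a small block of pixels across
                # neighbouring frames, so it carries spatial and temporal context.
                out[i, j, k] = np.sum(clip[i:i + kt, j:j + kh, k:k + kw] * kernel)
    return out


def temporal_attention(features):
    """Scaled dot-product self-attention across time steps; features is (T, D)."""
    d = features.shape[1]
    scores = features @ features.T / np.sqrt(d)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ features, weights


rng = np.random.default_rng(0)
clip = rng.standard_normal((8, 16, 16))   # hypothetical clip: 8 grayscale 16x16 frames
kernel = rng.standard_normal((3, 3, 3))   # one random, untrained 3x3x3 filter

fmap = np.maximum(conv3d_valid(clip, kernel), 0.0)  # ReLU, shape (6, 14, 14)
tokens = fmap.reshape(fmap.shape[0], -1)            # one token per time step, (6, 196)
attended, weights = temporal_attention(tokens)      # (6, 196) and (6, 6)
clip_descriptor = attended.mean(axis=0)             # pooled clip-level feature, (196,)
```

In a trained model, a descriptor of this kind would feed a classifier over activity labels; a practical implementation would use many learned filters and a deep learning framework rather than explicit loops.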
Data availability
The data associated with this work will be provided upon reasonable request.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest regarding this manuscript and that they received no funding for this work.
About this article
Cite this article
Pareek, G., Nigam, S. & Singh, R. Modeling transformer architecture with attention layer for human activity recognition. Neural Comput & Applic 36, 5515–5528 (2024). https://doi.org/10.1007/s00521-023-09362-7
DOI: https://doi.org/10.1007/s00521-023-09362-7