Recognizing human activities in real time is difficult because viewpoint, illumination, background, and color vary across scenes. Deep learning (DL) algorithms are gaining attention in human activity recognition (HAR) because they extract features automatically, in contrast to machine learning (ML) methods that rely on handcrafted features. In this work, we propose an intermediate feature fusion approach for vision-based HAR that employs a two-dimensional convolutional neural network (CNN2D) to extract local features and transfer learning (TL) with a pretrained residual neural network (ResNet50) to extract global features. The extracted features are fused by a concatenation layer before the activities are classified. We focus on two categories of activities: actions (single person) and interactions (human–human and human–object). The proposed approach recognizes human activities in constrained as well as unconstrained environments with multiple viewpoints. It is evaluated on five benchmark vision datasets, namely KTH, Weizmann, IXMAS, the CASIA action database, and MSR Daily Activity 3D, in terms of accuracy and confusion matrix. The framework recognizes complex activities with better accuracy than single-person activities, as seen on the MSR Daily Activity 3D and CASIA datasets, where it attains its highest accuracies of 99.94% and 99.76%, respectively. A comparative analysis shows that the proposed model outperforms existing state-of-the-art methods in terms of accuracy.
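The fusion step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: random arrays stand in for the outputs of the CNN2D and ResNet50 branches, and the local feature size (128), the number of classes (6), the untrained dense classifier, and all function names are assumptions made for illustration. The 2048-dimensional global vector corresponds to the pooled output of a standard ResNet50. The sketch concatenates the two feature sets per sample, classifies the fused vector with a softmax layer, and computes the two evaluation quantities named above, accuracy and the confusion matrix.

```python
import numpy as np


def fuse_features(local_feats, global_feats):
    """Intermediate fusion: concatenate per-sample vectors along the feature axis."""
    if local_feats.shape[0] != global_feats.shape[0]:
        raise ValueError("both branches must describe the same samples")
    return np.concatenate([local_feats, global_feats], axis=1)


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def classify(fused, weights, bias):
    """Single dense layer followed by softmax (hypothetical classifier head)."""
    return softmax(fused @ weights + bias)


def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the true class, columns the predicted class."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm


rng = np.random.default_rng(0)
batch, num_classes = 4, 6  # assumed values, chosen only for the demo

# Random stand-ins for the two branches (no real network is run here).
local_feats = rng.normal(size=(batch, 128))    # assumed CNN2D feature size
global_feats = rng.normal(size=(batch, 2048))  # pooled ResNet50 feature size

fused = fuse_features(local_feats, global_feats)

# Untrained, randomly initialised classifier weights (assumption).
weights = rng.normal(scale=0.01, size=(fused.shape[1], num_classes))
bias = np.zeros(num_classes)

probs = classify(fused, weights, bias)
y_pred = probs.argmax(axis=1)
y_true = np.array([0, 1, 2, 3])  # made-up labels for the demo

cm = confusion_matrix(y_true, y_pred, num_classes)
accuracy = np.trace(cm) / cm.sum()  # correct predictions lie on the diagonal
print(fused.shape, cm.sum(), accuracy)
```

Because the weights are untrained, the printed accuracy carries no meaning; the sketch only shows how the fused vector is formed and how the reported metrics are derived from predictions.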
Data Availability
The data associated with this work will be provided upon reasonable request.
Ethics declarations
Conflict of Interest
The authors declare that there is no conflict of interest regarding this manuscript and that no funding was received for this research.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Garg, A., Nigam, S. & Singh, R. An Intermediate Deep Feature Fusion Approach for Understanding Human Activities from Image Sequences. SN COMPUT. SCI. 5, 1037 (2024). https://doi.org/10.1007/s42979-024-03345-8