Abstract
Recognising human actions in video is a challenging real-world task. Dense trajectories (DT) record motion accurately over time and are rich in dynamic information. However, DT models lack a mechanism to distinguish dominant motions from secondary ones across separable frequency bands and directions. By contrast, deep learning-based methods are promising for this challenge, though they still have limited capacity for handling complex temporal information, not to mention the huge datasets needed to guide training. To take advantage of semantically meaningful, “handcrafted” video features obtained through feature engineering, this study integrates the discrete wavelet transform (DWT) into the DT model to obtain more descriptive human action features. By exploring pre-trained dual-stream CNN-RNN models, learned features can be integrated with the handcrafted ones to satisfy stringent analytical requirements in the spatial-temporal domain. This hybrid feature framework generates efficient Fisher Vectors through a novel Bag of Temporal Features scheme and can encode video events while speeding up action recognition for real-world applications. Evaluation of the design has shown recognition performance superior to existing benchmark systems, and has demonstrated promising applicability and extensibility for challenging real-world human action recognition problems.
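The DWT step described above can be illustrated with a minimal sketch: a single-level wavelet decomposition splits a 1-D trajectory signal (e.g. a trajectory's per-frame horizontal displacements) into a low-frequency approximation band, which carries the dominant motion, and a high-frequency detail band, which carries secondary motion. The Haar basis, the helper name `haar_dwt`, and the example signal are illustrative assumptions, not the paper's exact configuration.

```python
import math

def haar_dwt(signal):
    """Single-level Haar DWT of an even-length 1-D signal.

    Returns (approximation, detail): the low-frequency band
    (dominant motion) and the high-frequency band (secondary
    motion). The Haar basis is an illustrative choice here,
    not necessarily the wavelet used in the paper.
    """
    assert len(signal) % 2 == 0, "signal length must be even"
    s = 1.0 / math.sqrt(2.0)
    # Pairwise sums give the smoothed (low-pass) band;
    # pairwise differences give the fluctuation (high-pass) band.
    approx = [(a + b) * s for a, b in zip(signal[::2], signal[1::2])]
    detail = [(a - b) * s for a, b in zip(signal[::2], signal[1::2])]
    return approx, detail

# Example: per-frame x-displacements of one dense trajectory.
# The steady drift survives in `approx`; frame-to-frame jitter
# is isolated in `detail`.
dx = [1.0, 1.2, 1.1, 0.9, 1.0, 1.1, 0.8, 1.0]
approx, detail = haar_dwt(dx)
```

Repeating the decomposition on the approximation band would yield the multi-level, multi-band separation that the framework exploits before descriptor encoding.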
Acknowledgements
This research is supported by the National Natural Science Foundation of China (NSFC) (61203172); the Sichuan Science and Technology Programs (2019YFH0187, 2020018); and the European Commission (598649-EPP-1-2018-1-FR-EPPKA2-CBHE-JP).
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Zhang, C., Xu, Y., Xu, Z. et al. Hybrid handcrafted and learned feature framework for human action recognition. Appl Intell 52, 12771–12787 (2022). https://doi.org/10.1007/s10489-021-03068-w