Research Article

Deep Ensemble Learning for Human Action Recognition in Still Images

Published: 01 January 2020

Abstract

Numerous human actions, such as "Phoning," "PlayingGuitar," and "RidingHorse," can be inferred from static cues alone, even when motion in video is available, since a single still image may already explain a particular action sufficiently. In this work, we investigate human action recognition in still images and use deep ensemble learning to automatically decompose the body pose and perceive its background information. First, we construct an end-to-end NCNN-based model by attaching a nonsequential convolutional neural network (NCNN) module to the top of a pretrained model. The nonsequential topology of the NCNN module learns spatial- and channel-wise features separately through parallel branches, which helps improve model performance. Second, to further exploit the advantage of the nonsequential topology, we propose an end-to-end deep ensemble learning based on weight optimization (DELWO) model, which fuses the deep information derived from multiple models automatically from the data. Third, we design a deep ensemble learning based on voting strategy (DELVS) model that pools multiple deep models with weighted coefficients to obtain a better prediction. Moreover, model complexity is reduced by lessening the number of trainable parameters, which mitigates overfitting on small datasets. We conduct experiments on Li's action dataset and on the uncropped and 1.5x-cropped Willow action datasets; the results validate the effectiveness and robustness of the proposed models in mitigating overfitting on small datasets. Finally, our code is open-sourced on GitHub (https://github.com/yxchspring/deep_ensemble_learning) to share the models with the community.
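To make the branch structure concrete, the sketch below gives one plausible reading of the NCNN module in Keras: a depthwise convolution stands in for the spatial-wise branch and a 1x1 convolution for the channel-wise branch, attached to pretrained backbone features. The layer choices, feature shape, and class count are illustrative assumptions, not the paper's exact architecture (the GitHub repository holds the actual code).

```python
from tensorflow.keras import layers, Model

# Backbone features, e.g., the 7x7x512 conv output of a pretrained VGG16.
inputs = layers.Input(shape=(7, 7, 512))

# Spatial-wise branch: depthwise conv filters each channel independently.
spatial = layers.DepthwiseConv2D(3, padding="same", activation="relu")(inputs)

# Channel-wise branch: 1x1 conv mixes information across channels only.
channel = layers.Conv2D(512, 1, activation="relu")(inputs)

# Fuse the parallel branches and classify into 7 action categories (assumed).
merged = layers.Concatenate()([spatial, channel])
pooled = layers.GlobalAveragePooling2D()(merged)
outputs = layers.Dense(7, activation="softmax")(pooled)

ncnn_head = Model(inputs, outputs)
```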
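The two ensemble schemes can likewise be sketched in a few lines: DELVS fuses the base models' class probabilities with weighted coefficients, while DELWO instead learns the fusion weights end-to-end from the data. A minimal NumPy sketch of the weighted voting, with assumed model count, weights, and class count:

```python
import numpy as np

# Per-model class probabilities for a batch: (n_models, n_samples, n_classes).
# Random stand-ins for the outputs of three fine-tuned base models.
rng = np.random.default_rng(0)
probs = rng.random((3, 4, 7))
probs /= probs.sum(axis=-1, keepdims=True)  # normalize to probabilities

# Assumed voting weights for the three base models; in DELWO these
# coefficients would be trainable parameters optimized from the data.
weights = np.array([0.5, 0.3, 0.2])

# Weighted soft vote: fuse the model predictions, then take the argmax.
fused = np.tensordot(weights, probs, axes=1)  # -> (n_samples, n_classes)
labels = fused.argmax(axis=-1)
print(labels)
```

With uniform weights this reduces to plain soft voting; weighted coefficients let the stronger base models dominate the vote.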


Cited By

  • (2021) "Person Reidentification Model Based on Multiattention Modules and Multiscale Residuals," Complexity, vol. 2021. https://doi.org/10.1155/2021/6673461
  • (2021) "Transfer learning with fine tuning for human action recognition from still images," Multimedia Tools and Applications, vol. 80, no. 13, pp. 20547-20578. https://doi.org/10.1007/s11042-021-10753-y
  • (2020) "Prediction of Future Terrorist Activities Using Deep Neural Networks," Complexity, vol. 2020. https://doi.org/10.1155/2020/1373087
  • (2020) "Hybrid Ensemble Pruning Using Coevolution Binary Glowworm Swarm Optimization and Reduce-Error," Complexity, vol. 2020. https://doi.org/10.1155/2020/1329692


Published In

Complexity, Volume 2020 (17147 pages)

Publisher: John Wiley & Sons, Inc., United States

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
