Learning motion and content-dependent features with convolutions for action recognition

Authors:

Gelan Yang

Multimedia Tools and Applications, Volume 75, Issue 21

Pages 13023 - 13039

https://doi.org/10.1007/s11042-015-2550-4

Published: 01 November 2016

Abstract

A variety of recognition architectures based on deep convolutional neural networks have been devised for labeling videos containing human motion with action labels. However, most existing works cannot properly handle the temporal dynamics encoded in multiple contiguous frames, which is what distinguishes action recognition from other recognition tasks. This paper develops a temporal extension of convolutional neural networks to exploit motion-dependent features for recognizing human action in video. Our approach differs from other recent attempts in that it uses multiplicative interactions between convolutional outputs to describe motion information across contiguous frames. Interestingly, a representation of image content arises while the model extracts motion patterns, which lets it incorporate both motion and content when analyzing video. Additional theoretical analysis proves that motion- and content-dependent features arise simultaneously from the developed architecture, whereas previous works mostly deal with the two separately. Our architecture is trained and evaluated on the standard video action benchmarks KTH and UCF101, where it matches the state of the art and has distinct advantages over previous attempts to use deep convolutional architectures for action recognition.
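The abstract gives no formulas, so the following is only a minimal sketch of what "multiplicative interactions between convolutional outputs" across two contiguous frames could look like. Everything named here is an assumption made for illustration: the helper names `conv2d_valid` and `multiplicative_features`, the random filter banks, the 5x5 filter size, and the pairing of one filter per frame are not taken from the paper, and the paper's actual architecture may differ.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """'Valid' 2-D cross-correlation of one grayscale image with one kernel."""
    kh, kw = kernel.shape
    # windows has shape (H - kh + 1, W - kw + 1, kh, kw)
    windows = np.lib.stride_tricks.sliding_window_view(image, (kh, kw))
    return np.einsum("ijkl,kl->ij", windows, kernel)


def multiplicative_features(frame_t, frame_t1, filters_a, filters_b):
    """Element-wise product of paired filter responses from two frames.

    Hypothetical illustration: filters_a[k] is applied to the first frame,
    filters_b[k] to the next frame, and the two response maps are multiplied.
    """
    maps = [
        conv2d_valid(frame_t, ka) * conv2d_valid(frame_t1, kb)
        for ka, kb in zip(filters_a, filters_b)
    ]
    return np.stack(maps)


rng = np.random.default_rng(0)
frame_t = rng.standard_normal((16, 16))
# Assumed toy "motion": the same content shifted one pixel to the right.
frame_t1 = np.roll(frame_t, shift=1, axis=1)
filters_a = rng.standard_normal((4, 5, 5))  # stand-ins for learned filters
filters_b = rng.standard_normal((4, 5, 5))

features = multiplicative_features(frame_t, frame_t1, filters_a, filters_b)
print(features.shape)  # (4, 12, 12)
```

In this sketch, passing the same frame and the same filter bank for both inputs reduces each map to a squared filter response, which depends only on image content. That is one simple way to see how content information could enter a product of this kind; whether it corresponds to the paper's theoretical analysis is not something the abstract confirms.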



Index Terms

Computing methodologies > Machine learning > Machine learning approaches

Published In

Multimedia Tools and Applications Volume 75, Issue 21

November 2016

987 pages

ISSN:1380-7501

Copyright © 2016 Springer Science+Business Media New York.

Publisher

Kluwer Academic Publishers

United States

Cited By

Zhi J, Li J, Wang J, Jiang T, Hua Z (2021) Effect of Basicity on the Microstructure of Sinter and Its Application Based on Deep Learning. Computational Intelligence and Neuroscience 2021. Online publication date: 1-Jan-2021
https://dl.acm.org/doi/10.1155/2021/1082834
Chen J, Jin Y, Akram M, Li K, Chen E (2019) Novel multi-convolutional neural network fusion approach for smile recognition. Multimedia Tools and Applications 78(12):15887-15907. Online publication date: 1-Jun-2019
https://dl.acm.org/doi/10.1007/s11042-018-6945-x
Liu C, Hou J, Wu X, Jia Y (2018) A discriminative structural model for joint segmentation and recognition of human actions. Multimedia Tools and Applications 77(24):31627-31645. Online publication date: 1-Dec-2018
https://dl.acm.org/doi/10.1007/s11042-018-6189-9
Zhang K, Zhang L (2018) Extracting hierarchical spatial and temporal features for human action recognition. Multimedia Tools and Applications 77(13):16053-16068. Online publication date: 1-Jul-2018
https://dl.acm.org/doi/10.1007/s11042-017-5179-7
Yi Y, Wang H, Zhang B (2017) Learning correlations for human action recognition in videos. Multimedia Tools and Applications 76(18):18891-18913. Online publication date: 1-Sep-2017
https://dl.acm.org/doi/10.1007/s11042-017-4416-4
