Learning motion and content-dependent features with convolutions for action recognition

Published: 01 November 2016

Abstract

A variety of recognition architectures based on deep convolutional neural networks have been devised for labeling videos of human motion with action labels. So far, however, most of these works do not properly handle the temporal dynamics encoded in multiple contiguous frames, which is what distinguishes action recognition from other recognition tasks. This paper develops a temporal extension of convolutional neural networks that exploits motion-dependent features for recognizing human actions in video. Our approach differs from other recent attempts in that it uses multiplicative interactions between convolutional outputs to describe motion information across contiguous frames. Interestingly, a representation of image content also arises while the model extracts motion patterns, so the model effectively incorporates both motion and content when analyzing video. Additional theoretical analysis proves that motion-dependent and content-dependent features arise simultaneously from the developed architecture, whereas previous works mostly deal with the two separately. Our architecture is trained and evaluated on the standard video action benchmarks KTH and UCF101, where it matches the state of the art and has distinct advantages over previous attempts to use deep convolutional architectures for action recognition.
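
The idea of multiplicative interactions between convolutional outputs can be illustrated with a minimal NumPy sketch. This is not the architecture from the paper, whose layer layout and training procedure are not given in the abstract. It only assumes that each of two contiguous frames is filtered by its own bank of kernels and that the paired responses are multiplied element-wise. The function names, kernel sizes, and random filters are hypothetical choices made for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv2d_valid(frame, kernel):
    """Valid-mode 2-D cross-correlation of one frame with one kernel."""
    # windows has shape (H - kh + 1, W - kw + 1, kh, kw)
    windows = sliding_window_view(frame, kernel.shape)
    return np.einsum("ijkl,kl->ij", windows, kernel)


def multiplicative_features(frame_a, frame_b, kernels_a, kernels_b):
    """Element-wise product of paired filter responses from two frames.

    Hypothetical helper: returns one feature map per (kernel_a, kernel_b) pair.
    """
    return np.stack([
        conv2d_valid(frame_a, ka) * conv2d_valid(frame_b, kb)
        for ka, kb in zip(kernels_a, kernels_b)
    ])


rng = np.random.default_rng(0)
frame_t = rng.standard_normal((8, 8))          # stand-in for frame t
frame_t1 = np.roll(frame_t, shift=1, axis=1)   # same content shifted by one pixel
kernels_a = rng.standard_normal((4, 3, 3))     # assumed filter bank for frame t
kernels_b = rng.standard_normal((4, 3, 3))     # assumed filter bank for frame t+1

features = multiplicative_features(frame_t, frame_t1, kernels_a, kernels_b)
print(features.shape)  # (4, 6, 6)

# With identical frames and identical kernels the product reduces to a
# squared filter response, which depends only on the image content.
same = multiplicative_features(frame_t, frame_t, kernels_a, kernels_a)
```

In this toy setting the product responds to how the two frames relate to each other, while the special case of identical inputs and filters yields squared responses that carry content alone. That gives an informal intuition for how motion and content information could come out of the same multiplicative unit, under the assumptions stated above.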

    Information & Contributors

    Information

    Published In

    Multimedia Tools and Applications  Volume 75, Issue 21
    November 2016
    987 pages

    Publisher

    Kluwer Academic Publishers

    United States

    Publication History

    Published: 01 November 2016

    Author Tags

    1. Action recognition
    2. Convolutional neural networks
    3. Deep learning
    4. Multiplicative interactions
    5. Spatiotemporal

    Qualifiers

    • Article

    Cited By

    • (2021) Effect of Basicity on the Microstructure of Sinter and Its Application Based on Deep Learning. Computational Intelligence and Neuroscience, vol. 2021. doi:10.1155/2021/1082834. Online publication date: 1-Jan-2021
    • (2019) Novel multi-convolutional neural network fusion approach for smile recognition. Multimedia Tools and Applications 78:12 (15887-15907). doi:10.1007/s11042-018-6945-x. Online publication date: 1-Jun-2019
    • (2018) A discriminative structural model for joint segmentation and recognition of human actions. Multimedia Tools and Applications 77:24 (31627-31645). doi:10.1007/s11042-018-6189-9. Online publication date: 1-Dec-2018
    • (2018) Extracting hierarchical spatial and temporal features for human action recognition. Multimedia Tools and Applications 77:13 (16053-16068). doi:10.1007/s11042-017-5179-7. Online publication date: 1-Jul-2018
    • (2017) Learning correlations for human action recognition in videos. Multimedia Tools and Applications 76:18 (18891-18913). doi:10.1007/s11042-017-4416-4. Online publication date: 1-Sep-2017
