
A novel hierarchical Bag-of-Words model for compact action representation

Published: 22 January 2016

Abstract

Bag-of-Words (BoW) histograms of local space-time features are popular for action representation due to their compactness and robustness. However, their discriminative ability is limited, since they depend only on the occurrence statistics of local features. Alternative models such as the Vector of Locally Aggregated Descriptors (VLAD) and Fisher Vectors (FV) capture more information by aggregating high-dimensional residual vectors, but the resulting representations suffer from high dimensionality. To solve this problem, we propose to compress residual vectors into low-dimensional residual histograms by simple but efficient BoW quantization. To compensate for the information loss of this quantization, we iteratively collect higher-order residual vectors to produce higher-order residual histograms. Concatenating these histograms yields a hierarchical BoW (HBoW) model that is both compact and informative. In experiments, HBoW is evaluated on four benchmark datasets: HMDB51, Olympic Sports, UCF YouTube and Hollywood2. Experimental results show that HBoW yields a much more compact action representation than VLAD and FV without sacrificing recognition accuracy, and comparisons with state-of-the-art works further confirm its superiority.
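The core idea of the abstract — quantize local descriptors into a BoW histogram, then repeatedly quantize the resulting residual vectors with further codebooks and concatenate the per-level histograms — can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function names and the brute-force nearest-codeword search are hypothetical, and it assumes per-level codebooks (e.g. k-means centers) have already been learned.

```python
import numpy as np

def nearest(codebook, vecs):
    # Index of the nearest codeword for each vector (brute-force L2 search).
    dists = ((vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def hbow(descriptors, codebooks):
    """Hierarchical BoW sketch: quantize the descriptors into a histogram,
    then feed the residual vectors (descriptor minus assigned codeword)
    to the next-level codebook, and concatenate all level histograms."""
    hists = []
    vecs = descriptors
    for cb in codebooks:
        idx = nearest(cb, vecs)
        hist = np.bincount(idx, minlength=len(cb)).astype(float)
        hist /= max(hist.sum(), 1.0)   # L1-normalize each level's histogram
        hists.append(hist)
        vecs = vecs - cb[idx]          # residuals become next level's input
    return np.concatenate(hists)
```

With L levels of codebook sizes k_1, …, k_L, the final representation has only k_1 + … + k_L dimensions, in contrast to VLAD/FV, whose dimensionality grows with the product of codebook size and descriptor dimension.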




Published In

Neurocomputing, Volume 174, Issue PB
January 2016
610 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands


Author Tags

  1. Action representation
  2. Bag-of-Words
  3. Fisher Vectors
  4. Vector of Locally Aggregated Descriptors

Qualifiers

  • Research-article
