
A novel hierarchical Bag-of-Words model for compact action representation

Published: 22 January 2016

Abstract

Bag-of-Words (BoW) histograms of local space-time features are popular for action representation due to their compactness and robustness. However, their discriminative ability is limited, since they depend only on the occurrence statistics of local features. Alternative models such as the Vector of Locally Aggregated Descriptors (VLAD) and Fisher Vectors (FV) capture more information by aggregating high-dimensional residual vectors, but the resulting representations suffer from high dimensionality. To solve this problem, we propose to compress residual vectors into low-dimensional residual histograms by simple but efficient BoW quantization. To compensate for the information loss of this quantization, we iteratively collect higher-order residual vectors to produce higher-order residual histograms. Concatenating these histograms yields a hierarchical BoW (HBoW) model that is both compact and informative. In experiments, HBoW is evaluated on four benchmark datasets: HMDB51, Olympic Sports, UCF YouTube and Hollywood2. Experimental results show that HBoW yields a much more compact action representation than VLAD and FV without sacrificing recognition accuracy, and comparisons with state-of-the-art works further confirm its superiority.
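The core idea of the abstract — quantize local descriptors into a BoW histogram, then repeatedly quantize the resulting residual vectors with further codebooks and concatenate the per-level histograms — can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function names and the brute-force nearest-codeword search are hypothetical, and it assumes per-level codebooks (e.g. k-means centers) have already been learned.

```python
import numpy as np

def nearest(codebook, vecs):
    # Index of the nearest codeword for each vector (brute-force L2 search).
    dists = ((vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def hbow(descriptors, codebooks):
    """Hierarchical BoW sketch: quantize the descriptors into a histogram,
    then feed the residual vectors (descriptor minus assigned codeword)
    to the next-level codebook, and concatenate all level histograms."""
    hists = []
    vecs = descriptors
    for cb in codebooks:
        idx = nearest(cb, vecs)
        hist = np.bincount(idx, minlength=len(cb)).astype(float)
        hist /= max(hist.sum(), 1.0)   # L1-normalize each level's histogram
        hists.append(hist)
        vecs = vecs - cb[idx]          # residuals become next level's input
    return np.concatenate(hists)
```

With L levels of codebook sizes k_1, …, k_L, the final representation has only k_1 + … + k_L dimensions, in contrast to VLAD/FV, whose dimensionality grows with the product of codebook size and descriptor dimension.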




Published In

Neurocomputing, Volume 174, Issue PB
January 2016
610 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands


Author Tags

  1. Action representation
  2. Bag-of-Words
  3. Fisher Vectors
  4. Vector of Locally Aggregated Descriptors

Qualifiers

  • Research-article
