Abstract
Human actions are inherently structured patterns of body movements. We explore ensembles of hierarchical spatio-temporal trees, discovered directly from training data, to model these structures for action recognition and spatial localization. Discovering tree structures that are both frequent and discriminative is challenging due to the exponential search space, particularly if one allows partial matching. We address this by first building a concise action word vocabulary via discriminative clustering of hierarchical space-time segments, a two-level video representation that captures both the static and non-static relevant space-time segments of a video. Using this vocabulary, we then apply tree mining, followed by tree clustering and ranking, to select a compact set of discriminative tree patterns. Our experiments show that these tree patterns, alone or in combination with shorter patterns (action words and pairwise patterns), achieve promising performance on three challenging datasets: UCF Sports, HighFive, and Hollywood3D. Moreover, we perform cross-dataset validation, using trees learned on HighFive to recognize the same actions in Hollywood3D, and using trees learned on UCF Sports to recognize and localize similar actions in JHMDB. The results demonstrate the cross-dataset generalization potential of the trees our approach discovers.
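To make the abstract's pipeline concrete, below is a minimal illustrative sketch in Python, not the authors' implementation. Plain k-means stands in for the paper's discriminative clustering of segment descriptors into action words, counting parent-child word pairs over per-video segment trees stands in as a depth-one proxy for full tree mining with partial matching, and an in-class versus out-of-class frequency score stands in for the paper's tree ranking. All function names, the 16-dimensional descriptors, and the random toy data are assumptions for illustration only.

from collections import Counter
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors, n_words=8, seed=0):
    # Stand-in for discriminative clustering: plain k-means quantization
    # of per-segment descriptors into an "action word" vocabulary.
    return KMeans(n_clusters=n_words, n_init=10, random_state=seed).fit(descriptors)

def mine_pair_patterns(trees, vocab):
    # Count parent->child action-word pairs over all segment trees; a
    # depth-one proxy for mining full (partially matched) tree patterns.
    # Each tree is given as a list of (parent_descriptor, child_descriptor) edges.
    counts = Counter()
    for edges in trees:
        for parent, child in edges:
            p = int(vocab.predict(parent.reshape(1, -1))[0])
            c = int(vocab.predict(child.reshape(1, -1))[0])
            counts[(p, c)] += 1
    return counts

def rank_patterns(pos_counts, neg_counts, top_k=5):
    # Score each mined pattern by in-class minus out-of-class frequency
    # and keep the top_k most discriminative ones.
    scores = {pat: n - neg_counts.get(pat, 0) for pat, n in pos_counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy usage with random stand-in descriptors (16-D is an arbitrary choice).
rng = np.random.default_rng(0)
vocab = build_vocabulary(rng.normal(size=(200, 16)))
pos_trees = [[(rng.normal(size=16), rng.normal(size=16)) for _ in range(5)]
             for _ in range(3)]
neg_trees = [[(rng.normal(size=16), rng.normal(size=16)) for _ in range(5)]
             for _ in range(3)]
print(rank_patterns(mine_pair_patterns(pos_trees, vocab),
                    mine_pair_patterns(neg_trees, vocab)))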
Notes
1. To avoid notation clutter, we omit the action class label \(a\) for \(\mathcal{T}\), \(\mathbf{w}\), \(\Phi\), \(\phi\), and \(\varphi\).
2. Note that we use notation \(\mathcal{T}\) to denote discovered tree structures of human actions, and notation \(\mathbf{T}\) to denote image segment trees from the hierarchical segmentation of video frames.
3. We did not find prior work reporting action classification and localization results for these individual action classes for comparison.
Acknowledgements
This work was supported in part through a Google Faculty Research Award and by US NSF grants 0855065, 0910908, and 1029430.
Additional information
Communicated by Ivan Laptev and Cordelia Schmid.
Cite this article
Ma, S., Zhang, J., Sclaroff, S. et al. Space-Time Tree Ensemble for Action Recognition and Localization. Int J Comput Vis 126, 314–332 (2018). https://doi.org/10.1007/s11263-016-0980-8