Abstract
Human actions are inherently structured patterns of body movements. We explore ensembles of hierarchical spatio-temporal trees, discovered directly from training data, to model these structures for action recognition and spatial localization. Discovering tree structures that are both frequent and discriminative is challenging due to the exponential search space, particularly if one allows partial matching. We address this by first building a concise action word vocabulary via discriminative clustering of hierarchical space-time segments, a two-level video representation that captures both the static and non-static relevant space-time segments of a video. Using this vocabulary, we then apply tree mining, followed by tree clustering and ranking, to select a compact set of discriminative tree patterns. Our experiments show that these tree patterns, alone or in combination with shorter patterns (action words and pairwise patterns), achieve promising performance on three challenging datasets: UCF Sports, HighFive, and Hollywood3D. Moreover, we perform cross-dataset validation, using trees learned on HighFive to recognize the same actions in Hollywood3D, and using trees learned on UCF Sports to recognize and localize similar actions in JHMDB. The results demonstrate the cross-dataset generalization potential of the trees our approach discovers.
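To make the abstract's pipeline concrete, below is a minimal illustrative sketch in Python, not the authors' implementation. Plain k-means stands in for the paper's discriminative clustering of segment descriptors into action words, counting parent-child word pairs over per-video segment trees stands in as a depth-one proxy for full tree mining with partial matching, and an in-class versus out-of-class frequency score stands in for the paper's tree ranking. All function names, the 16-dimensional descriptors, and the random toy data are assumptions for illustration only.

from collections import Counter
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors, n_words=8, seed=0):
    # Stand-in for discriminative clustering: plain k-means quantization
    # of per-segment descriptors into an "action word" vocabulary.
    return KMeans(n_clusters=n_words, n_init=10, random_state=seed).fit(descriptors)

def mine_pair_patterns(trees, vocab):
    # Count parent->child action-word pairs over all segment trees; a
    # depth-one proxy for mining full (partially matched) tree patterns.
    # Each tree is given as a list of (parent_descriptor, child_descriptor) edges.
    counts = Counter()
    for edges in trees:
        for parent, child in edges:
            p = int(vocab.predict(parent.reshape(1, -1))[0])
            c = int(vocab.predict(child.reshape(1, -1))[0])
            counts[(p, c)] += 1
    return counts

def rank_patterns(pos_counts, neg_counts, top_k=5):
    # Score each mined pattern by in-class minus out-of-class frequency
    # and keep the top_k most discriminative ones.
    scores = {pat: n - neg_counts.get(pat, 0) for pat, n in pos_counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy usage with random stand-in descriptors (16-D is an arbitrary choice).
rng = np.random.default_rng(0)
vocab = build_vocabulary(rng.normal(size=(200, 16)))
pos_trees = [[(rng.normal(size=16), rng.normal(size=16)) for _ in range(5)]
             for _ in range(3)]
neg_trees = [[(rng.normal(size=16), rng.normal(size=16)) for _ in range(5)]
             for _ in range(3)]
print(rank_patterns(mine_pair_patterns(pos_trees, vocab),
                    mine_pair_patterns(neg_trees, vocab)))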
Notes
1. To avoid notation clutter, we omit the action class label \(a\) for \(\mathcal{T}\), \(\mathbf{w}\), \(\Phi\), \(\phi\), and \(\varphi\).
2. Note that we use notation \(\mathcal{T}\) to denote discovered tree structures of human actions, and notation \(\mathbf{T}\) to denote image segment trees from the hierarchical segmentation of video frames.
3. We did not find prior work reporting action classification and localization results for these individual action classes for comparison.
Acknowledgements
This work was supported in part through a Google Faculty Research Award and by US NSF grants 0855065, 0910908, and 1029430.
Additional information
Communicated by Ivan Laptev and Cordelia Schmid.
Cite this article
Ma, S., Zhang, J., Sclaroff, S. et al. Space-Time Tree Ensemble for Action Recognition and Localization. Int J Comput Vis 126, 314–332 (2018). https://doi.org/10.1007/s11263-016-0980-8