Abstract
Various information streams, such as scene appearance and the estimated motion of the objects involved, can help to characterize actions in videos. Since each modality performs better in different scenarios, complementary features can be combined to achieve results superior to those of the individual streams. As important as defining representative and complementary feature streams is choosing a combination strategy that exploits the strengths of each modality. In this work, we analyze different fusion approaches for combining complementary modalities. To tune the parameters of our fusion methods on the training set, we must reduce overfitting in the individual modalities; otherwise, their nearly 100%-accurate outputs on the training data would not provide a realistic or relevant input for the fusion method. Thus, we analyze an early stopping technique for training the individual networks. Besides reducing overfitting, this technique also reduces the training cost, since it usually requires fewer training epochs. Experiments are conducted on the UCF101 and HMDB51 datasets, two challenging benchmarks for action recognition.
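To make the two ideas in the abstract concrete, the sketch below illustrates patience-based early stopping and a simple weighted late fusion of per-stream class scores. It is a minimal illustration rather than the authors' actual pipeline: the `train_one_epoch`, `validate`, `patience`, and weight `w` names and values are placeholder assumptions.

```python
import copy

import numpy as np


def train_with_early_stopping(model, train_one_epoch, validate,
                              max_epochs=250, patience=5):
    """Patience-based early stopping on the validation loss.

    `train_one_epoch(model)` runs one training epoch and `validate(model)`
    returns the validation loss; both are placeholders for the real
    training loop, as are the `max_epochs` and `patience` values.
    """
    best_loss = float("inf")
    best_model = copy.deepcopy(model)
    epochs_without_improvement = 0
    for _ in range(max_epochs):
        train_one_epoch(model)
        val_loss = validate(model)
        if val_loss < best_loss:
            best_loss = val_loss
            best_model = copy.deepcopy(model)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop early: fewer epochs and less overfitting
    return best_model


def weighted_late_fusion(spatial_scores, temporal_scores, w=0.5):
    """Late fusion of per-class scores from two streams.

    Each argument is an array of shape (n_videos, n_classes). The weight
    `w` would be tuned on the training set, which is only informative if
    the individual streams are not overfitted to near-100% accuracy there.
    """
    fused = w * spatial_scores + (1.0 - w) * temporal_scores
    return fused.argmax(axis=1)  # predicted class index per video
```

The key point is that early stopping returns the model snapshot with the best validation loss, so the per-stream scores fed to the fusion step remain informative rather than saturated.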
The authors are thankful to FAPESP (grants #2017/09160-1 and #2017/12646-3), CNPq (grants #305169/2015-7 and #309330/2018-7), CAPES and FAPEMIG for their financial support, and NVIDIA Corporation for the donation of a GPU as part of the GPU Grant Program.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
de Almeida Maia, H., e Souza, M.R., e Santos, A.C.S., Bobadilla, J.C.M., Vieira, M.B., Pedrini, H. (2022). Early Stopping for Two-Stream Fusion Applied to Action Recognition. In: Bouatouch, K., et al. (eds.) Computer Vision, Imaging and Computer Graphics Theory and Applications. VISIGRAPP 2020. Communications in Computer and Information Science, vol. 1474. Springer, Cham. https://doi.org/10.1007/978-3-030-94893-1_14
DOI: https://doi.org/10.1007/978-3-030-94893-1_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-94892-4
Online ISBN: 978-3-030-94893-1
eBook Packages: Computer Science, Computer Science (R0)