Abstract
Various information streams, such as scene appearance and the estimated motion of the objects involved, can help to characterize actions in videos. Since each modality performs better in different scenarios, complementary features can be combined to achieve results superior to those of the individual streams. As important as defining representative and complementary feature streams is choosing a combination strategy that exploits the strengths of each modality. In this work, we analyze different fusion approaches for combining complementary modalities. To tune the parameters of our fusion methods on the training set, we must reduce overfitting in the individual modalities; otherwise, their nearly 100%-accurate outputs on the training data would not provide a realistic or relevant input for the fusion method. Thus, we analyze an early stopping technique for training the individual networks. Besides reducing overfitting, this technique also reduces the training cost, since it usually requires fewer training epochs. Experiments are conducted on the UCF101 and HMDB51 datasets, two challenging benchmarks for action recognition.
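To make the two ideas in the abstract concrete, the sketch below illustrates patience-based early stopping and a simple weighted late fusion of per-stream class scores. It is a minimal illustration rather than the authors' actual pipeline: the `train_one_epoch`, `validate`, `patience`, and weight `w` names and values are placeholder assumptions.

```python
import copy

import numpy as np


def train_with_early_stopping(model, train_one_epoch, validate,
                              max_epochs=250, patience=5):
    """Patience-based early stopping on the validation loss.

    `train_one_epoch(model)` runs one training epoch and `validate(model)`
    returns the validation loss; both are placeholders for the real
    training loop, as are the `max_epochs` and `patience` values.
    """
    best_loss = float("inf")
    best_model = copy.deepcopy(model)
    epochs_without_improvement = 0
    for _ in range(max_epochs):
        train_one_epoch(model)
        val_loss = validate(model)
        if val_loss < best_loss:
            best_loss = val_loss
            best_model = copy.deepcopy(model)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop early: fewer epochs and less overfitting
    return best_model


def weighted_late_fusion(spatial_scores, temporal_scores, w=0.5):
    """Late fusion of per-class scores from two streams.

    Each argument is an array of shape (n_videos, n_classes). The weight
    `w` would be tuned on the training set, which is only informative if
    the individual streams are not overfitted to near-100% accuracy there.
    """
    fused = w * spatial_scores + (1.0 - w) * temporal_scores
    return fused.argmax(axis=1)  # predicted class index per video
```

The key point is that early stopping returns the model snapshot with the best validation loss, so the per-stream scores fed to the fusion step remain informative rather than saturated.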
The authors are thankful to FAPESP (grants #2017/09160-1 and #2017/12646-3), CNPq (grants #305169/2015-7 and #309330/2018-7), CAPES and FAPEMIG for their financial support, and NVIDIA Corporation for the donation of a GPU as part of the GPU Grant Program.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
de Almeida Maia, H., e Souza, M.R., e Santos, A.C.S., Bobadilla, J.C.M., Vieira, M.B., Pedrini, H. (2022). Early Stopping for Two-Stream Fusion Applied to Action Recognition. In: Bouatouch, K., et al. (eds.) Computer Vision, Imaging and Computer Graphics Theory and Applications. VISIGRAPP 2020. Communications in Computer and Information Science, vol. 1474. Springer, Cham. https://doi.org/10.1007/978-3-030-94893-1_14
DOI: https://doi.org/10.1007/978-3-030-94893-1_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-94892-4
Online ISBN: 978-3-030-94893-1
eBook Packages: Computer Science, Computer Science (R0)