Abstract
Temporal action localization is an important task in computer vision. Although many methods have been proposed, precisely predicting the temporal location of action segments remains an open question. Most state-of-the-art methods train action classifiers on video segments pre-determined by action proposals. However, recent work has found that a desirable model should move beyond the segment level and make dense predictions at a fine temporal granularity in order to determine precise temporal boundaries. In this paper, we propose a Frame Segmentation Network (FSN) that places a temporal CNN on top of 2D spatial CNNs. The spatial CNNs abstract semantics in the spatial dimension, while the temporal CNN introduces temporal context information and performs dense prediction. FSN can thus make dense frame-level predictions for a video clip using both spatial and temporal context. Because FSN is trained end-to-end, the model is optimized jointly in the spatial and temporal domains. Experimental results on a public dataset show that FSN achieves superior performance in both frame-level action localization and temporal action localization.
This work was supported by the National Basic Research Program of China (973) under Grant No. 2014CB340303 and the National Natural Science Foundation of China under Grants U1435219, 61402507, and 61572515.
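To make the described architecture concrete, below is a minimal sketch of the FSN idea in Keras: a 2D spatial CNN applied per frame via TimeDistributed, topped by a 1D temporal CNN that produces one prediction per frame. The backbone, layer sizes, clip length, and class count are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch of the FSN idea from the abstract: per-frame spatial
# CNN + 1D temporal CNN for dense frame-level prediction. All hyperparameters
# below are assumptions for illustration.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_FRAMES, H, W, NUM_CLASSES = 32, 112, 112, 20  # assumed clip/label sizes

# Spatial CNN: abstracts per-frame semantics (a small conv stack here; a
# pretrained backbone could be swapped in).
frame_in = layers.Input(shape=(H, W, 3))
x = layers.Conv2D(64, 3, activation="relu", padding="same")(frame_in)
x = layers.MaxPooling2D(4)(x)
x = layers.Conv2D(128, 3, activation="relu", padding="same")(x)
x = layers.GlobalAveragePooling2D()(x)
spatial_cnn = models.Model(frame_in, x, name="spatial_cnn")

# Temporal CNN: 1D convolutions over the sequence of frame features add
# temporal context while padding="same" preserves the frame-level
# resolution needed for dense prediction.
clip_in = layers.Input(shape=(NUM_FRAMES, H, W, 3))
feats = layers.TimeDistributed(spatial_cnn)(clip_in)   # (T, feature_dim)
t = layers.Conv1D(256, 3, activation="relu", padding="same")(feats)
t = layers.Conv1D(256, 3, activation="relu", padding="same")(t)
frame_scores = layers.Conv1D(NUM_CLASSES + 1, 1,       # +1 for background
                             activation="softmax")(t)

fsn = models.Model(clip_in, frame_scores, name="fsn_sketch")
# End-to-end training: a per-frame loss jointly optimizes the spatial and
# temporal stages, as the abstract describes.
fsn.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

The key design point the sketch illustrates is that, unlike segment-level classifiers, the temporal convolutions keep one output per input frame, so action boundaries can be read off directly from the dense per-frame scores.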