
Self-attention-based long temporal sequence modeling method for temporal action detection

Published: 14 October 2023

Abstract

Temporal Action Detection (TAD) is a fundamental yet challenging task in video understanding that aims to detect both the temporal location and the category of each action in a video. Anchor-free TAD methods directly predict an action class at each temporal location and regress the distances from that location to the action boundaries. However, current anchor-free models encode spatiotemporal sequences with 3D convolutional networks, whose limited receptive field and built-in translation-invariance prior prevent effective long temporal sequence modeling; as a result, these methods cannot detect temporal boundaries accurately. To solve this problem, we design a novel end-to-end self-attention temporal enhancement TAD model, which introduces a Temporal Enhancement module to strengthen the temporal feature encoding of videos and expand the receptive field. Extensive experiments demonstrate that the self-attention Temporal Enhancement model yields an effective improvement over previous work, raising performance on THUMOS14 by 1.2% to 53.2% average mAP, while achieving a competitive 34.7% average mAP on ActivityNet-1.3.

Highlights

Model long temporal sequences for TAD.
Build a novel framework to model temporal semantics.
Reduce computational overhead for untrimmed videos in TAD.
Extensive experiments demonstrate the effectiveness of our method.



Information

Published In

Neurocomputing, Volume 554, Issue C
Oct 2023
307 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands


Author Tags

  1. Temporal Action Detection
  2. Self-attention
  3. Long temporal sequence modeling
  4. Vision transformer

Qualifiers

  • Research-article
