Computer Science > Computer Vision and Pattern Recognition

arXiv:2204.01680 (cs)

[Submitted on 4 Apr 2022 (v1), last revised 26 Jul 2022 (this version, v2)]

Title: TALLFormer: Temporal Action Localization with a Long-memory Transformer


Abstract: Most modern approaches in temporal action localization divide this problem into two parts: (i) short-term feature extraction and (ii) long-range temporal boundary localization. Due to the high GPU memory cost caused by processing long untrimmed videos, many methods sacrifice the representational power of the short-term feature extractor by either freezing the backbone or using a small spatial video resolution. This issue becomes even worse with the recent video transformer models, many of which have quadratic memory complexity. To address these issues, we propose TALLFormer, a memory-efficient and end-to-end trainable Temporal Action Localization Transformer with Long-term memory. Our long-term memory mechanism eliminates the need for processing hundreds of redundant video frames during each training iteration, thus significantly reducing the GPU memory consumption and training time. These efficiency savings allow us (i) to use a powerful video transformer feature extractor without freezing the backbone or reducing the spatial video resolution, while (ii) also maintaining long-range temporal boundary localization capability. With only RGB frames as input and no external action recognition classifier, TALLFormer outperforms previous state-of-the-art methods by a large margin, achieving an average mAP of 59.1% on THUMOS14 and 35.6% on ActivityNet-1.3. The code is publicly available: this https URL.
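
The abstract does not spell out how the long-term memory works, so the following is only a minimal sketch of one plausible reading: on each training iteration, a small random subset of a video's clips is passed through the expensive backbone, while features for the remaining clips are read from a cache that is refreshed whenever a clip is re-encoded. Every name here (`FeatureMemory`, `extract_features`, `fake_backbone`, `sample_ratio`) is hypothetical and chosen for illustration; none of it is taken from the authors' code.

```python
import random


def fake_backbone(clip):
    """Stand-in for an expensive video encoder (assumption: returns a 1-D feature)."""
    return [sum(clip) / len(clip)]


class FeatureMemory:
    """Hypothetical cache holding one feature vector per (video, clip index)."""

    def __init__(self, dim=1):
        self._dim = dim
        self._bank = {}

    def get(self, video_id, clip_idx):
        # Clips never encoded so far fall back to a zero feature.
        return self._bank.get((video_id, clip_idx), [0.0] * self._dim)

    def put(self, video_id, clip_idx, feature):
        self._bank[(video_id, clip_idx)] = feature


def extract_features(video_id, clips, memory, sample_ratio, rng):
    """Encode a random subset of clips online; read the rest from memory."""
    n_online = max(1, round(sample_ratio * len(clips)))
    online = set(rng.sample(range(len(clips)), n_online))
    features = []
    for idx, clip in enumerate(clips):
        if idx in online:
            feat = fake_backbone(clip)       # expensive path, would carry gradients
            memory.put(video_id, idx, feat)  # refresh the cached copy
        else:
            feat = memory.get(video_id, idx)  # cheap path, no backbone call
        features.append(feat)
    return features, n_online


# Usage: 8 toy clips, only a quarter of them hit the backbone per iteration.
clips = [[float(i), float(i + 1)] for i in range(8)]
memory = FeatureMemory()
features, n_encoded = extract_features("video_0", clips, memory, 0.25, random.Random(0))
print(len(features), n_encoded)
```

Under these assumptions, the backbone cost per iteration scales with `sample_ratio` rather than with the full video length, while the localization head still receives a feature for every clip.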

Comments:	Accepted by ECCV 2022
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2204.01680 [cs.CV]
	(or arXiv:2204.01680v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2204.01680

Submission history

From: Feng Cheng
[v1] Mon, 4 Apr 2022 17:51:20 UTC (3,861 KB)
[v2] Tue, 26 Jul 2022 18:33:10 UTC (6,311 KB)
