Computer Science > Computer Vision and Pattern Recognition

arXiv:2401.07942 (cs)

[Submitted on 15 Jan 2024]

Title: Transformer-based Video Saliency Prediction with High Temporal Dimension Decoding

Authors: Morteza Moradi, Simone Palazzo, Concetto Spampinato

Abstract: In recent years, finding an effective and efficient strategy for exploiting spatial and temporal information has been a hot research topic in video saliency prediction (VSP). With the emergence of spatio-temporal transformers, the weakness of the prior strategies, e.g., 3D convolutional networks and LSTM-based networks, for capturing long-range dependencies has been effectively compensated. While VSP has drawn benefits from spatio-temporal transformers, finding the most effective way for aggregating temporal features is still challenging. To address this concern, we propose a transformer-based video saliency prediction approach with high temporal dimension decoding network (THTD-Net). This strategy accounts for the lack of complex hierarchical interactions between features that are extracted from the transformer-based spatio-temporal encoder: in particular, it does not require multiple decoders and aims at gradually reducing temporal features' dimensions in the decoder. This decoder-based architecture yields comparable performance to multi-branch and over-complicated models on common benchmarks such as DHF1K, UCF-sports and Hollywood-2.

Comments:	8 pages, 2 figures, 3 tables
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Cite as:	arXiv:2401.07942 [cs.CV]
	(or arXiv:2401.07942v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2401.07942

Submission history

From: Morteza Moradi
[v1] Mon, 15 Jan 2024 20:09:56 UTC (766 KB)
