research-article

High performance RGB-Thermal Video Object Detection via hybrid fusion with progressive interaction and temporal-modal difference

Published: 01 February 2025

Abstract

RGB-Thermal Video Object Detection (RGBT VOD) aims to localize and classify predefined objects in visible and thermal spectrum videos. The key issue in RGBT VOD is integrating multi-modal information effectively to improve detection performance. Current multi-modal fusion methods predominantly employ middle fusion strategies, but the inherent modal difference directly limits the effect of multi-modal fusion. Although the early fusion strategy reduces the modality gap before the middle stage of the network, achieving in-depth feature interaction between different modalities remains challenging. In this work, we propose a novel hybrid fusion network called PTMNet, which effectively combines an early fusion strategy based on progressive interaction with a middle fusion strategy based on the temporal-modal difference, for high-performance RGBT VOD. In particular, we take each modality in turn as the master modality and achieve early fusion with the other modality as auxiliary information through progressive interaction. This design not only alleviates the modality gap but also facilitates middle fusion. The temporal-modal difference models temporal information through spatial offsets and uses feature erasure between modalities to encourage the network to focus on objects shared by both modalities. The hybrid fusion achieves high detection accuracy using only three input frames, which allows PTMNet to reach a high inference speed. Experimental results show that our approach achieves state-of-the-art performance on the VT-VOD50 dataset while operating at over 70 FPS. The code will be freely released at https://github.com/tzz-ahu for academic purposes.
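The two differencing mechanisms described above can be illustrated with a toy sketch. The snippet below is a minimal NumPy illustration, not the paper's implementation: `temporal_difference` stands in for offset-based temporal modeling using a fixed spatial shift (the actual offsets would be learned), and `cross_modal_erasure` mimics inter-modal feature erasure by suppressing activations that the other modality does not support. All function names, the fixed shift, and the threshold are illustrative assumptions.

```python
import numpy as np

def temporal_difference(feat_t, feat_prev, shift=1):
    """Toy temporal modeling: difference between the current feature map
    and a spatially shifted previous one. The fixed roll is a stand-in
    for learned spatial offsets."""
    shifted = np.roll(feat_prev, shift, axis=-1)
    return feat_t - shifted

def cross_modal_erasure(feat_rgb, feat_thermal, thresh=0.5):
    """Toy feature erasure between modalities: zero out positions where
    the other modality responds weakly, so the surviving responses are
    those shared by both modalities."""
    mask_rgb = (np.abs(feat_thermal) > thresh).astype(feat_rgb.dtype)
    mask_thermal = (np.abs(feat_rgb) > thresh).astype(feat_thermal.dtype)
    return feat_rgb * mask_rgb, feat_thermal * mask_thermal

# Example: RGB responds at a position where thermal is silent; erasure
# removes that response, keeping only jointly supported activations.
rgb = np.array([[1.0, 0.2], [0.8, 0.9]])
thermal = np.array([[0.9, 0.1], [0.0, 1.0]])
erased_rgb, erased_thermal = cross_modal_erasure(rgb, thermal)
diff = temporal_difference(rgb, thermal)
```

The erasure step is why the network is pushed toward objects visible in both spectra: any activation unsupported by the other modality is zeroed before fusion.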

Highlights

A hybrid fusion strategy network for RGB-Thermal video object detection.
An early fusion strategy with progressive interaction for reducing modal disparities.
A novel differential method for modeling multimodal and temporal information.
The proposed PTMNet achieves SOTA performance on the VT-VOD50 dataset.




Published In

Information Fusion, Volume 114, Issue C, February 2025, 1192 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands


Author Tags

  1. Video object detection
  2. Multi-modal fusion
  3. RGB-thermal
  4. Temporal difference
  5. Hybrid strategy

Qualifiers

  • Research-article
