research-article

High performance RGB-Thermal Video Object Detection via hybrid fusion with progressive interaction and temporal-modal difference

Published: 01 February 2025

Abstract

RGB-Thermal Video Object Detection (RGBT VOD) aims to localize and classify predefined objects in visible and thermal spectrum videos. The key issue in RGBT VOD is integrating multi-modal information effectively to improve detection performance. Current multi-modal fusion methods predominantly employ middle fusion strategies, but the inherent modal difference directly limits the effect of multi-modal fusion. Although the early fusion strategy reduces the modality gap before the middle stage of the network, achieving in-depth feature interaction between different modalities remains challenging. In this work, we propose a novel hybrid fusion network called PTMNet, which effectively combines an early fusion strategy based on progressive interaction with a middle fusion strategy based on the temporal-modal difference, for high-performance RGBT VOD. In particular, we take each modality in turn as the master modality and achieve early fusion with the other modality as auxiliary information through progressive interaction. This design not only alleviates the modality gap but also facilitates middle fusion. The temporal-modal difference models temporal information through spatial offsets and uses feature erasure between modalities to encourage the network to focus on objects shared by both modalities. The hybrid fusion achieves high detection accuracy using only three input frames, which allows PTMNet to reach a high inference speed. Experimental results show that our approach achieves state-of-the-art performance on the VT-VOD50 dataset while operating at over 70 FPS. The code will be freely released at https://github.com/tzz-ahu for academic purposes.
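The two differencing mechanisms described above can be illustrated with a toy sketch. The snippet below is a minimal NumPy illustration, not the paper's implementation: `temporal_difference` stands in for offset-based temporal modeling using a fixed spatial shift (the actual offsets would be learned), and `cross_modal_erasure` mimics inter-modal feature erasure by suppressing activations that the other modality does not support. All function names, the fixed shift, and the threshold are illustrative assumptions.

```python
import numpy as np

def temporal_difference(feat_t, feat_prev, shift=1):
    """Toy temporal modeling: difference between the current feature map
    and a spatially shifted previous one. The fixed roll is a stand-in
    for learned spatial offsets."""
    shifted = np.roll(feat_prev, shift, axis=-1)
    return feat_t - shifted

def cross_modal_erasure(feat_rgb, feat_thermal, thresh=0.5):
    """Toy feature erasure between modalities: zero out positions where
    the other modality responds weakly, so the surviving responses are
    those shared by both modalities."""
    mask_rgb = (np.abs(feat_thermal) > thresh).astype(feat_rgb.dtype)
    mask_thermal = (np.abs(feat_rgb) > thresh).astype(feat_thermal.dtype)
    return feat_rgb * mask_rgb, feat_thermal * mask_thermal

# Example: RGB responds at a position where thermal is silent; erasure
# removes that response, keeping only jointly supported activations.
rgb = np.array([[1.0, 0.2], [0.8, 0.9]])
thermal = np.array([[0.9, 0.1], [0.0, 1.0]])
erased_rgb, erased_thermal = cross_modal_erasure(rgb, thermal)
diff = temporal_difference(rgb, thermal)
```

The erasure step is why the network is pushed toward objects visible in both spectra: any activation unsupported by the other modality is zeroed before fusion.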

Highlights

A hybrid fusion strategy network for RGB-Thermal video object detection.
An early fusion strategy with progressive interaction for reducing modal disparities.
A novel differential method for modeling multimodal and temporal information.
The proposed PTMNet achieves SOTA performance on the VT-VOD50 dataset.




Published In

Information Fusion, Volume 114, Issue C, February 2025, 1192 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands


Author Tags

  1. Video object detection
  2. Multi-modal fusion
  3. RGB-thermal
  4. Temporal difference
  5. Hybrid strategy

Qualifiers

  • Research-article
