Research Article · Open Access

Simplifying Cross-modal Interaction via Modality-Shared Features for RGBT Tracking

Published: 28 October 2024

Abstract

Thermal infrared (TIR) data exhibits higher tolerance to extreme environments, making it a valuable complement to RGB data in tracking tasks. RGBT tracking aims to leverage information from RGB and TIR images for stable and robust tracking. However, existing RGBT tracking methods face challenges due to significant modality differences and selective emphasis on interactive information, leading to inefficiencies in cross-modal interaction. To address these issues, we propose a novel Integrating Interaction into Modality-shared Features with ViT (IIMF) framework, a simplified cross-modal interaction network comprising modality-shared, RGB modality-specific, and TIR modality-specific branches. The modality-shared branch aggregates modality-shared information and implements inter-modal interaction. Specifically, our approach first extracts modality-shared features from RGB and TIR features with a cross-attention mechanism. Furthermore, we design a Cross-Attention-based Modality-shared Information Aggregation (CAMIA) module to further aggregate modality-shared information with modality-shared tokens. We evaluate our model on three widely used benchmark datasets, and extensive experiments demonstrate that our method achieves state-of-the-art performance. All source code is released at https://github.com/Liqiu-Chen/IIMF.
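The cross-attention step described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the token counts, feature dimension, and the averaging used to merge the two attended outputs are all illustrative assumptions; the actual IIMF model builds on a ViT backbone and learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d_k):
    """Tokens of one modality (queries) attend to tokens of the other
    modality (keys/values); returns attended features per query token."""
    scores = queries @ keys_values.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ keys_values

# Illustrative token features: N tokens of dimension d per modality.
rng = np.random.default_rng(0)
N, d = 16, 64
rgb = rng.standard_normal((N, d))
tir = rng.standard_normal((N, d))

# Each modality attends to the other; averaging the two attended outputs
# is one plausible (assumed) way to form modality-shared features.
rgb_to_tir = cross_attention(rgb, tir, d)
tir_to_rgb = cross_attention(tir, rgb, d)
shared = 0.5 * (rgb_to_tir + tir_to_rgb)
print(shared.shape)  # (16, 64)
```

In the paper's full pipeline, such shared features would additionally be aggregated with learnable modality-shared tokens by the CAMIA module.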



Published In
    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024
    11719 pages
    ISBN:9798400706868
    DOI:10.1145/3664647
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. inter- and inner-modal interaction
    2. rgbt tracking

    Qualifiers

    • Research-article

    Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne, VIC, Australia

    Acceptance Rates

MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions, 26%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%
