Research Article · Open Access

Simplifying Cross-modal Interaction via Modality-Shared Features for RGBT Tracking

Published: 28 October 2024

Abstract

Thermal infrared (TIR) data exhibits higher tolerance to extreme environments, making it a valuable complement to RGB data in tracking tasks. RGBT tracking aims to leverage information from RGB and TIR images for stable and robust tracking. However, existing RGBT tracking methods face challenges due to significant modality differences and selective emphasis on interactive information, leading to inefficiencies in cross-modal interaction. To address these issues, we propose a novel Integrating Interaction into Modality-shared Features with ViT (IIMF) framework, a simplified cross-modal interaction network comprising modality-shared, RGB modality-specific, and TIR modality-specific branches. The modality-shared branch aggregates modality-shared information and implements inter-modal interaction. Specifically, our approach first extracts modality-shared features from RGB and TIR features with a cross-attention mechanism. Furthermore, we design a Cross-Attention-based Modality-shared Information Aggregation (CAMIA) module to further aggregate modality-shared information with modality-shared tokens. We evaluate our model on three widely used benchmark datasets, and extensive experiments demonstrate that our method achieves state-of-the-art performance. All source code is released at https://github.com/Liqiu-Chen/IIMF.
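The cross-attention step described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the token counts, feature dimension, and the averaging used to merge the two attended outputs are all illustrative assumptions; the actual IIMF model builds on a ViT backbone and learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d_k):
    """Tokens of one modality (queries) attend to tokens of the other
    modality (keys/values); returns attended features per query token."""
    scores = queries @ keys_values.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ keys_values

# Illustrative token features: N tokens of dimension d per modality.
rng = np.random.default_rng(0)
N, d = 16, 64
rgb = rng.standard_normal((N, d))
tir = rng.standard_normal((N, d))

# Each modality attends to the other; averaging the two attended outputs
# is one plausible (assumed) way to form modality-shared features.
rgb_to_tir = cross_attention(rgb, tir, d)
tir_to_rgb = cross_attention(tir, rgb, d)
shared = 0.5 * (rgb_to_tir + tir_to_rgb)
print(shared.shape)  # (16, 64)
```

In the paper's full pipeline, such shared features would additionally be aggregated with learnable modality-shared tokens by the CAMIA module.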



Published In
    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024
    11719 pages
    ISBN:9798400706868
    DOI:10.1145/3664647
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. inter- and inner-modal interaction
    2. rgbt tracking

    Qualifiers

    • Research-article

    Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne, VIC, Australia

    Acceptance Rates

MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions, 26%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%
