Jointly Modeling Motion and Appearance Cues for Robust RGB-T Tracking

Published: 01 January 2021


In this study, we propose a novel RGB-T tracking framework that jointly models appearance and motion cues. First, to obtain a robust appearance model, we develop a novel late fusion method that infers fusion weight maps for the RGB and thermal (T) modalities. The fusion weights are determined by offline-trained global and local multimodal fusion networks and are then used to linearly combine the response maps of the RGB and T modalities. Second, when the appearance cue is unreliable, we take motion cues, i.e., target and camera motions, into account to make the tracker robust. We further propose a tracker switcher that switches flexibly between the appearance and motion trackers. Results on three recent RGB-T tracking datasets show that the proposed tracker performs significantly better than other state-of-the-art algorithms.
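
The late fusion step described in the abstract, a per-pixel linear combination of the RGB and T response maps followed by a choice between the appearance and motion trackers, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the use of plain nested lists, the hand-written weight maps (in the paper they come from offline-trained fusion networks), and the peak-score threshold used as the switching rule are all assumptions made for illustration; the abstract does not specify how reliability is measured.

```python
from typing import List, Tuple

ResponseMap = List[List[float]]


def fuse_response_maps(resp_rgb: ResponseMap, resp_t: ResponseMap,
                       w_rgb: ResponseMap, w_t: ResponseMap) -> ResponseMap:
    """Pixel-wise linear combination: w_rgb * resp_rgb + w_t * resp_t."""
    return [
        [wr * r + wt * t for r, t, wr, wt in zip(row_r, row_t, row_wr, row_wt)]
        for row_r, row_t, row_wr, row_wt in zip(resp_rgb, resp_t, w_rgb, w_t)
    ]


def find_peak(resp: ResponseMap) -> Tuple[float, Tuple[int, int]]:
    """Return the maximum response value and its (row, col) position."""
    best_score, best_pos = resp[0][0], (0, 0)
    for i, row in enumerate(resp):
        for j, value in enumerate(row):
            if value > best_score:
                best_score, best_pos = value, (i, j)
    return best_score, best_pos


def select_tracker(peak_score: float, threshold: float = 0.5) -> str:
    """Hypothetical switcher: trust appearance only if the fused peak is strong.

    The threshold rule is an assumption, not the criterion used in the paper.
    """
    return "appearance" if peak_score >= threshold else "motion"


if __name__ == "__main__":
    # Toy 2x2 response maps; the weight maps are hand-written placeholders.
    resp_rgb = [[0.1, 0.2], [0.3, 0.9]]
    resp_t = [[0.8, 0.1], [0.2, 0.1]]
    w_rgb = [[0.5, 0.5], [0.5, 1.0]]
    w_t = [[0.5, 0.5], [0.5, 0.0]]

    fused = fuse_response_maps(resp_rgb, resp_t, w_rgb, w_t)
    score, position = find_peak(fused)
    print(fused, score, position, select_tracker(score))
```

Fusing at the response-map level keeps each modality's tracker independent, so a degraded modality can be down-weighted per pixel without retraining either tracker.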



Published In

IEEE Transactions on Image Processing  Volume 30, Issue
5053 pages


IEEE Press


Cited By

  • (2025) RGB-T Tracking With Template-Bridged Search Interaction and Target-Preserved Template Updating. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47:1 (634-649). DOI: 10.1109/TPAMI.2024.3475472. Online publication date: 1-Jan-2025
  • (2025) SiamTFA: Siamese Triple-Stream Feature Aggregation Network for Efficient RGBT Tracking. IEEE Transactions on Intelligent Transportation Systems, 26:2 (1900-1913). DOI: 10.1109/TITS.2024.3512551. Online publication date: 1-Feb-2025
  • (2025) Motion-guided small MAV detection in complex and non-planar scenes. Pattern Recognition Letters, 186:C (98-105). DOI: 10.1016/j.patrec.2024.09.013. Online publication date: 30-Jan-2025
  • (2025) MKFTracker. Knowledge-Based Systems, 310:C. DOI: 10.1016/j.knosys.2024.112860. Online publication date: 15-Feb-2025
  • (2025) A lightweight robust RGB-T object tracker based on Jitter Factor and associated Kalman filter. Information Fusion, 117:C. DOI: 10.1016/j.inffus.2024.102842. Online publication date: 1-May-2025
  • (2024) Temporal adaptive RGBT tracking with modality prompt. Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence (5436-5444). DOI: 10.1609/aaai.v38i6.28352. Online publication date: 20-Feb-2024
  • (2024) Generative-based fusion mechanism for multi-modal tracking. Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence (5189-5197). DOI: 10.1609/aaai.v38i6.28325. Online publication date: 20-Feb-2024
  • (2024) Bi-directional adapter for multimodal tracking. Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence (927-935). DOI: 10.1609/aaai.v38i2.27852. Online publication date: 20-Feb-2024
  • (2024) ASFN: An RGB-T Adaptive Selection Fusion Network for Nighttime Tracking. Proceedings of the 2024 International Conference on Intelligent Perception and Pattern Recognition (111-120). DOI: 10.1145/3700035.3700054. Online publication date: 19-Jul-2024
  • (2024) Motion-Aware Self-Supervised RGBT Tracking with Multi-Modality Hierarchical Transformers. ACM Transactions on Multimedia Computing, Communications, and Applications, 20:12 (1-23). DOI: 10.1145/3698399. Online publication date: 3-Oct-2024
