Jointly Modeling Motion and Appearance Cues for Robust RGB-T Tracking

Published: 01 January 2021

Abstract

In this study, we propose a novel RGB-T tracking framework by jointly modeling both appearance and motion cues. First, to obtain a robust appearance model, we develop a novel late fusion method to infer the fusion weight maps of both RGB and thermal (T) modalities. The fusion weights are determined by using offline-trained global and local multimodal fusion networks, and then adopted to linearly combine the response maps of RGB and T modalities. Second, when the appearance cue is unreliable, we comprehensively take motion cues, i.e., target and camera motions, into account to make the tracker robust. We further propose a tracker switcher to switch the appearance and motion trackers flexibly. Numerous results on three recent RGB-T tracking datasets show that the proposed tracker performs significantly better than other state-of-the-art algorithms.
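
As a rough illustration of the fusion step described above, the sketch below shows one way per-pixel weight maps could linearly blend the RGB and thermal response maps, with a simple peak-confidence test deciding when to hand control from the appearance tracker to a motion model. The function names, the threshold value, and the use of the peak response as a reliability score are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (not the authors' code): late fusion of RGB and thermal
# response maps with per-pixel weight maps, plus a simple appearance/motion
# switch. All names and the threshold are hypothetical.
import numpy as np

def fuse_responses(resp_rgb, resp_t, w_rgb, w_t, eps=1e-8):
    """Linearly combine the two modality response maps with weight maps."""
    # Normalize the weights so they sum to 1 at every spatial location.
    denom = w_rgb + w_t + eps
    return (w_rgb * resp_rgb + w_t * resp_t) / denom

def switch_tracker(fused_resp, tau=0.25):
    """Fall back to the motion tracker when the appearance cue is weak."""
    confidence = fused_resp.max()  # peak response as a reliability proxy
    return "appearance" if confidence > tau else "motion"

# Toy usage with random maps standing in for correlation-filter responses.
rng = np.random.default_rng(0)
resp_rgb = rng.random((50, 50))
resp_t = rng.random((50, 50))
w_rgb = rng.random((50, 50))   # e.g., predicted by a global/local fusion network
w_t = 1.0 - w_rgb
fused = fuse_responses(resp_rgb, resp_t, w_rgb, w_t)
print(switch_tracker(fused))
```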





Published In

IEEE Transactions on Image Processing, Volume 30, 2021, 5053 pages

Publisher

IEEE Press

Publication History

Published: 01 January 2021

Qualifiers

  • Research-article

Cited By

  • (2024) ASFN: An RGB-T Adaptive Selection Fusion Network for Nighttime Tracking. Proceedings of the 2024 International Conference on Intelligent Perception and Pattern Recognition, pp. 111-120. DOI: 10.1145/3700035.3700054. Online publication date: 19-Jul-2024.
  • (2024) Motion-Aware Self-Supervised RGBT Tracking with Multi-Modality Hierarchical Transformers. ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 20, no. 12, pp. 1-23. DOI: 10.1145/3698399. Online publication date: 3-Oct-2024.
  • (2024) Multi-Level Fusion for Robust RGBT Tracking via Enhanced Thermal Representation. ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 20, no. 10, pp. 1-24. DOI: 10.1145/3678176. Online publication date: 15-Jul-2024.
  • (2024) X-Prompt: Multi-modal Visual Prompt for Video Object Segmentation. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 5151-5160. DOI: 10.1145/3664647.3680581. Online publication date: 28-Oct-2024.
  • (2024) TandemFuse: An Intra- and Inter-Modal Fusion Strategy for RGB-T Tracking. Proceedings of the 2024 2nd Asia Conference on Computer Vision, Image Processing and Pattern Recognition, pp. 1-7. DOI: 10.1145/3663976.3663996. Online publication date: 26-Apr-2024.
  • (2024) Review and Analysis of RGBT Single Object Tracking Methods: A Fusion Perspective. ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 20, no. 8, pp. 1-27. DOI: 10.1145/3651308. Online publication date: 7-Mar-2024.
  • (2024) Learning Multi-Layer Attention Aggregation Siamese Network for Robust RGBT Tracking. IEEE Transactions on Multimedia, vol. 26, pp. 3378-3391. DOI: 10.1109/TMM.2023.3310295. Online publication date: 1-Jan-2024.
  • (2024) Global-Local MAV Detection Under Challenging Conditions Based on Appearance and Motion. IEEE Transactions on Intelligent Transportation Systems, vol. 25, no. 9, pp. 12005-12017. DOI: 10.1109/TITS.2024.3381174. Online publication date: 1-Sep-2024.
  • (2024) Exploring Multi-Modal Spatial-Temporal Contexts for High-Performance RGB-T Tracking. IEEE Transactions on Image Processing, vol. 33, pp. 4303-4318. DOI: 10.1109/TIP.2024.3428316. Online publication date: 1-Jan-2024.
  • (2024) QueryTrack: Joint-Modality Query Fusion Network for RGBT Tracking. IEEE Transactions on Image Processing, vol. 33, pp. 3187-3199. DOI: 10.1109/TIP.2024.3393298. Online publication date: 30-Apr-2024.