Abstract
We present DINO-Tracker – a new framework for long-term dense tracking in video. The pillar of our approach is combining test-time training on a single video with the powerful localized semantic features learned by a pre-trained DINO-ViT model. Specifically, our framework simultaneously adapts DINO's features to the motion observations of the test video, while training a tracker that directly leverages the refined features. The entire framework is trained end-to-end using a combination of self-supervised losses and regularization that allows us to retain and benefit from DINO's semantic prior. Extensive evaluation demonstrates that our method achieves state-of-the-art results on established benchmarks. DINO-Tracker significantly outperforms self-supervised methods and is competitive with state-of-the-art supervised trackers, while outperforming them in challenging cases of tracking under long-term occlusions.
N. Tumanyan and A. Singer—Equal contribution.
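To make the described pipeline concrete, below is a minimal PyTorch sketch of such a test-time training loop, written under stated assumptions rather than from the paper's implementation: a frozen DINO backbone produces per-frame feature maps, a small residual adapter refines them, and training combines a self-supervised loss (feature similarity across motion correspondences, e.g. from off-the-shelf optical flow) with a regularizer that keeps the refined features close to the original DINO features. All module names, architectures, and loss weights here are illustrative placeholders.

```python
# Hypothetical sketch of test-time training: refine frozen DINO features on a
# single video with a motion-consistency loss plus a semantic-prior regularizer.
# Shapes, modules, and hyperparameters are assumptions, not the authors' code.

import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureAdapter(nn.Module):
    """Refines frozen backbone features with a small residual CNN (assumed design)."""

    def __init__(self, dim: int = 384):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Residual refinement keeps the output close to the DINO prior.
        return feats + self.net(feats)


def sample_features(fmap: torch.Tensor, pts: torch.Tensor) -> torch.Tensor:
    """Bilinearly sample a (1, C, H, W) feature map at (N, 2) points in [-1, 1]."""
    grid = pts.view(1, -1, 1, 2)                          # (1, N, 1, 2)
    out = F.grid_sample(fmap, grid, align_corners=False)  # (1, C, N, 1)
    return out.squeeze(-1).squeeze(0).t()                 # (N, C)


# --- test-time training on a single video (toy driver with random stand-ins) ---
torch.manual_seed(0)
dim, H, W = 384, 32, 32
adapter = FeatureAdapter(dim)
optim = torch.optim.Adam(adapter.parameters(), lr=1e-4)

# Stand-ins for frozen DINO feature maps of two frames, plus point pairs that
# would come from an optical-flow-based motion supervision source.
dino_a, dino_b = torch.randn(1, dim, H, W), torch.randn(1, dim, H, W)
pts_a = torch.rand(64, 2) * 2 - 1                          # normalized coords
pts_b = (pts_a + 0.02 * torch.randn(64, 2)).clamp(-1, 1)   # flow correspondences

for step in range(100):
    fa, fb = adapter(dino_a), adapter(dino_b)
    # Self-supervised loss: corresponding points get similar refined features.
    za = F.normalize(sample_features(fa, pts_a), dim=-1)
    zb = F.normalize(sample_features(fb, pts_b), dim=-1)
    loss_motion = (1 - (za * zb).sum(-1)).mean()
    # Regularizer: stay near the original DINO features to retain the
    # semantic prior (the 0.1 weight is an assumed hyperparameter).
    loss_reg = F.mse_loss(fa, dino_a) + F.mse_loss(fb, dino_b)
    loss = loss_motion + 0.1 * loss_reg
    optim.zero_grad()
    loss.backward()
    optim.step()
```

A tracker head that matches queries against the refined feature maps would be trained jointly in the same loop; it is omitted here to keep the sketch focused on the feature-adaptation idea.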
Acknowledgements
We would like to thank Rafail Fridman for his insightful remarks and assistance. We would also like to thank the authors of Omnimotion for providing the trained weights for TAP-Vid-DAVIS and TAP-Vid-Kinetics videos. The project was supported by an ERC starting grant OmniVideo (10111768), by Shimon and Golde Picker, and by the Carolito Stiftung.
Dr. Bagon is a Robin Chemers Neustein AI Fellow. He received funding from the Israeli Council for Higher Education (CHE) via the Weizmann Data Science Research Center and MBZUAI-WIS Joint Program for AI Research.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Tumanyan, N., Singer, A., Bagon, S., Dekel, T. (2025). DINO-Tracker: Taming DINO for Self-supervised Point Tracking in a Single Video. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15084. Springer, Cham. https://doi.org/10.1007/978-3-031-73347-5_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73346-8
Online ISBN: 978-3-031-73347-5
eBook Packages: Computer Science, Computer Science (R0)