
DINO-Tracker: Taming DINO for Self-supervised Point Tracking in a Single Video

Conference paper in Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15084)


Abstract

We present DINO-Tracker, a new framework for long-term dense tracking in video. The pillar of our approach is combining test-time training on a single video with the powerful localized semantic features learned by a pre-trained DINO-ViT model. Specifically, our framework simultaneously adapts DINO's features to fit the motion observations of the test video, while training a tracker that directly leverages the refined features. The entire framework is trained end-to-end using a combination of self-supervised losses and regularization that allows us to retain and benefit from DINO's semantic prior. Extensive evaluation demonstrates that our method achieves state-of-the-art results on known benchmarks. DINO-Tracker significantly outperforms self-supervised methods and is competitive with state-of-the-art supervised trackers, while outperforming them in challenging cases of tracking under long-term occlusions.
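
The abstract describes the training scheme only at a high level. The following PyTorch-style sketch illustrates the general idea of such a test-time training loop: a frozen DINO backbone, a small trainable module that adapts its features to the test video, and a tracker head trained on motion observations with a regularizer toward the original DINO features. The module names (refiner, tracker), the use of pairwise optical flow as the motion observation, and the loss weights are illustrative assumptions, not the authors' exact implementation.

    # Minimal sketch of test-time training on a single video (assumptions noted above).
    import torch
    import torch.nn.functional as F

    def test_time_train(video, dino, refiner, tracker, flow_pairs, steps=1000, lr=1e-4):
        """video: (T, 3, H, W) frames of the single test video.
        dino: frozen pre-trained DINO-ViT feature extractor.
        refiner: small trainable module that adapts DINO features to this video.
        tracker: trainable head mapping refined features to correspondences.
        flow_pairs: iterator of (i, j, flow_ij) motion observations between frames."""
        dino.eval()
        for p in dino.parameters():
            p.requires_grad_(False)  # keep the semantic backbone frozen

        params = list(refiner.parameters()) + list(tracker.parameters())
        opt = torch.optim.Adam(params, lr=lr)

        for step in range(steps):
            i, j, flow_ij = next(flow_pairs)          # one motion observation
            with torch.no_grad():
                base_i = dino(video[i:i + 1])         # frozen DINO features
                base_j = dino(video[j:j + 1])

            feat_i, feat_j = refiner(base_i), refiner(base_j)  # video-adapted features
            pred_flow = tracker(feat_i, feat_j)       # correspondences from refined features

            # Self-supervised loss: agree with the observed motion.
            loss_motion = F.l1_loss(pred_flow, flow_ij)
            # Regularization: stay close to DINO's semantic prior.
            loss_reg = F.mse_loss(feat_i, base_i) + F.mse_loss(feat_j, base_j)

            loss = loss_motion + 0.1 * loss_reg       # 0.1 is an arbitrary weight
            opt.zero_grad()
            loss.backward()
            opt.step()
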

N. Tumanyan and A. Singer—Equal contribution.



Acknowledgements

We would like to thank Rafail Fridman for his insightful remarks and assistance. We would also like to thank the authors of Omnimotion for providing the trained weights for TAP-Vid-DAVIS and TAP-Vid-Kinetics videos. The project was supported by an ERC starting grant OmniVideo (10111768), by Shimon and Golde Picker, and by the Carolito Stiftung.

Dr. Bagon is a Robin Chemers Neustein AI Fellow. He received funding from the Israeli Council for Higher Education (CHE) via the Weizmann Data Science Research Center and MBZUAI-WIS Joint Program for AI Research.

Author information

Correspondence to Narek Tumanyan.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 93760 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Tumanyan, N., Singer, A., Bagon, S., Dekel, T. (2025). DINO-Tracker: Taming DINO for Self-supervised Point Tracking in a Single Video. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15084. Springer, Cham. https://doi.org/10.1007/978-3-031-73347-5_21


  • DOI: https://doi.org/10.1007/978-3-031-73347-5_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73346-8

  • Online ISBN: 978-3-031-73347-5

  • eBook Packages: Computer Science, Computer Science (R0)
