
DINO-Tracker: Taming DINO for Self-supervised Point Tracking in a Single Video

Conference paper in Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15084)


Abstract

We present DINO-Tracker, a new framework for long-term dense tracking in video. The pillar of our approach is combining test-time training on a single video with the powerful localized semantic features learned by a pre-trained DINO-ViT model. Specifically, our framework simultaneously adapts DINO's features to fit the motion observations of the test video, while training a tracker that directly leverages the refined features. The entire framework is trained end-to-end using a combination of self-supervised losses and regularization that allows us to retain and benefit from DINO's semantic prior. Extensive evaluation demonstrates that our method achieves state-of-the-art results on known benchmarks. DINO-Tracker significantly outperforms self-supervised methods and is competitive with state-of-the-art supervised trackers, while outperforming them in challenging cases of tracking under long-term occlusions.
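
The abstract describes the training scheme only at a high level. The following PyTorch-style sketch illustrates the general idea of such a test-time training loop: a frozen DINO backbone, a small trainable module that adapts its features to the test video, and a tracker head trained on motion observations with a regularizer toward the original DINO features. The module names (refiner, tracker), the use of pairwise optical flow as the motion observation, and the loss weights are illustrative assumptions, not the authors' exact implementation.

    # Minimal sketch of test-time training on a single video (assumptions noted above).
    import torch
    import torch.nn.functional as F

    def test_time_train(video, dino, refiner, tracker, flow_pairs, steps=1000, lr=1e-4):
        """video: (T, 3, H, W) frames of the single test video.
        dino: frozen pre-trained DINO-ViT feature extractor.
        refiner: small trainable module that adapts DINO features to this video.
        tracker: trainable head mapping refined features to correspondences.
        flow_pairs: iterator of (i, j, flow_ij) motion observations between frames."""
        dino.eval()
        for p in dino.parameters():
            p.requires_grad_(False)  # keep the semantic backbone frozen

        params = list(refiner.parameters()) + list(tracker.parameters())
        opt = torch.optim.Adam(params, lr=lr)

        for step in range(steps):
            i, j, flow_ij = next(flow_pairs)          # one motion observation
            with torch.no_grad():
                base_i = dino(video[i:i + 1])         # frozen DINO features
                base_j = dino(video[j:j + 1])

            feat_i, feat_j = refiner(base_i), refiner(base_j)  # video-adapted features
            pred_flow = tracker(feat_i, feat_j)       # correspondences from refined features

            # Self-supervised loss: agree with the observed motion.
            loss_motion = F.l1_loss(pred_flow, flow_ij)
            # Regularization: stay close to DINO's semantic prior.
            loss_reg = F.mse_loss(feat_i, base_i) + F.mse_loss(feat_j, base_j)

            loss = loss_motion + 0.1 * loss_reg       # 0.1 is an arbitrary weight
            opt.zero_grad()
            loss.backward()
            opt.step()
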

N. Tumanyan and A. Singer—Equal contribution.



Acknowledgements

We would like to thank Rafail Fridman for his insightful remarks and assistance. We would also like to thank the authors of Omnimotion for providing the trained weights for TAP-Vid-DAVIS and TAP-Vid-Kinetics videos. The project was supported by an ERC starting grant OmniVideo (10111768), by Shimon and Golde Picker, and by the Carolito Stiftung.

Dr. Bagon is a Robin Chemers Neustein AI Fellow. He received funding from the Israeli Council for Higher Education (CHE) via the Weizmann Data Science Research Center and MBZUAI-WIS Joint Program for AI Research.

Author information

Correspondence to Narek Tumanyan.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 93760 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Tumanyan, N., Singer, A., Bagon, S., Dekel, T. (2025). DINO-Tracker: Taming DINO for Self-supervised Point Tracking in a Single Video. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15084. Springer, Cham. https://doi.org/10.1007/978-3-031-73347-5_21


  • DOI: https://doi.org/10.1007/978-3-031-73347-5_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73346-8

  • Online ISBN: 978-3-031-73347-5

  • eBook Packages: Computer Science, Computer Science (R0)
