
BootsTAP: Bootstrapped Training for Tracking-Any-Point

  • Conference paper
Computer Vision – ACCV 2024 (ACCV 2024)

Abstract

To endow models with greater understanding of physics and motion, it is useful to enable them to perceive how solid surfaces move and deform in real scenes. This can be formalized as Tracking-Any-Point (TAP), which requires the algorithm to track any point on solid surfaces in a video, potentially densely in space and time. Large-scale ground-truth training data for TAP is only available in simulation, which currently has a limited variety of objects and motion. In this work, we demonstrate how large-scale, unlabeled, uncurated real-world data can improve a TAP model with minimal architectural changes, using a self-supervised student-teacher setup. We demonstrate state-of-the-art performance on the TAP-Vid benchmark, surpassing previous results by a wide margin: for example, TAP-Vid-DAVIS performance improves from 61.3% to 67.4%, and TAP-Vid-Kinetics from 57.2% to 62.5%. For visualizations, see our project webpage at https://bootstap.github.io/.
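The student-teacher setup mentioned in the abstract can be sketched in a few lines. The following is a minimal illustration only, not the authors' implementation: PointTracker, ema_update, and self_training_step are hypothetical names, the model is a trivial placeholder, and additive feature noise stands in for the known spatial transforms a real pipeline would apply to the student's view (teacher tracks would normally be mapped through that transform before comparison).

```python
# Minimal sketch of self-supervised student-teacher training for point
# tracking on unlabeled video. All names are hypothetical placeholders;
# this is an illustration of the general idea, not the paper's code.

import torch
import torch.nn as nn


class PointTracker(nn.Module):
    """Stand-in for a TAP model: maps (video features, query points) to tracks."""

    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Linear(dim, 2)  # trivial placeholder head

    def forward(self, video_feats, queries):
        # video_feats: (B, T, dim); queries: (B, N, 2).
        # Returns per-frame (x, y) tracks of shape (B, T, N, 2).
        B, T, _ = video_feats.shape
        N = queries.shape[1]
        per_frame = self.net(video_feats)  # (B, T, 2)
        return per_frame[:, :, None, :].expand(B, T, N, 2) + queries[:, None]


@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    # Teacher weights follow the student as a slow exponential moving average.
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(decay).add_(ps, alpha=1.0 - decay)


def self_training_step(student, teacher, video_feats, queries, optimizer):
    # 1) The teacher predicts pseudo-label tracks on the clean clip.
    with torch.no_grad():
        pseudo_tracks = teacher(video_feats, queries)

    # 2) The student sees a perturbed view of the same clip (additive noise
    #    here, standing in for real spatial augmentations) and is trained
    #    to reproduce the teacher's tracks.
    augmented = video_feats + 0.01 * torch.randn_like(video_feats)
    student_tracks = student(augmented, queries)

    loss = (student_tracks - pseudo_tracks).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # 3) Slowly distil the student back into the teacher.
    ema_update(teacher, student)
    return loss.item()


if __name__ == "__main__":
    student = PointTracker()
    teacher = PointTracker()
    teacher.load_state_dict(student.state_dict())
    opt = torch.optim.Adam(student.parameters(), lr=1e-4)
    feats = torch.randn(2, 8, 64)  # (batch, frames, feature dim)
    qs = torch.rand(2, 16, 2)      # 16 query points per clip
    print(self_training_step(student, teacher, feats, qs, opt))
```

Because the teacher's pseudo-labels are computed without gradients and the teacher is only updated by exponential moving average, the student cannot trivially collapse onto its own errors, which is what makes this style of bootstrapping on uncurated video stable in practice.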



Acknowledgements

We thank Jon Scholz, Stannis Zhou, Mel Vecerik, Yusuf Aytar, Viorica Patraucean, Mehdi Sajjadi, Daniel Zoran, and Nando de Freitas for valuable discussions and support, and David Bridson, Lucas Smaira, and Michael King for help on datasets.

Author information


Corresponding author

Correspondence to Carl Doersch.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 7612 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Doersch, C. et al. (2025). BootsTAP: Bootstrapped Training for Tracking-Any-Point. In: Cho, M., Laptev, I., Tran, D., Yao, A., Zha, H. (eds) Computer Vision – ACCV 2024. ACCV 2024. Lecture Notes in Computer Science, vol 15473. Springer, Singapore. https://doi.org/10.1007/978-981-96-0901-7_28


  • DOI: https://doi.org/10.1007/978-981-96-0901-7_28

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-96-0900-0

  • Online ISBN: 978-981-96-0901-7

  • eBook Packages: Computer Science, Computer Science (R0)
