Abstract
To endow models with greater understanding of physics and motion, it is useful to enable them to perceive how solid surfaces move and deform in real scenes. This can be formalized as Tracking-Any-Point (TAP), which requires the algorithm to track any point on solid surfaces in a video, potentially densely in space and time. Large-scale ground-truth training data for TAP is only available in simulation, which currently has a limited variety of objects and motion. In this work, we demonstrate how large-scale, unlabeled, uncurated real-world data can improve a TAP model with minimal architectural changes, using a self-supervised student-teacher setup. We achieve state-of-the-art performance on the TAP-Vid benchmark, surpassing previous results by a wide margin: for example, TAP-Vid-DAVIS performance improves from 61.3% to 67.4%, and TAP-Vid-Kinetics from 57.2% to 62.5%. For visualizations, see our project webpage at https://bootstap.github.io/.
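The abstract only names the training scheme, so as a rough illustration of what a self-supervised student-teacher (bootstrapping) setup for point tracking can look like, here is a minimal JAX sketch. Everything in it is an assumption for illustration, not the paper's method: the toy `track_model` (a single linear map), the MSE consistency loss, the photometric augmentation, and the EMA rate and learning rate are all placeholders standing in for a real TAP architecture and the paper's actual losses and augmentations.

```python
import jax
import jax.numpy as jnp

def track_model(params, video, queries):
    # Toy stand-in for a point-tracking network: per-frame features from a
    # single linear map, added to the query coordinates.
    # Shapes: video [T, H, W, C], queries [N, 2] -> tracks [T, N, 2].
    feats = video.reshape(video.shape[0], -1) @ params["w"]   # [T, 2]
    return queries[None, :, :] + feats[:, None, :]            # [T, N, 2]

def consistency_loss(student_params, teacher_params, video, aug_video, queries):
    # The teacher sees the original clip; the student sees an augmented copy
    # and is trained to reproduce the teacher's (stop-gradient) tracks.
    targets = jax.lax.stop_gradient(track_model(teacher_params, video, queries))
    preds = track_model(student_params, aug_video, queries)
    return jnp.mean((preds - targets) ** 2)

@jax.jit
def update(student_params, teacher_params, video, aug_video, queries):
    lr, ema = 1e-3, 0.99  # illustrative hyperparameters
    loss, grads = jax.value_and_grad(consistency_loss)(
        student_params, teacher_params, video, aug_video, queries)
    # Gradient step on the student only.
    student_params = jax.tree_util.tree_map(lambda p, g: p - lr * g,
                                            student_params, grads)
    # The teacher trails the student as an exponential moving average.
    teacher_params = jax.tree_util.tree_map(lambda t, s: ema * t + (1 - ema) * s,
                                            teacher_params, student_params)
    return student_params, teacher_params, loss

# Example usage on random data (unlabeled real video would replace `video`).
key = jax.random.PRNGKey(0)
T, H, W, C, N = 8, 32, 32, 3, 16
student = {"w": 1e-3 * jax.random.normal(key, (H * W * C, 2))}
teacher = jax.tree_util.tree_map(jnp.copy, student)
video = jax.random.uniform(key, (T, H, W, C))
aug_video = video + 0.05 * jax.random.normal(key, video.shape)  # crude photometric jitter
queries = jax.random.uniform(key, (N, 2)) * jnp.array([W, H])
student, teacher, loss = update(student, teacher, video, aug_video, queries)
```

The design point this sketch captures is that no labels are needed on the real videos: supervision comes from the teacher's own predictions on the unaugmented view, while the student learns to be consistent under augmentation.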
Acknowledgements
We thank Jon Scholz, Stannis Zhou, Mel Vecerik, Yusuf Aytar, Viorica Patraucean, Mehdi Sajjadi, Daniel Zoran, and Nando de Freitas for valuable discussions and support, and David Bridson, Lucas Smaira, and Michael King for help on datasets.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Doersch, C. et al. (2025). BootsTAP: Bootstrapped Training for Tracking-Any-Point. In: Cho, M., Laptev, I., Tran, D., Yao, A., Zha, H. (eds) Computer Vision – ACCV 2024. ACCV 2024. Lecture Notes in Computer Science, vol 15473. Springer, Singapore. https://doi.org/10.1007/978-981-96-0901-7_28
DOI: https://doi.org/10.1007/978-981-96-0901-7_28
Publisher Name: Springer, Singapore
Print ISBN: 978-981-96-0900-0
Online ISBN: 978-981-96-0901-7