ViC-MAE: Self-supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15062)

Abstract

We propose ViC-MAE, a model that combines Masked AutoEncoders (MAE) and contrastive learning. ViC-MAE is trained using a global representation obtained by pooling the local features learned under an MAE reconstruction loss and using this representation under a contrastive objective across images and video frames. We show that visual representations learned under ViC-MAE generalize well to video and image classification tasks. In particular, ViC-MAE obtains state-of-the-art transfer learning performance from video to images on ImageNet-1k compared to the recently proposed OmniMAE, achieving a top-1 accuracy of 86% (+1.3% absolute improvement) when trained on the same data and 87.1% (+2.4% absolute improvement) when trained on extra data. At the same time, ViC-MAE outperforms most other methods on video benchmarks, obtaining 75.9% top-1 accuracy on the challenging Something-Something-v2 video benchmark. When training on videos and images from diverse datasets, our method maintains a balanced transfer-learning performance between video and image classification benchmarks, coming in only a close second to the best supervised method.
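
The objective described above pairs a masked reconstruction loss with a contrastive loss computed on a pooled global representation shared across images and video frames. The sketch below illustrates that combination in PyTorch. It is a minimal illustration only: the encoder/decoder interfaces, mean pooling, projector head, InfoNCE loss, and 0.75 mask ratio are assumptions for exposition, not the authors' exact architecture or hyperparameters.

import torch
import torch.nn.functional as F

def patchify(imgs: torch.Tensor, patch: int = 16) -> torch.Tensor:
    # (B, C, H, W) -> (B, num_patches, patch*patch*C) pixel reconstruction targets.
    B, C, H, W = imgs.shape
    h, w = H // patch, W // patch
    x = imgs.reshape(B, C, h, patch, w, patch)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(B, h * w, patch * patch * C)

def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    # Symmetric InfoNCE: matched frame pairs are positives; other clips in the batch are negatives.
    logits = z_a @ z_b.t() / temperature
    labels = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def vic_mae_style_loss(encoder, decoder, projector, frame_a, frame_b, mask_ratio=0.75):
    # Hypothetical module interfaces (assumptions, not the paper's code):
    #   encoder(frames, mask) -> local per-patch features, shape (B, N, dim)
    #   decoder(tokens, mask) -> per-patch pixel predictions, shape (B, N, patch*patch*C)
    #   projector             -> small MLP head for the contrastive branch
    recon_losses, pooled = [], []
    for frames in (frame_a, frame_b):
        targets = patchify(frames)                                   # (B, N, patch_dim)
        B, N, _ = targets.shape
        mask = torch.rand(B, N, device=frames.device) < mask_ratio   # True = masked patch

        tokens = encoder(frames, mask)                               # local features under masking
        preds = decoder(tokens, mask)
        recon_losses.append(F.mse_loss(preds[mask], targets[mask]))  # MAE loss on masked patches only

        # Pool the local features into one global vector per frame for the contrastive objective.
        pooled.append(F.normalize(projector(tokens.mean(dim=1)), dim=-1))

    return sum(recon_losses) + info_nce(pooled[0], pooled[1])

In this reading, the two inputs are two frames sampled from the same video (or two augmented views of the same image), so the contrastive term pulls their pooled representations together while the reconstruction term preserves the local detail that MAE training provides.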

Notes

  1.

    See supplemental material for an evaluation of what we tried and did not work when combining negative-free methods with masked image modeling.

  2.

    Previous methods also use different backbones [31, 82, 84], i.e., ResNet-50. They obtain 54.5%, 33.8%, and 55.6% top-1 accuracy on linear evaluation on ImageNet-1k, respectively. Since those works do not use the same setting, we do not include them here.

  3.

    These are the only publicly available checkpoints of OmniMAE.

References

  1. Agrawal, P., Carreira, J., Malik, J.: Learning to see by moving. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 37–45 (2015)

  2. Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., Schmid, C.: ViViT: a video vision transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6836–6846 (2021)

  3. Assran, M., et al.: Masked siamese networks for label-efficient learning. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision, ECCV 2022. LNCS, vol. 13691, pp. 456–473. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19821-2_26

  4. Bao, H., Dong, L., Piao, S., Wei, F.: BEiT: BERT pre-training of image transformers. In: International Conference on Learning Representations (2021)

  5. Bardes, A., et al.: Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471 (2024)

  6. Bardes, A., Ponce, J., Lecun, Y.: VICReg: variance-invariance-covariance regularization for self-supervised learning. In: International Conference on Learning Representations, ICLR 2022 (2022)

  7. Berg, T., Liu, J., Woo Lee, S., Alexander, M.L., Jacobs, D.W., Belhumeur, P.N.: Birdsnap: large-scale fine-grained visual categorization of birds. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2011–2018 (2014)

  8. Bertasius, G., Wang, H., Torresani, L.: Is space-time attention all you need for video understanding? In: International Conference on Machine Learning, pp. 813–824. PMLR (2021)

  9. Bossard, L., Guillaumin, M., Van Gool, L.: Food-101 – mining discriminative components with random forests. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8694, pp. 446–461. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10599-4_29

  10. Cai, M., et al.: ViP-LLaVA: making large multimodal models understand arbitrary visual prompts. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 12914–12923 (2024)

  11. Caron, M., et al.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)

  12. Carreira, J., Noland, E., Banki-Horvath, A., Hillier, C., Zisserman, A.: A short note about kinetics-600. arXiv preprint arXiv:1808.01340 (2018)

  13. Carreira, J., Noland, E., Hillier, C., Zisserman, A.: A short note on the kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987 (2019)

  14. Cascante-Bonilla, P., et al.: Going beyond nouns with vision & language models using synthetic data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2023, pp. 20155–20165 (2023)

  15. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)

  16. Chen, T., Kornblith, S., Swersky, K., Norouzi, M., Hinton, G.E.: Big self-supervised models are strong semi-supervised learners. In: Advances in Neural Information Processing Systems, vol. 33, pp. 22243–22255 (2020)

  17. Chen, X., Fan, H., Girshick, R., He, K.: Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297 (2020)

  18. Chen, X., He, K.: Exploring simple siamese representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15750–15758 (2021)

  19. Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., Vedaldi, A.: Describing textures in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3606–3613 (2014)

  20. Dehghani, M., et al.: Scaling vision transformers to 22 billion parameters. In: International Conference on Machine Learning, pp. 7480–7512. PMLR (2023)

  21. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)

  22. Diba, A., Sharma, V., Gool, L.V., Stiefelhagen, R.: DynamoNet: dynamic action and motion network. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6192–6201 (2019)

  23. Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. In: International Conference on Learning Representations (2020)

  24. Fan, H., et al.: Multiscale vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6824–6835 (2021)

  25. Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In: 2004 Conference on Computer Vision and Pattern Recognition Workshop, pp. 178–178. IEEE (2004)

  26. Feichtenhofer, C., Fan, H., Li, Y., He, K.: Masked autoencoders as spatiotemporal learners. In: Neural Information Processing Systems (NeurIPS) (2022)

  27. Feichtenhofer, C., Fan, H., Xiong, B., Girshick, R., He, K.: A large-scale study on unsupervised spatiotemporal representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3299–3309 (2021)

  28. Girdhar, R., et al.: ImageBind: one embedding space to bind them all. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15180–15190 (2023)

  29. Girdhar, R., El-Nouby, A., Singh, M., Alwala, K.V., Joulin, A., Misra, I.: OmniMAE: single model masked pretraining on images and videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10406–10417 (2023)

  30. Girdhar, R., Singh, M., Ravi, N., van der Maaten, L., Joulin, A., Misra, I.: Omnivore: a single model for many visual modalities. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16102–16112 (2022)

  31. Gordon, D., Ehsani, K., Fox, D., Farhadi, A.: Watching the world go by: representation learning from unlabeled videos. arXiv preprint arXiv:2003.07990 (2020)

  32. Goyal, R., et al.: The “something something” video database for learning and evaluating visual common sense. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5842–5850 (2017)

  33. Gupta, A., Wu, J., Deng, J., Li, F.F.: Siamese masked autoencoders. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural Information Processing Systems, vol. 36, pp. 40676–40693. Curran Associates, Inc. (2023). https://proceedings.neurips.cc/paper_files/paper/2023/file/7ffb9f1b57628932518505b532301603-Paper-Conference.pdf

  34. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)

  35. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738 (2020)

  36. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969 (2017)

  37. He, R., Cascante-Bonilla, P., Yang, Z., Berg, A.C., Ordonez, V.: Improved visual grounding through self-consistent explanations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13095–13105 (2024)

  38. He, R., Cascante-Bonilla, P., Yang, Z., Berg, A.C., Ordonez, V.: Learning from models and data for visual grounding (2024). https://arxiv.org/abs/2403.13804

  39. Huang, Z., et al.: Contrastive masked autoencoders are stronger vision learners. arXiv preprint arXiv:2207.13532 (2022)

  40. Kay, W., et al.: The kinetics human action video dataset (2017). https://arxiv.org/abs/1705.06950

  41. Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3D object representations for fine-grained categorization. In: Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 554–561 (2013)

  42. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Technical report 0, University of Toronto, Toronto, Ontario (2009). https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf

  43. Lehner, J., Alkin, B., Fürst, A., Rumetshofer, E., Miklautz, L., Hochreiter, S.: Contrastive tuning: a little help to make masked autoencoders forget. arXiv preprint arXiv:2304.10520 (2023)

  44. Li, K., et al.: UniFormerV2: spatiotemporal learning by arming image ViTs with video uniformer. arXiv preprint arXiv:2211.09552 (2022)

  45. Li, K., et al.: Unmasked teacher: towards training-efficient video foundation models. arXiv preprint arXiv:2303.16058 (2023)

  46. Li, Y., Mao, H., Girshick, R., He, K.: Exploring plain vision transformer backbones for object detection. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision, ECCV 2022. LNCS, vol. 13669, pp. 280–296. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20077-9_17

  47. Li, Y., et al.: MViTv2: improved multiscale vision transformers for classification and detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4804–4814 (2022)

  48. Likhosherstov, V., et al.: PolyViT: co-training vision transformers on images, videos and audio. arXiv preprint arXiv:2111.12993 (2021)

  49. Liu, Z., et al.: Swin Transformer V2: scaling up capacity and resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12009–12019 (2022)

  50. Liu, Z., et al.: Video Swin Transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3202–3211 (2022)

  51. Lotter, W., Kreiman, G., Cox, D.: Deep predictive coding networks for video prediction and unsupervised learning. In: International Conference on Learning Representations (2016)

  52. Lu, C.Z., Jin, X., Huang, Z., Hou, Q., Cheng, M.M., Feng, J.: CMAE-V: contrastive masked autoencoders for video action recognition. arXiv preprint arXiv:2301.06018 (2023)

  53. Maji, S., Rahtu, E., Kannala, J., Blaschko, M., Vedaldi, A.: Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151 (2013)

  54. Mathieu, M., Couprie, C., LeCun, Y.: Deep multi-scale video prediction beyond mean square error. In: 4th International Conference on Learning Representations, ICLR 2016 (2016)

  55. Mishra, S., et al.: A simple, efficient and scalable contrastive masked autoencoder for learning visual representations. arXiv preprint arXiv:2210.16870 (2022)

  56. Monfort, M., et al.: Moments in time dataset: one million videos for event understanding. IEEE Trans. Pattern Anal. Mach. Intell. 42(2), 502–508 (2020)

  57. Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729. IEEE (2008)

  58. Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)

  59. Oquab, M., et al.: DINOv2: learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

  60. Parkhi, O.M., Vedaldi, A., Zisserman, A., Jawahar, C.: Cats and dogs. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3498–3505. IEEE (2012)

  61. Parthasarathy, N., Eslami, S., Carreira, J., Hénaff, O.J.: Self-supervised video pretraining yields strong image representations. arXiv preprint arXiv:2210.06433 (2022)

  62. Pathak, D., Girshick, R., Dollár, P., Darrell, T., Hariharan, B.: Learning features by watching objects move. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2701–2710 (2017)

  63. Piergiovanni, A., Kuo, W., Angelova, A.: Rethinking video ViTs: sparse video tubes for joint image and video learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2214–2224 (2023)

  64. Qian, R., et al.: Spatiotemporal contrastive video representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6964–6974 (2021)

  65. Radenović, F., Tolias, G., Chum, O.: Fine-tuning CNN image retrieval with no human annotation. IEEE Trans. Pattern Anal. Mach. Intell. 41(7), 1655–1668 (2018)

  66. Shrivastava, A., Selvaraju, R.R., Naik, N., Ordonez, V.: CLIP-Lite: information efficient visual representation learning with language supervision. In: International Conference on Artificial Intelligence and Statistics, pp. 8433–8447. PMLR (2023)

  67. Singh, N., Wu, C.W., Orife, I., Kalayeh, M.: Looking similar sounding different: leveraging counterfactual cross-modal pairs for audiovisual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26907–26918 (2024)

  68. Sriram, A., Gaidon, A., Wu, J., Niebles, J.C., Fei-Fei, L., Adeli, E.: HomE: homography-equivariant video representation learning. arXiv preprint arXiv:2306.01623 (2023)

  69. Srivastava, N., Mansimov, E., Salakhudinov, R.: Unsupervised learning of video representations using LSTMs. In: International Conference on Machine Learning, pp. 843–852. PMLR (2015)

  70. Srivastava, S., Sharma, G.: OmniVec: learning robust representations with cross modal sharing. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1236–1248 (2024)

  71. Tang, Y., Shimada, D., Bi, J., Xu, C.: AVicuna: audio-visual LLM with interleaver and context-boundary alignment for temporal referential dialogue (2024). https://arxiv.org/abs/2403.16276

  72. Tong, Z., Song, Y., Wang, J., Wang, L.: VideoMAE: masked autoencoders are data-efficient learners for self-supervised video pre-training. In: Neural Information Processing Systems (NeurIPS) (2022)

  73. Touvron, H., Cord, M., Jégou, H.: DeiT III: revenge of the ViT. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision, ECCV 2022. LNCS, vol. 13684, pp. 516–533. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20053-3_30

  74. Vondrick, C., Pirsiavash, H., Torralba, A.: Anticipating visual representations from unlabeled video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 98–106 (2016)

  75. Walker, J., Doersch, C., Gupta, A., Hebert, M.: An uncertain future: forecasting from static images using variational autoencoders. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016, Part VII 14. LNCS, vol. 9911, pp. 835–851. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46478-7_51

  76. Wang, L., et al.: VideoMAE V2: scaling video masked autoencoders with dual masking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14549–14560 (2023)

  77. Wang, R., et al.: BEVT: BERT pretraining of video transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14733–14743 (2022)

  78. Wang, R., et al.: Masked video distillation: rethinking masked feature modeling for self-supervised video representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6312–6322 (2023)

  79. Wang, X., Gupta, A.: Unsupervised learning of visual representations using videos. In: International Conference on Computer Vision (ICCV) (2015)

  80. Wang, X., Jabri, A., Efros, A.A.: Learning correspondence from the cycle-consistency of time. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2566–2576 (2019)

  81. Wei, C., Fan, H., Xie, S., Wu, C.Y., Yuille, A., Feichtenhofer, C.: Masked feature prediction for self-supervised visual pre-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14668–14678 (2022)

  82. Wu, H., Wang, X.: Contrastive learning of image representations with cross-video cycle-consistency. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10149–10159 (2021)

  83. Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., Torralba, A.: Sun database: large-scale scene recognition from abbey to zoo. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3485–3492. IEEE (2010)

  84. Xu, J., Wang, X.: Rethinking self-supervised correspondence learning: a video frame-level similarity perspective. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10075–10085 (2021)

  85. Yan, S., et al.: Multiview transformers for video recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3333–3343 (2022)

  86. Zhang, B., et al.: Co-training transformer with videos and images improves action recognition. arXiv preprint arXiv:2112.07175 (2021)

  87. Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., Torralba, A.: Places: a 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 40(6), 1452–1464 (2017)

  88. Zhou, J., et al.: iBOT: image BERT pre-training with online tokenizer. arXiv preprint arXiv:2111.07832 (2021)

Acknowledgements

The authors would like to thank Google Cloud and the CURe program from Google Research for partially providing funding for this research effort. We are also thankful for support from the Department of Computer Science at Rice University, the National Science Foundation through NSF CAREER Award #2201710, and the Ken Kennedy Institute at Rice University. We also thank anonymous reviewers for their feedback and encouragement.

Author information

Corresponding author

Correspondence to Jefferson Hernandez.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 415 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Hernandez, J., Villegas, R., Ordonez, V. (2025). ViC-MAE: Self-supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15062. Springer, Cham. https://doi.org/10.1007/978-3-031-73235-5_25

  • DOI: https://doi.org/10.1007/978-3-031-73235-5_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73234-8

  • Online ISBN: 978-3-031-73235-5

  • eBook Packages: Computer Science, Computer Science (R0)
