Abstract
Following the surge in popularity of Transformers in computer vision, several studies have attempted to determine whether they are more robust to distribution shifts and provide better uncertainty estimates than Convolutional Neural Networks (CNNs). The almost unanimous conclusion is that they are, and it is often conjectured, more or less explicitly, that this supposed superiority stems from the self-attention mechanism. In this paper we perform extensive empirical analyses showing that recent state-of-the-art CNNs (particularly ConvNeXt [20]) can be as robust and reliable as, and sometimes even more so than, the current state-of-the-art Transformers. However, there is no clear winner. Therefore, although it is tempting to declare one family of architectures definitively superior to the other, both achieve similarly strong performance on a variety of tasks while also suffering from similar vulnerabilities, such as texture, background, and simplicity biases.
Notes
1.
2. Consider that ViT-L/32 has about 307M parameters and ViT-L/16 has 305M, yet ViT-L/32 requires about 15 GFLOPs while ViT-L/16 requires about 61 GFLOPs, and ViT-L/32 exhibits lower accuracy and robustness than ViT-B/32 [28].
3. We understand that defining complexity is subjective. Here we assume that something that is visually more complex (having more colors, shapes, textures, etc.) across the training set would require learning more complex features.
4. We oversample OoD samples (\(4\times \)) so that both in-distribution and OoD datasets have 10000 samples each. We could also rebalance them by randomly sampling 2000 out of the 10000 in-distribution samples, but this could induce some variance in the metrics; we also observed that the average of this strategy coincides with the balancing strategy.
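The rebalancing described in the last note can be sketched as follows. This is a minimal illustration on synthetic confidence scores, not the paper's actual evaluation code; the array names and exact set sizes are assumptions chosen so that the oversampled OoD split matches the 10000 in-distribution samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical confidence scores: 10000 in-distribution samples
# and a smaller OoD set (2500 here, so that 4x replication balances them).
id_scores = rng.normal(loc=0.9, scale=0.05, size=10000)
ood_scores = rng.normal(loc=0.6, scale=0.10, size=2500)

# Oversample the OoD set 4x so both splits contribute equally to the
# metrics, mirroring the balancing strategy described in the note.
ood_balanced = np.tile(ood_scores, 4)

assert len(ood_balanced) == len(id_scores) == 10000
```

Deterministic replication (rather than random subsampling of the in-distribution set) avoids the metric variance the note mentions, since every OoD sample contributes the same number of copies on every run.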
References
Arjovsky, M., Bottou, L., Gulrajani, I., Lopez-Paz, D.: Invariant risk minimization. arXiv e-Prints arXiv:1907.02893, July 2019
Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
Bai, Y., Mei, J., Yuille, A., Xie, C.: Are transformers more robust than CNNs? In: NeurIPS (2021)
Condessa, F., Kovacevic, J., Bioucas-Dias, J.: Performance measures for classification systems with rejection. Pattern Recogn. (2015)
Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Sig. Syst. 2, 303–314 (1989)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 CVPR, pp. 248–255 (2009)
Deng, L.: The MNIST database of handwritten digit images for machine learning research. IEEE Sig. Process. Mag. 29(6), 141–142 (2012)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: ICLR (2021)
Fort, S., Ren, J., Lakshminarayanan, B.: Exploring the limits of Out-of-Distribution detection. In: NeurIPS (2021)
Fumera, G., Roli, F.: Support vector machines with embedded reject option. In: Lee, S.-W., Verri, A. (eds.) SVM 2002. LNCS, vol. 2388, pp. 68–82. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-45665-1_6
Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q.: On calibration of modern neural networks. In: ICML 2017, pp. 1321–1330. JMLR.org (2017)
Hendrycks, D., et al.: The many faces of robustness: a critical analysis of out-of-distribution generalization. In: ICCV (2021)
Hendrycks, D., Gimpel, K.: Bridging nonlinearities and stochastic regularizers with gaussian error linear units. CoRR abs/1606.08415 (2016). https://arxiv.org/abs/1606.08415
Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J., Song, D.: Natural adversarial examples. In: CVPR (2021)
Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Netw. 2(5), 359–366 (1989)
Jiang, Y., Neyshabur, B., Mobahi, H., Krishnan, D., Bengio, S.: Fantastic generalization measures and where to find them. In: ICLR (2020)
Kolesnikov, A., et al.: Big transfer (BiT): general visual representation learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12350, pp. 491–507. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58558-7_29
Landgrebe, T.C.W., Tax, D.M.J., Paclík, P., Duin, R.P.W.: The interaction between classification and reject performance for distance-based reject-option classifiers. Pattern Recogn. Lett. 27(8), 908–917 (2006)
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: ICCV (2021)
Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: CVPR (2022)
Malinin, A., Mlodozeniec, B., Gales, M.: Ensemble distribution distillation. In: ICLR (2020)
Minderer, M., et al.: Revisiting the calibration of modern neural networks. In: NeurIPS (2021)
Morrison, K., Gilby, B., Lipchak, C., Mattioli, A., Kovashka, A.: Exploring corruption robustness: inductive biases in vision transformers and MLP-mixers. CoRR abs/2106.13122 (2021). https://arxiv.org/abs/2106.13122
Mukhoti, J., Kulharia, V., Sanyal, A., Golodetz, S., Torr, P.H., Dokania, P.K.: Calibrating deep neural networks using focal loss. In: NeurIPS (2020)
Naeini, M.P., Cooper, G.F., Hauskrecht, M.: Obtaining well calibrated probabilities using Bayesian binning. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 2901–2907 (2015)
Neyshabur, B., Bhojanapalli, S., Mcallester, D., Srebro, N.: Exploring generalization in deep learning. In: Guyon, I., et al. (eds.) NeurIPS, vol. 30. Curran Associates, Inc. (2017)
Neyshabur, B., Bhojanapalli, S., Srebro, N.: A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In: ICLR (2018)
Paul, S., Chen, P.Y.: Vision transformers are robust learners. In: AAAI (2022)
Pinto, F., Torr, P., Dokania, P.: Are vision transformers always more robust than convolutional neural networks? In: NeurIPS Workshop on Distribution Shifts: Connecting Methods and Applications (2021)
Quionero-Candela, J., Sugiyama, M., Schwaighofer, A., Lawrence, N.D.: Dataset Shift in Machine Learning. The MIT Press, Cambridge (2009)
Recht, B., Roelofs, R., Schmidt, L., Shankar, V.: Do ImageNet classifiers generalize to ImageNet? In: ICML (2019)
Ridnik, T., Ben-Baruch, E., Noy, A., Zelnik-Manor, L.: ImageNet-21K pretraining for the masses (2021)
Sanyal, A., Torr, P.H.S., Dokania, P.K.: Stable rank normalization for improved generalization in neural networks and GANs. In: ICLR (2020)
Shah, H., Tamuly, K., Raghunathan, A., Jain, P., Netrapalli, P.: The pitfalls of simplicity bias in neural networks. In: NeurIPS (2020)
Tang, S., et al.: RobustART: benchmarking robustness on architecture design and training techniques. arXiv (2021)
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML (2021)
Vaswani, A., et al.: Attention is all you need. In: NeurIPS, vol. 30 (2017)
Wang, H., Ge, S., Lipton, Z., Xing, E.P.: Learning robust global representations by penalizing local predictive power. In: NeurIPS, pp. 10506–10518 (2019)
Wightman, R.: PyTorch image models (2019). https://github.com/rwightman/pytorch-image-models
Xiao, K., Engstrom, L., Ilyas, A., Madry, A.: Noise or signal: the role of image backgrounds in object recognition. In: ICLR (2021)
Yuan, L., et al.: Tokens-to-Token ViT: training vision transformers from scratch on ImageNet. In: ICCV (2021)
Zhang, C., et al.: Delving deep into the generalization of vision transformers under distribution shifts. In: CVPR (2022)
Acknowledgements
This work is supported by the UKRI grant: Turing AI Fellowship EP/W002981/1 and EPSRC/MURI grant: EP/N019474/1. We would like to thank the Royal Academy of Engineering and FiveAI. Francesco Pinto’s PhD is funded by the European Space Agency (ESA). PD would like to thank Anuj Sharma and Kemal Oksuz for their comments on the draft.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Pinto, F., Torr, P.H.S., Dokania, P.K. (2022). An Impartial Take to the CNN vs Transformer Robustness Contest. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol. 13673. Springer, Cham. https://doi.org/10.1007/978-3-031-19778-9_27
DOI: https://doi.org/10.1007/978-3-031-19778-9_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19777-2
Online ISBN: 978-3-031-19778-9
eBook Packages: Computer Science, Computer Science (R0)