Abstract
Following the surge in popularity of Transformers in computer vision, several studies have attempted to determine whether they are more robust to distribution shifts and provide better uncertainty estimates than Convolutional Neural Networks (CNNs). The almost unanimous conclusion is that they are, and it is often conjectured, more or less explicitly, that this supposed superiority stems from the self-attention mechanism. In this paper we perform extensive empirical analyses showing that recent state-of-the-art CNNs (particularly ConvNeXt [20]) can be as robust and reliable as, and sometimes even more so than, the current state-of-the-art Transformers. However, there is no clear winner. Therefore, although it is tempting to declare one family of architectures definitively superior to the other, both achieve similarly strong performance on a variety of tasks while also suffering from similar vulnerabilities, such as texture, background, and simplicity biases.
Notes
1.
2. Consider that ViT-L/32 has about 307M parameters and ViT-L/16 has 305M, yet ViT-L/32 requires about 15 GFLOPs while ViT-L/16 requires about 61 GFLOPs, and ViT-L/32 exhibits lower accuracy and robustness than ViT-B/32 [28].
3. We understand that defining complexity is subjective. Here we assume that something that is visually more complex (having more colors, shapes, textures, etc.) across the training set would require learning more complex features.
4. We oversample OoD samples (\(4\times \)) so that both in-distribution and OoD datasets have 10000 samples each. We could also rebalance them by randomly sampling 2000 out of the 10000 in-distribution samples, but this could induce some variance in the metrics; we also observed that the average of this strategy coincides with the balancing strategy.
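The rebalancing described in the last note can be sketched as follows. This is a minimal illustration on synthetic confidence scores, not the paper's actual evaluation code; the array names and exact set sizes are assumptions chosen so that the oversampled OoD split matches the 10000 in-distribution samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical confidence scores: 10000 in-distribution samples
# and a smaller OoD set (2500 here, so that 4x replication balances them).
id_scores = rng.normal(loc=0.9, scale=0.05, size=10000)
ood_scores = rng.normal(loc=0.6, scale=0.10, size=2500)

# Oversample the OoD set 4x so both splits contribute equally to the
# metrics, mirroring the balancing strategy described in the note.
ood_balanced = np.tile(ood_scores, 4)

assert len(ood_balanced) == len(id_scores) == 10000
```

Deterministic replication (rather than random subsampling of the in-distribution set) avoids the metric variance the note mentions, since every OoD sample contributes the same number of copies on every run.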
References
Arjovsky, M., Bottou, L., Gulrajani, I., Lopez-Paz, D.: Invariant risk minimization. arXiv e-Prints arXiv:1907.02893, July 2019
Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
Bai, Y., Mei, J., Yuille, A., Xie, C.: Are transformers more robust than CNNs? In: NeurIPS (2021)
Condessa, F., Kovacevic, J., Bioucas-Dias, J.: Performance measures for classification systems with rejection. Pattern Recogn. (2015)
Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Sig. Syst. 2, 303–314 (1989)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 CVPR, pp. 248–255 (2009)
Deng, L.: The MNIST database of handwritten digit images for machine learning research. IEEE Sig. Process. Mag. 29(6), 141–142 (2012)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: ICLR (2021)
Fort, S., Ren, J., Lakshminarayanan, B.: Exploring the limits of Out-of-Distribution detection. In: NeurIPS (2021)
Fumera, G., Roli, F.: Support vector machines with embedded reject option. In: Lee, S.-W., Verri, A. (eds.) SVM 2002. LNCS, vol. 2388, pp. 68–82. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-45665-1_6
Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q.: On calibration of modern neural networks. In: ICML 2017, pp. 1321–1330. JMLR.org (2017)
Hendrycks, D., et al.: The many faces of robustness: a critical analysis of out-of-distribution generalization. In: ICCV (2021)
Hendrycks, D., Gimpel, K.: Bridging nonlinearities and stochastic regularizers with gaussian error linear units. CoRR abs/1606.08415 (2016). https://arxiv.org/abs/1606.08415
Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J., Song, D.: Natural adversarial examples. In: CVPR (2021)
Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Netw. 2(5), 359–366 (1989)
Jiang, Y., Neyshabur, B., Mobahi, H., Krishnan, D., Bengio, S.: Fantastic generalization measures and where to find them. In: ICLR (2020)
Kolesnikov, A., et al.: Big transfer (BiT): general visual representation learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12350, pp. 491–507. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58558-7_29
Landgrebe, T.C.W., Tax, D.M.J., Paclík, P., Duin, R.P.W.: The interaction between classification and reject performance for distance-based reject-option classifiers. Pattern Recogn. Lett. 27(8), 908–917 (2006)
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: ICCV (2021)
Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: CVPR (2022)
Malinin, A., Mlodozeniec, B., Gales, M.: Ensemble distribution distillation. In: ICLR (2020)
Minderer, M., et al.: Revisiting the calibration of modern neural networks. In: NeurIPS (2021)
Morrison, K., Gilby, B., Lipchak, C., Mattioli, A., Kovashka, A.: Exploring corruption robustness: inductive biases in vision transformers and MLP-mixers. CoRR abs/2106.13122 (2021). https://arxiv.org/abs/2106.13122
Mukhoti, J., Kulharia, V., Sanyal, A., Golodetz, S., Torr, P.H., Dokania, P.K.: Calibrating deep neural networks using focal loss. In: NeurIPS (2020)
Naeini, M.P., Cooper, G.F., Hauskrecht, M.: Obtaining well calibrated probabilities using Bayesian binning. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 2901–2907 (2015)
Neyshabur, B., Bhojanapalli, S., Mcallester, D., Srebro, N.: Exploring generalization in deep learning. In: Guyon, I., et al. (eds.) NeurIPS, vol. 30. Curran Associates, Inc. (2017)
Neyshabur, B., Bhojanapalli, S., Srebro, N.: A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In: ICLR (2018)
Paul, S., Chen, P.Y.: Vision transformers are robust learners. In: AAAI (2022)
Pinto, F., Torr, P., Dokania, P.: Are vision transformers always more robust than convolutional neural networks? In: NeurIPS Workshop on Distribution Shifts: Connecting Methods and Applications (2021)
Quionero-Candela, J., Sugiyama, M., Schwaighofer, A., Lawrence, N.D.: Dataset Shift in Machine Learning. The MIT Press, Cambridge (2009)
Recht, B., Roelofs, R., Schmidt, L., Shankar, V.: Do ImageNet classifiers generalize to ImageNet? In: ICML (2019)
Ridnik, T., Ben-Baruch, E., Noy, A., Zelnik-Manor, L.: ImageNet-21K pretraining for the masses (2021)
Sanyal, A., Torr, P.H.S., Dokania, P.K.: Stable rank normalization for improved generalization in neural networks and GANs. In: ICLR (2020)
Shah, H., Tamuly, K., Raghunathan, A., Jain, P., Netrapalli, P.: The pitfalls of simplicity bias in neural networks. In: NeurIPS (2020)
Tang, S., et al.: RobustART: benchmarking robustness on architecture design and training techniques. arXiv (2021)
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML (2021)
Vaswani, A., et al.: Attention is all you need. In: NeurIPS, vol. 30 (2017)
Wang, H., Ge, S., Lipton, Z., Xing, E.P.: Learning robust global representations by penalizing local predictive power. In: NeurIPS, pp. 10506–10518 (2019)
Wightman, R.: PyTorch image models (2019). https://github.com/rwightman/pytorch-image-models
Xiao, K., Engstrom, L., Ilyas, A., Madry, A.: Noise or signal: the role of image backgrounds in object recognition. In: ICLR (2021)
Yuan, L., et al.: Tokens-to-Token ViT: training vision transformers from scratch on ImageNet. In: ICCV (2021)
Zhang, C., et al.: Delving deep into the generalization of vision transformers under distribution shifts. In: CVPR (2022)
Acknowledgements
This work is supported by the UKRI grant: Turing AI Fellowship EP/W002981/1 and EPSRC/MURI grant: EP/N019474/1. We would like to thank the Royal Academy of Engineering and FiveAI. Francesco Pinto’s PhD is funded by the European Space Agency (ESA). PD would like to thank Anuj Sharma and Kemal Oksuz for their comments on the draft.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Pinto, F., Torr, P.H.S., Dokania, P.K. (2022). An Impartial Take to the CNN vs Transformer Robustness Contest. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol. 13673. Springer, Cham. https://doi.org/10.1007/978-3-031-19778-9_27
DOI: https://doi.org/10.1007/978-3-031-19778-9_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19777-2
Online ISBN: 978-3-031-19778-9
eBook Packages: Computer Science, Computer Science (R0)