Abstract
While state-of-the-art vision transformer models achieve promising results in image classification, they are computationally expensive and require many GFLOPs. Although the GFLOPs of a vision transformer can be decreased by reducing the number of tokens in the network, there is no setting that is optimal for all input images. In this work, we therefore introduce a differentiable parameter-free Adaptive Token Sampler (ATS) module, which can be plugged into any existing vision transformer architecture. ATS empowers vision transformers by scoring and adaptively sampling significant tokens. As a result, the number of tokens is no longer constant and varies for each input image. By integrating ATS as an additional layer within current transformer blocks, we can convert them into much more efficient vision transformers with an adaptive number of tokens. Since ATS is a parameter-free module, it can be added to off-the-shelf pre-trained vision transformers as a plug-and-play module, reducing their GFLOPs without any additional training. Moreover, due to its differentiable design, a vision transformer equipped with ATS can also be trained. We evaluate the efficiency of our module on both image and video classification tasks by adding it to multiple SOTA vision transformers. Our proposed module improves the SOTA by reducing their computational costs (GFLOPs) by 2\(\times\), while preserving their accuracy on the ImageNet, Kinetics-400, and Kinetics-600 datasets. The code is available at https://adaptivetokensampling.github.io/.
M. Fayyaz, S. A. Koohpayegani, and F. R. Jafari—Equal contribution.
M. Fayyaz and S. A. Koohpayegani—Work done during an internship at Microsoft.
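To make the idea in the abstract concrete, the following is a minimal PyTorch sketch of attention-based token scoring and adaptive sampling in a vision transformer block. It is illustrative only: the function name, the thresholding rule, and the interface are assumptions for exposition and do not reproduce the authors' exact ATS sampling procedure.

```python
import torch


def adaptive_token_sampling(tokens, attn, keep_ratio_of_uniform=0.5):
    """Illustrative sketch (not the authors' code).

    tokens: (B, N, C) token embeddings, with tokens[:, 0] being the CLS token.
    attn:   (B, H, N, N) attention weights from the preceding self-attention layer.
    Returns a list of per-image token tensors whose lengths vary with the input.
    """
    # Significance of each non-CLS token = attention it receives from the CLS
    # token, averaged over heads and renormalized per image.
    scores = attn.mean(dim=1)[:, 0, 1:]                  # (B, N-1)
    scores = scores / scores.sum(dim=-1, keepdim=True)

    n = scores.shape[1]
    kept = []
    for b in range(tokens.shape[0]):
        # Keep tokens scoring above a fraction of the uniform score 1/n, so the
        # number of surviving tokens adapts to the content of each image.
        idx = (scores[b] > keep_ratio_of_uniform / n).nonzero(as_tuple=True)[0]
        kept.append(torch.cat([tokens[b, :1], tokens[b, 1 + idx]], dim=0))
    return kept  # list of (n_b, C) tensors; n_b differs across the batch
```

Because the scores are derived from quantities the transformer already computes, such a layer adds no learnable parameters, which is consistent with the plug-and-play use on pre-trained models described above.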
Acknowledgments
Farnoush Rezaei Jafari acknowledges support by the Federal Ministry of Education and Research (BMBF) for the Berlin Institute for the Foundations of Learning and Data (BIFOLD) (01IS18037A). Juergen Gall has been supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy - EXC 2070 - 390732324, GA1927/4-2 (FOR 2535 Anticipating Human Behavior), and the ERC Consolidator Grant FORHUE (101044724).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Fayyaz, M. et al. (2022). Adaptive Token Sampling for Efficient Vision Transformers. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13671. Springer, Cham. https://doi.org/10.1007/978-3-031-20083-0_24
DOI: https://doi.org/10.1007/978-3-031-20083-0_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20082-3
Online ISBN: 978-3-031-20083-0
eBook Packages: Computer Science (R0)