Abstract
While state-of-the-art vision transformer models achieve promising results in image classification, they are computationally expensive and require many GFLOPs. Although the GFLOPs of a vision transformer can be decreased by reducing the number of tokens in the network, there is no setting that is optimal for all input images. In this work, we therefore introduce a differentiable parameter-free Adaptive Token Sampler (ATS) module, which can be plugged into any existing vision transformer architecture. ATS empowers vision transformers by scoring and adaptively sampling significant tokens. As a result, the number of tokens is no longer constant and varies for each input image. By integrating ATS as an additional layer within current transformer blocks, we can convert them into much more efficient vision transformers with an adaptive number of tokens. Since ATS is a parameter-free module, it can be added to off-the-shelf pre-trained vision transformers as a plug-and-play module, reducing their GFLOPs without any additional training. Moreover, due to its differentiable design, a vision transformer equipped with ATS can also be trained. We evaluate the efficiency of our module on both image and video classification tasks by adding it to multiple SOTA vision transformers. Our proposed module improves the SOTA by reducing their computational costs (GFLOPs) by 2\(\times\), while preserving their accuracy on the ImageNet, Kinetics-400, and Kinetics-600 datasets. The code is available at https://adaptivetokensampling.github.io/.
M. Fayyaz, S. A. Koohpayegani, and F. R. Jafari—Equal contribution.
M. Fayyaz and S. A. Koohpayegani—Work done during an internship at Microsoft.
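To make the idea in the abstract concrete, the following is a minimal PyTorch sketch of attention-based token scoring and adaptive sampling in a vision transformer block. It is illustrative only: the function name, the thresholding rule, and the interface are assumptions for exposition and do not reproduce the authors' exact ATS sampling procedure.

```python
import torch


def adaptive_token_sampling(tokens, attn, keep_ratio_of_uniform=0.5):
    """Illustrative sketch (not the authors' code).

    tokens: (B, N, C) token embeddings, with tokens[:, 0] being the CLS token.
    attn:   (B, H, N, N) attention weights from the preceding self-attention layer.
    Returns a list of per-image token tensors whose lengths vary with the input.
    """
    # Significance of each non-CLS token = attention it receives from the CLS
    # token, averaged over heads and renormalized per image.
    scores = attn.mean(dim=1)[:, 0, 1:]                  # (B, N-1)
    scores = scores / scores.sum(dim=-1, keepdim=True)

    n = scores.shape[1]
    kept = []
    for b in range(tokens.shape[0]):
        # Keep tokens scoring above a fraction of the uniform score 1/n, so the
        # number of surviving tokens adapts to the content of each image.
        idx = (scores[b] > keep_ratio_of_uniform / n).nonzero(as_tuple=True)[0]
        kept.append(torch.cat([tokens[b, :1], tokens[b, 1 + idx]], dim=0))
    return kept  # list of (n_b, C) tensors; n_b differs across the batch
```

Because the scores are derived from quantities the transformer already computes, such a layer adds no learnable parameters, which is consistent with the plug-and-play use on pre-trained models described above.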
Acknowledgments
Farnoush Rezaei Jafari acknowledges support by the Federal Ministry of Education and Research (BMBF) for the Berlin Institute for the Foundations of Learning and Data (BIFOLD) (01IS18037A). Juergen Gall has been supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy - EXC 2070 - 390732324, GA1927/4-2 (FOR 2535 Anticipating Human Behavior), and the ERC Consolidator Grant FORHUE (101044724).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Fayyaz, M. et al. (2022). Adaptive Token Sampling for Efficient Vision Transformers. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13671. Springer, Cham. https://doi.org/10.1007/978-3-031-20083-0_24
DOI: https://doi.org/10.1007/978-3-031-20083-0_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20082-3
Online ISBN: 978-3-031-20083-0
eBook Packages: Computer Science (R0)