Abstract
We present CLAMP-ViT, a data-free post-training quantization method for vision transformers (ViTs). We identify the limitations of recent techniques, notably their inability to leverage meaningful inter-patch relationships, leading to the generation of simplistic and semantically vague data and thus degraded quantization accuracy. CLAMP-ViT employs a two-stage approach, cyclically adapting between data generation and model quantization. Specifically, we incorporate a patch-level contrastive learning scheme to generate richer, semantically meaningful data. Furthermore, we leverage contrastive learning in a layer-wise evolutionary search for fixed- and mixed-precision quantization to identify optimal quantization parameters while mitigating the effects of a non-smooth loss landscape. Extensive evaluations across various vision tasks demonstrate the superiority of CLAMP-ViT, with performance improvements of up to 3% in top-1 accuracy for classification, 0.6 mAP for object detection, and 1.5 mIoU for segmentation at a similar or better compression ratio than existing alternatives. Code is available at https://github.com/georgia-tech-synergy-lab/CLAMP-ViT.git.
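The patch-level contrastive scheme mentioned in the abstract can be pictured as an InfoNCE-style objective computed over patch embeddings rather than whole-image embeddings. The snippet below is a minimal sketch under that assumption and is not the authors' implementation; the function name `patch_contrastive_loss`, the tensor shapes, and the temperature value are illustrative choices.

```python
# Minimal sketch (not the paper's code) of a patch-level InfoNCE-style contrastive loss.
# Assumption: emb_a and emb_b are [num_patches, dim] embeddings of two views of the same
# image, where patch i in view A is the positive of patch i in view B and all other
# patches act as negatives.
import torch
import torch.nn.functional as F

def patch_contrastive_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                           temperature: float = 0.1) -> torch.Tensor:
    # Normalize patch embeddings so dot products become cosine similarities.
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    # Pairwise similarity between every patch in view A and every patch in view B.
    logits = a @ b.t() / temperature              # [num_patches, num_patches]
    # Patch i in view A should match patch i in view B (diagonal targets).
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    # Example: 196 patches (14x14 grid) with 384-dim embeddings, e.g. DeiT-S sized.
    emb_a = torch.randn(196, 384)
    emb_b = torch.randn(196, 384)
    print(patch_contrastive_loss(emb_a, emb_b))
```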
Notes
1. Patch (subset of an image): a group of neighboring pixels in an image.
2. Fixed-precision: weights/activations quantized to the same precision for all layers.
3. Mixed-precision: weights/activations quantized to different precisions for different layers.
4. \((\max(\theta_i) - \min(\theta_i))/(2^b - 1)\), where \(\theta\) is the weight tensor (see the sketch below).
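As a concrete illustration of the scale in note 4, the sketch below computes the uniform min-max quantization step for a weight tensor at bit-width \(b\) and applies it. The function and variable names (`quantize_minmax`, `theta`) are illustrative assumptions, not the paper's code.

```python
# Minimal sketch of uniform min-max quantization of a weight tensor theta to b bits,
# using the scale (max(theta) - min(theta)) / (2^b - 1) from note 4.
import numpy as np

def quantize_minmax(theta: np.ndarray, b: int):
    lo, hi = theta.min(), theta.max()
    scale = max((hi - lo) / (2 ** b - 1), 1e-8)   # quantization step size
    q = np.round((theta - lo) / scale)            # integer codes in [0, 2^b - 1]
    return q.astype(np.int32), scale, lo

def dequantize(q: np.ndarray, scale: float, lo: float) -> np.ndarray:
    return q * scale + lo                         # reconstruct approximate weights

if __name__ == "__main__":
    theta = np.random.randn(4, 4).astype(np.float32)
    q, scale, lo = quantize_minmax(theta, b=4)
    print(np.abs(theta - dequantize(q, scale, lo)).max())  # max reconstruction error
```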
Acknowledgements
This work was supported in part by CoCoSys, one of the seven centers in JUMP 2.0, a Semiconductor Research Corporation (SRC) program sponsored by DARPA.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ramachandran, A., Kundu, S., Krishna, T. (2025). CLAMP-ViT: Contrastive Data-Free Learning for Adaptive Post-training Quantization of ViTs. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15125. Springer, Cham. https://doi.org/10.1007/978-3-031-72855-6_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72854-9
Online ISBN: 978-3-031-72855-6
eBook Packages: Computer Science, Computer Science (R0)