Abstract
We present CLAMP-ViT, a data-free post-training quantization method for vision transformers (ViTs). We identify the limitations of recent techniques, notably their inability to leverage meaningful inter-patch relationships, leading to the generation of simplistic and semantically vague data and thus degraded quantization accuracy. CLAMP-ViT employs a two-stage approach, cyclically adapting between data generation and model quantization. Specifically, we incorporate a patch-level contrastive learning scheme to generate richer, semantically meaningful data. Furthermore, we leverage contrastive learning in a layer-wise evolutionary search for fixed- and mixed-precision quantization to identify optimal quantization parameters while mitigating the effects of a non-smooth loss landscape. Extensive evaluations across various vision tasks demonstrate the superiority of CLAMP-ViT, with performance improvements of up to 3% in top-1 accuracy for classification, 0.6 mAP for object detection, and 1.5 mIoU for segmentation at a similar or better compression ratio than existing alternatives. Code is available at https://github.com/georgia-tech-synergy-lab/CLAMP-ViT.git.
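The patch-level contrastive scheme mentioned in the abstract can be pictured as an InfoNCE-style objective computed over patch embeddings rather than whole-image embeddings. The snippet below is a minimal sketch under that assumption and is not the authors' implementation; the function name `patch_contrastive_loss`, the tensor shapes, and the temperature value are illustrative choices.

```python
# Minimal sketch (not the paper's code) of a patch-level InfoNCE-style contrastive loss.
# Assumption: emb_a and emb_b are [num_patches, dim] embeddings of two views of the same
# image, where patch i in view A is the positive of patch i in view B and all other
# patches act as negatives.
import torch
import torch.nn.functional as F

def patch_contrastive_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                           temperature: float = 0.1) -> torch.Tensor:
    # Normalize patch embeddings so dot products become cosine similarities.
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    # Pairwise similarity between every patch in view A and every patch in view B.
    logits = a @ b.t() / temperature              # [num_patches, num_patches]
    # Patch i in view A should match patch i in view B (diagonal targets).
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    # Example: 196 patches (14x14 grid) with 384-dim embeddings, e.g. DeiT-S sized.
    emb_a = torch.randn(196, 384)
    emb_b = torch.randn(196, 384)
    print(patch_contrastive_loss(emb_a, emb_b))
```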
Notes
1. Patch (subset of an image): a group of neighboring pixels in an image.
2. Fixed-precision: weights/activations quantized to the same precision for all layers.
3. Mixed-precision: weights/activations quantized to different precisions for different layers.
4. \((\max(\theta_i) - \min(\theta_i))/(2^b - 1)\), where \(\theta\) is the weight tensor (see the sketch below).
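As a concrete illustration of the scale in note 4, the sketch below computes the uniform min-max quantization step for a weight tensor at bit-width \(b\) and applies it. The function and variable names (`quantize_minmax`, `theta`) are illustrative assumptions, not the paper's code.

```python
# Minimal sketch of uniform min-max quantization of a weight tensor theta to b bits,
# using the scale (max(theta) - min(theta)) / (2^b - 1) from note 4.
import numpy as np

def quantize_minmax(theta: np.ndarray, b: int):
    lo, hi = theta.min(), theta.max()
    scale = max((hi - lo) / (2 ** b - 1), 1e-8)   # quantization step size
    q = np.round((theta - lo) / scale)            # integer codes in [0, 2^b - 1]
    return q.astype(np.int32), scale, lo

def dequantize(q: np.ndarray, scale: float, lo: float) -> np.ndarray:
    return q * scale + lo                         # reconstruct approximate weights

if __name__ == "__main__":
    theta = np.random.randn(4, 4).astype(np.float32)
    q, scale, lo = quantize_minmax(theta, b=4)
    print(np.abs(theta - dequantize(q, scale, lo)).max())  # max reconstruction error
```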
Acknowledgements
This work was supported in part by CoCoSys, one of the seven centers in JUMP 2.0, a Semiconductor Research Corporation (SRC) program sponsored by DARPA.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ramachandran, A., Kundu, S., Krishna, T. (2025). CLAMP-ViT: Contrastive Data-Free Learning for Adaptive Post-training Quantization of ViTs. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15125. Springer, Cham. https://doi.org/10.1007/978-3-031-72855-6_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72854-9
Online ISBN: 978-3-031-72855-6
eBook Packages: Computer Science, Computer Science (R0)