Abstract
Frequent codebook resets, commonly used to raise codebook utilization in the Vector Quantized Variational Autoencoder (VQ-VAE), can significantly alter the codebook distribution and thereby reduce training efficiency. In this work, we introduce a novel codebook learning approach, the Exponentially Weighted Moving Average Control VQ-VAE (ECVQ-VAE). The method treats the nearest-neighbor distances of the codebook observed during training as monitoring samples and constructs a control line from them. Our quantizer restricts the update of a codebook vector according to whether its monitored drift exceeds the control line, while simultaneously adjusting the overall usage distribution of the codebook by promoting competition among code vectors. This yields an optimization that sustains full codebook usage while reducing training cost. We show that our approach outperforms existing methods in lightweight scenarios and extensively validate the generalizability of our quantizer across various datasets, tasks, and architectures (VQ-VAE, VQ-GAN).
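To make the monitoring idea concrete, the sketch below implements a standard EWMA control chart over a stream of scalar monitoring samples (such as nearest-neighbor distances), flagging steps whose EWMA statistic exceeds the upper control limit. This is a generic illustration of the control-line concept, not the paper's implementation: the smoothing factor `lam`, the limit width `width`, and the choice to estimate location/scale from the monitored stream itself are illustrative assumptions.

```python
import numpy as np

def ewma_control(samples, lam=0.2, width=3.0):
    """EWMA control chart over monitoring samples (illustrative sketch).

    Returns the EWMA statistic per step and a boolean flag marking steps
    where the statistic crosses the upper control limit. `lam` and `width`
    are hypothetical choices; a real deployment would estimate the
    in-control mean/std from a baseline phase rather than the full stream.
    """
    samples = np.asarray(samples, dtype=float)
    mu, sigma = samples.mean(), samples.std()  # rough in-control estimates
    z = mu  # EWMA statistic initialized at the estimated process mean
    stats, flags = [], []
    for t, x in enumerate(samples, start=1):
        z = lam * x + (1.0 - lam) * z
        # time-varying EWMA variance: sigma^2 * lam/(2-lam) * (1 - (1-lam)^(2t))
        var = sigma**2 * lam / (2.0 - lam) * (1.0 - (1.0 - lam) ** (2 * t))
        ucl = mu + width * np.sqrt(var)  # upper control line
        stats.append(z)
        flags.append(z > ucl)  # drift beyond the control line
    return np.array(stats), np.array(flags)
```

In the spirit of the abstract, a quantizer could consult such flags to decide whether a code vector's update should be restricted (its monitored drift is out of control) or allowed to proceed.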
Data availability
All data included in this study are available upon request by contacting the corresponding author.
About this article
Cite this article
Wang, K., Shi, Q., Li, X. et al. Optimizing codebook training through control chart analysis. Multimedia Systems 31, 2 (2025). https://doi.org/10.1007/s00530-024-01555-x