Abstract
Vector Quantized Variational Autoencoders (VQ-VAEs) have gained popularity in recent years due to their ability to represent images as discrete sequences of tokens that index a learned codebook of vectors, enabling efficient image compression. One variant of particular interest is VQ-VAE 2, which extends previous works by representing images as a hierarchy of sequences, resulting in finer-grained representations.
In this study, we further enhance this hierarchical autoencoder approach by introducing multiple decoders, which allow images to be represented as a sum of multi-scale contributions in pixel space. Our proposed model, the Multi-Scale (MS) VQ-VAE, not only enables better control over the encoding of each sequence (resulting in improved explainability and codebook usage) but, as a consequence, also offers advantages in image synthesis. Our experiments demonstrate that MS-VQVAE achieves comparable or superior reconstructions on various datasets and resolutions, as well as greater stability across runs. Moreover, we include a proof-of-concept trial to showcase the potential applications of our model in image synthesis.
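The core idea of the abstract, a reconstruction formed as the sum of pixel-space contributions from per-scale decoders, can be illustrated with a minimal NumPy sketch. This is not the authors' architecture: the average-pool "encoder", the identity-like "decoders", the codebook sizes, and all names are illustrative assumptions, and the straight-through gradient used to train real VQ-VAEs is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(z, codebook):
    # Nearest-codebook-vector lookup: map each latent vector to the
    # closest codebook entry (straight-through estimator omitted here).
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    idx = d.argmin(1)                                          # token sequence
    return codebook[idx], idx

def upsample(x, factor):
    # Nearest-neighbour upsampling of an (h, w) map to pixel resolution,
    # standing in for a learned decoder at that scale.
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

# Toy setup: an 8x8 grayscale "image" and two latent scales (4x4 and 2x2),
# each with its own codebook and its own decoder.
img = rng.standard_normal((8, 8))
codebooks = [rng.standard_normal((16, 1)), rng.standard_normal((16, 1))]

recon = np.zeros_like(img)
for scale, cb in zip([2, 4], codebooks):
    # Toy "encoder": average-pool the image down by `scale`.
    h = img.reshape(8 // scale, scale, 8 // scale, scale).mean(axis=(1, 3))
    zq, idx = quantize(h.reshape(-1, 1), cb)
    # Each scale's decoder output is ADDED in pixel space, so every token
    # sequence has a directly attributable contribution to the image.
    recon += upsample(zq.reshape(8 // scale, 8 // scale), scale)

print(recon.shape)  # pixel-space sum over all scales
```

Because each decoder writes into the same pixel grid additively, the contribution of each token sequence can be inspected in isolation, which is the source of the improved explainability claimed above.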
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Serez, D., Cristani, M., Murino, V., Del Bue, A., Morerio, P. (2023). Enhancing Hierarchical Vector Quantized Autoencoders for Image Synthesis Through Multiple Decoders. In: Foresti, G.L., Fusiello, A., Hancock, E. (eds) Image Analysis and Processing – ICIAP 2023. ICIAP 2023. Lecture Notes in Computer Science, vol 14234. Springer, Cham. https://doi.org/10.1007/978-3-031-43153-1_33
Print ISBN: 978-3-031-43152-4
Online ISBN: 978-3-031-43153-1