Abstract
Vector Quantized Variational Autoencoders (VQ-VAEs) have gained popularity in recent years due to their ability to represent images as discrete sequences of tokens that index a learned codebook of vectors, enabling efficient image compression. One variant of particular interest is VQ-VAE 2, which extends previous works by representing images as a hierarchy of sequences, resulting in finer-grained representations.
In this study, we further enhance this hierarchical autoencoder approach by introducing multiple decoders, which allow images to be represented as a sum of multi-scale contributions in pixel space. Our proposed model, the Multi-Scale (MS) VQ-VAE, not only enables better control over the encoding of each sequence (resulting in improved explainability and codebook usage) but, as a consequence, also offers advantages in image synthesis. Our experiments demonstrate that MS-VQVAE achieves comparable or superior reconstructions on various datasets and resolutions, as well as greater stability across runs. Moreover, we include a proof-of-concept trial to showcase the potential applications of our model in image synthesis.
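The core idea of the abstract, a reconstruction formed as the sum of pixel-space contributions from per-scale decoders, can be illustrated with a minimal NumPy sketch. This is not the authors' architecture: the average-pool "encoder", the identity-like "decoders", the codebook sizes, and all names are illustrative assumptions, and the straight-through gradient used to train real VQ-VAEs is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(z, codebook):
    # Nearest-codebook-vector lookup: map each latent vector to the
    # closest codebook entry (straight-through estimator omitted here).
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    idx = d.argmin(1)                                          # token sequence
    return codebook[idx], idx

def upsample(x, factor):
    # Nearest-neighbour upsampling of an (h, w) map to pixel resolution,
    # standing in for a learned decoder at that scale.
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

# Toy setup: an 8x8 grayscale "image" and two latent scales (4x4 and 2x2),
# each with its own codebook and its own decoder.
img = rng.standard_normal((8, 8))
codebooks = [rng.standard_normal((16, 1)), rng.standard_normal((16, 1))]

recon = np.zeros_like(img)
for scale, cb in zip([2, 4], codebooks):
    # Toy "encoder": average-pool the image down by `scale`.
    h = img.reshape(8 // scale, scale, 8 // scale, scale).mean(axis=(1, 3))
    zq, idx = quantize(h.reshape(-1, 1), cb)
    # Each scale's decoder output is ADDED in pixel space, so every token
    # sequence has a directly attributable contribution to the image.
    recon += upsample(zq.reshape(8 // scale, 8 // scale), scale)

print(recon.shape)  # pixel-space sum over all scales
```

Because each decoder writes into the same pixel grid additively, the contribution of each token sequence can be inspected in isolation, which is the source of the improved explainability claimed above.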
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Serez, D., Cristani, M., Murino, V., Del Bue, A., Morerio, P. (2023). Enhancing Hierarchical Vector Quantized Autoencoders for Image Synthesis Through Multiple Decoders. In: Foresti, G.L., Fusiello, A., Hancock, E. (eds) Image Analysis and Processing – ICIAP 2023. ICIAP 2023. Lecture Notes in Computer Science, vol 14234. Springer, Cham. https://doi.org/10.1007/978-3-031-43153-1_33
Print ISBN: 978-3-031-43152-4
Online ISBN: 978-3-031-43153-1