Enhancing Hierarchical Vector Quantized Autoencoders for Image Synthesis Through Multiple Decoders

  • Conference paper
Image Analysis and Processing – ICIAP 2023 (ICIAP 2023)

Abstract

Vector Quantized Variational Autoencoders (VQ-VAEs) have gained popularity in recent years due to their ability to represent images as discrete sequences of tokens that index a learned codebook of vectors, enabling efficient image compression. One variant of particular interest is VQ-VAE 2, which extends previous work by representing images as a hierarchy of sequences, resulting in finer-grained representations.
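The token-and-codebook idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: the codebook is hand-written rather than learned, the latent vectors are made up instead of coming from an encoder network, and the helper names (`quantize`, `lookup`, `squared_distance`) are hypothetical. It only shows how each latent vector is replaced by the index of its nearest codebook entry, and how those indices are turned back into vectors.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each latent vector to the index (token) of its nearest codebook vector."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]


def lookup(tokens: List[int], codebook: List[Vector]) -> List[Vector]:
    """Recover the quantized vectors from a sequence of tokens."""
    return [codebook[t] for t in tokens]


# Toy values (assumptions for illustration only): a 3-entry codebook of 2-D
# vectors and 4 latent vectors standing in for an encoder's output.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]
latents = [(0.1, -0.2), (0.9, 1.2), (-0.8, 1.7), (0.6, 0.7)]

tokens = quantize(latents, codebook)   # discrete sequence of codebook indices
quantized = lookup(tokens, codebook)   # vectors that a decoder would consume
print(tokens)
```

Only the integer tokens need to be stored or modeled, which is where the compression comes from; the codebook is shared across all images.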

In this study, we further enhance this hierarchical autoencoder approach by introducing multiple decoders, which allow images to be represented as a sum of multi-scale contributions in pixel space. Our proposed model, the Multi Scale (MS) VQ-VAE, not only enables better control over the encoding of each sequence (resulting in improved explainability and codebook usage) but, as a consequence, also shows advantages in image synthesis. Our experiments demonstrate that the MS-VQVAE achieves comparable or superior reconstructions on various datasets and resolutions, as well as greater stability across runs. Moreover, we include a proof-of-concept trial to showcase the potential applications of our model in image synthesis.
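As a rough illustration of "a sum of multi-scale contributions in pixel space", the sketch below adds two full-resolution images, each produced from a different level of a two-level hierarchy. Everything in it is an assumption made for illustration: the abstract does not describe the decoder architecture, so nearest-neighbour upsampling stands in for a learned decoder, and the names `upsample_nearest` and `sum_images` as well as the pixel values are hypothetical.

```python
from typing import List

Image = List[List[float]]


def upsample_nearest(grid: Image, factor: int) -> Image:
    """Toy stand-in for a decoder: repeat each value factor x factor times."""
    return [
        [value for value in row for _ in range(factor)]
        for row in grid
        for _ in range(factor)
    ]


def sum_images(images: List[Image]) -> Image:
    """Element-wise sum of images that all share the same height and width."""
    return [
        [sum(pixels) for pixels in zip(*rows)]
        for rows in zip(*images)
    ]


# Hypothetical two-level hierarchy: a coarse 1x1 level and a fine 2x2 level.
coarse = [[0.5]]
fine = [[0.25, -0.25], [0.0, 0.125]]

# Each level gets its own "decoder", and each decoder outputs a full-size image.
contributions = [upsample_nearest(coarse, 2), upsample_nearest(fine, 1)]

# The reconstruction is the pixel-wise sum of the per-level contributions.
reconstruction = sum_images(contributions)
print(reconstruction)
```

Because every level yields its own image before the sum, each contribution can be inspected separately, which is one way to read the abstract's claim of improved explainability.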

Author information

Corresponding author

Correspondence to Dario Serez.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Serez, D., Cristani, M., Murino, V., Del Bue, A., Morerio, P. (2023). Enhancing Hierarchical Vector Quantized Autoencoders for Image Synthesis Through Multiple Decoders. In: Foresti, G.L., Fusiello, A., Hancock, E. (eds) Image Analysis and Processing – ICIAP 2023. ICIAP 2023. Lecture Notes in Computer Science, vol 14234. Springer, Cham. https://doi.org/10.1007/978-3-031-43153-1_33

  • DOI: https://doi.org/10.1007/978-3-031-43153-1_33

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43152-4

  • Online ISBN: 978-3-031-43153-1

  • eBook Packages: Computer Science, Computer Science (R0)
