Abstract
The recent success of StyleGAN demonstrates that a pre-trained StyleGAN latent space is useful for realistic video generation. However, the generated motion is usually not semantically meaningful, because it is difficult to determine the direction and magnitude of movement in the StyleGAN latent space. In this paper, we propose a framework that generates realistic videos by leveraging a multimodal (sound-image-text) embedding space. Since sound provides the temporal context of a scene, our framework learns to generate a video that is semantically consistent with the sound. First, our sound inversion module maps the audio directly into the StyleGAN latent space. We then incorporate a CLIP-based multimodal embedding space to further model the audio-visual relationships. Finally, the proposed frame generator learns to find a trajectory in the latent space that is coherent with the corresponding sound, and generates a video in a hierarchical manner. We also provide a new high-resolution landscape video dataset (audio-visual pairs) for the sound-guided video generation task. Experiments show that our model outperforms state-of-the-art methods in terms of video quality. We further demonstrate several applications, including image and video editing, to verify the effectiveness of our method.
Code and more diverse examples are available at https://kuai-lab.github.io/eccv2022sound/.
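For illustration only, below is a minimal, self-contained sketch of the pipeline described in the abstract (audio feature -> sound-inverted latent -> per-frame latent trajectory -> frames). All module names (SoundInversion, FrameGenerator, ToyGenerator), dimensions, and architectures are hypothetical stand-ins, not the authors' implementation; in particular, ToyGenerator replaces the pre-trained StyleGAN generator, and the CLIP-based multimodal embedding loss is omitted.

# Hypothetical sketch of the sound-to-video pipeline; not the authors' code.
import torch
import torch.nn as nn

LATENT_DIM, NUM_FRAMES = 512, 16  # assumed latent size and clip length

class SoundInversion(nn.Module):
    """Maps a pooled audio feature (e.g. a mel-spectrogram embedding) to a latent code."""
    def __init__(self, audio_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU(),
                                 nn.Linear(256, LATENT_DIM))
    def forward(self, audio_feat):
        return self.net(audio_feat)                      # (B, LATENT_DIM)

class FrameGenerator(nn.Module):
    """Predicts a per-frame latent trajectory conditioned on the sound latent."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(LATENT_DIM, LATENT_DIM, batch_first=True)
    def forward(self, w_content, w_sound):
        # Repeat the sound latent over time; the GRU produces per-frame offsets.
        seq = w_sound.unsqueeze(1).repeat(1, NUM_FRAMES, 1)
        offsets, _ = self.rnn(seq)
        return w_content.unsqueeze(1) + offsets          # (B, T, LATENT_DIM)

class ToyGenerator(nn.Module):
    """Stand-in for a pre-trained StyleGAN generator (latent -> RGB image)."""
    def __init__(self, res=64):
        super().__init__()
        self.res = res
        self.net = nn.Linear(LATENT_DIM, 3 * res * res)
    def forward(self, w):
        return torch.tanh(self.net(w)).view(-1, 3, self.res, self.res)

if __name__ == "__main__":
    audio_feat = torch.randn(1, 128)                     # placeholder audio feature
    w_content = torch.randn(1, LATENT_DIM)               # content latent of a scene
    w_sound = SoundInversion()(audio_feat)               # audio inverted into latent space
    trajectory = FrameGenerator()(w_content, w_sound)    # (1, T, LATENT_DIM)
    frames = ToyGenerator()(trajectory.flatten(0, 1))    # (T, 3, 64, 64) video frames
    print(frames.shape)

In the actual method, the generator would be a frozen pre-trained StyleGAN and the trajectory would additionally be supervised so that generated frames align with the audio in the CLIP-based multimodal embedding space.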
Acknowledgement
This work is partially supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00079, Artificial Intelligence Graduate School Program (Korea University)). J. Kim is partially supported by the National Research Foundation of Korea grant (NRF-2021R1C1C1009608), the Basic Science Research Program (NRF-2021R1A6A1A13044830), and the ICT Creative Consilience program (IITP-2022-2022-0-01819). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agency.
Electronic Supplementary Material
Supplementary material is available in the online version of this chapter.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Lee, S.H. et al. (2022). Sound-Guided Semantic Video Generation. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13677. Springer, Cham. https://doi.org/10.1007/978-3-031-19790-1_3
DOI: https://doi.org/10.1007/978-3-031-19790-1_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19789-5
Online ISBN: 978-3-031-19790-1
eBook Packages: Computer Science, Computer Science (R0)