Learning to Generate Conditional Tri-Plane for 3D-Aware Expression Controllable Portrait Animation

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

In this paper, we present \(\text{Export3D}\), a one-shot 3D-aware portrait animation method that controls the facial expression and camera view of a given portrait image. To achieve this, we introduce a tri-plane generator with an effective expression conditioning method, which directly generates a tri-plane of a 3D prior by transferring the 3DMM expression parameter onto the source image. The tri-plane is then decoded into images of different views through differentiable volume rendering. Existing portrait animation methods rely heavily on image warping to transfer expression in the motion space, which makes it difficult to disentangle appearance from expression. In contrast, we propose a contrastive pre-training framework for an appearance-free expression parameter, eliminating the undesirable appearance swap that occurs when transferring a cross-identity expression. Extensive experiments show that our pre-training framework can learn the appearance-free expression representation hidden in 3DMM, and that our model can generate 3D-aware, expression-controllable portrait images without appearance swap in a cross-identity manner.
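
To make the abstract's two mechanisms concrete, below is a minimal PyTorch sketch (not the authors' implementation; all module names, dimensionalities, and the positive-pair construction are illustrative assumptions). The first helper shows EG3D-style tri-plane querying, where each 3D point is projected onto three axis-aligned feature planes and the bilinearly sampled features are summed before being decoded by volume rendering. The second part sketches an InfoNCE-style contrastive objective over 3DMM expression parameters, assuming positives pair the same expression across different identities so that the learned embedding discards appearance cues.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_triplane(planes, pts):
    """EG3D-style tri-plane query: project each 3D point onto the XY, XZ,
    and YZ feature planes, bilinearly sample, and sum the features.
    planes: (3, C, H, W) feature planes; pts: (N, 3) in [-1, 1]^3."""
    projections = (pts[:, [0, 1]], pts[:, [0, 2]], pts[:, [1, 2]])
    feats = 0
    for plane, coords in zip(planes, projections):
        grid = coords.view(1, -1, 1, 2)                    # (1, N, 1, 2)
        f = F.grid_sample(plane[None], grid, align_corners=False)
        feats = feats + f.view(plane.size(0), -1).t()      # (N, C)
    return feats  # per-point features for a volume-rendering decoder

class ExpressionEncoder(nn.Module):
    """Maps a 3DMM expression vector (64-D here, an assumption) to a
    unit-norm embedding intended to be appearance-free."""
    def __init__(self, dim_in=64, dim_out=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_in, 256), nn.ReLU(),
                                 nn.Linear(256, dim_out))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def info_nce(anchor, positive, temperature=0.07):
    """Standard InfoNCE: row i of `positive` is the positive for row i of
    `anchor`; every other row in the batch serves as a negative."""
    logits = anchor @ positive.t() / temperature           # (B, B)
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)

# Toy usage with random stand-ins. expr_a and expr_b hypothetically hold
# 3DMM parameters of the same expression rendered on two identities.
planes = torch.randn(3, 32, 64, 64)
feats = sample_triplane(planes, torch.rand(5, 3) * 2 - 1)  # (5, 32)
encoder = ExpressionEncoder()
expr_a, expr_b = torch.randn(8, 64), torch.randn(8, 64)
loss = info_nce(encoder(expr_a), encoder(expr_b))
loss.backward()
```

Batch-internal negatives keep such a contrastive objective cheap: within a batch of B pairs, each anchor is classified against its one positive and B - 1 negatives, which is the pressure that pushes identity-specific information out of the expression embedding.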

Author information

Corresponding author

Correspondence to Taekyung Ki.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (mp4 814 KB)

Supplementary material 2 (mp4 2005 KB)

Supplementary material 3 (mp4 571 KB)

Supplementary material 4 (mp4 2267 KB)

Supplementary material 5 (mp4 258 KB)

Supplementary material 6 (mp4 1145 KB)

Supplementary material 7 (mp4 517 KB)

Supplementary material 8 (mp4 3245 KB)

Supplementary material 9 (mp4 753 KB)

Supplementary material 10 (mp4 527 KB)

Supplementary material 11 (mp4 561 KB)

Supplementary material 12 (mp4 590 KB)

Supplementary material 13 (mp4 160 KB)

Supplementary material 14 (mp4 142 KB)

Supplementary material 15 (mp4 232 KB)

Supplementary material 16 (mp4 255 KB)

Supplementary material 17 (mp4 223 KB)

Supplementary material 18 (mp4 379 KB)

Supplementary material 19 (mp4 606 KB)

Supplementary material 20 (mp4 705 KB)

Supplementary material 21 (pdf 1544 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Ki, T., Min, D., Chae, G. (2025). Learning to Generate Conditional Tri-Plane for 3D-Aware Expression Controllable Portrait Animation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15059. Springer, Cham. https://doi.org/10.1007/978-3-031-73232-4_27

  • DOI: https://doi.org/10.1007/978-3-031-73232-4_27

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73231-7

  • Online ISBN: 978-3-031-73232-4

  • eBook Packages: Computer Science, Computer Science (R0)
