Abstract
We present 4Diff, a 3D-aware diffusion model addressing the exo-to-ego viewpoint translation task—generating first-person (egocentric) view images from the corresponding third-person (exocentric) images. Building on the diffusion model’s ability to generate photorealistic images, we propose a transformer-based diffusion model that incorporates geometry priors through two mechanisms: (i) egocentric point cloud rasterization and (ii) 3D-aware rotary cross-attention. Egocentric point cloud rasterization converts the input exocentric image into an egocentric layout, which is subsequently used by a diffusion image transformer. As a component of the diffusion transformer’s denoiser block, the 3D-aware rotary cross-attention further incorporates 3D information and semantic features from the source exocentric view. Our 4Diff achieves state-of-the-art results on the challenging and diverse Ego-Exo4D multiview dataset and exhibits robust generalization to novel environments not encountered during training. Our code, processed data, and pretrained models are publicly available at https://klauscc.github.io/4diff.
Feng Cheng, Mi Luo: Equal contribution
Gedas Bertasius, Kristen Grauman: Co-led the project.
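To make the egocentric point cloud rasterization step concrete, below is a minimal sketch of how an exocentric image could be lifted to 3D with a depth map and re-projected into an egocentric layout given the relative camera pose. This is an illustrative assumption, not the paper's implementation: the function name `rasterize_exo_to_ego`, its arguments, and the nearest-point splatting scheme are hypothetical, and the actual pipeline (depth source, occlusion handling, resolution) may differ.

```python
import torch


def rasterize_exo_to_ego(img, depth, K_exo, K_ego, R, t, ego_hw):
    """Sketch: lift exo pixels to 3D using a depth map, transform them into
    the ego camera frame, and scatter their colors onto an egocentric layout.

    img:   (H, W, 3) float exocentric image in [0, 1]
    depth: (H, W) per-pixel z-depth in the exo camera frame
    K_exo, K_ego: (3, 3) camera intrinsics
    R, t:  rotation (3, 3) and translation (3,) mapping exo -> ego frame
    ego_hw: (height, width) of the output egocentric layout
    """
    H, W, _ = img.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()   # (H, W, 3) homogeneous pixels
    rays = pix.reshape(-1, 3) @ torch.inverse(K_exo).T              # back-project to camera rays
    pts_exo = rays * depth.reshape(-1, 1)                           # 3D points in the exo frame
    pts_ego = pts_exo @ R.T + t                                     # move points into the ego frame

    proj = pts_ego @ K_ego.T                                        # project with ego intrinsics
    z = proj[:, 2:3].clamp(min=1e-6)
    uv = (proj[:, :2] / z).round().long()                           # integer ego pixel coordinates

    Hh, Ww = ego_hw
    valid = (
        (uv[:, 0] >= 0) & (uv[:, 0] < Ww)
        & (uv[:, 1] >= 0) & (uv[:, 1] < Hh)
        & (proj[:, 2] > 0)                                          # keep points in front of the ego camera
    )

    layout = torch.zeros(Hh, Ww, 3)
    # Simple splatting heuristic: write far-to-near so closer points tend to
    # overwrite farther ones (an approximation of a z-buffer).
    order = torch.argsort(z[valid, 0], descending=True)
    idx = uv[valid][order]
    layout[idx[:, 1], idx[:, 0]] = img.reshape(-1, 3)[valid][order]
    return layout


if __name__ == "__main__":
    # Toy example with random inputs and an identity exo-to-ego pose.
    img = torch.rand(64, 64, 3)
    depth = torch.rand(64, 64) * 2 + 1
    K = torch.tensor([[64.0, 0.0, 32.0], [0.0, 64.0, 32.0], [0.0, 0.0, 1.0]])
    layout = rasterize_exo_to_ego(img, depth, K, K, torch.eye(3), torch.zeros(3), (64, 64))
    print(layout.shape)  # torch.Size([64, 64, 3])
```

In the approach described by the abstract, a layout of this kind conditions the diffusion image transformer, while the 3D-aware rotary cross-attention injects exocentric semantic features inside each denoiser block; the sketch above only illustrates the geometric re-projection idea.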
Acknowledgment
We thank Hanwen Jiang, Yan-Bo Lin, Md Mohaiminul Islam, Ce Zhang, Yue Yang, and Soumitri Chattopadhyay for their helpful discussions. UT Austin is supported by NSF Grants AF 1901292, CNS 2148141, Tripods CCF 1934932, and IFML CCF 2019844, and by research gifts from Western Digital, Amazon, WNCG IAP, the UT Austin Machine Learning Lab (MLL), Cisco, and the Stanly P. Finch Centennial Professorship in Engineering. UNC is supported by a Sony Faculty Innovation Award, the Laboratory for Analytic Sciences via NC State University, and ONR Award N00014-23-1-2356. K.G. is paid as a research scientist at Meta.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Cheng, F. et al. (2025). 4DIFF: 3D-Aware Diffusion Model for Third-to-First Viewpoint Translation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15082. Springer, Cham. https://doi.org/10.1007/978-3-031-72691-0_23
DOI: https://doi.org/10.1007/978-3-031-72691-0_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72690-3
Online ISBN: 978-3-031-72691-0