DOI: 10.1007/978-3-031-73247-8_25

DiffusionDepth: Diffusion Denoising Approach for Monocular Depth Estimation

Published: 01 November 2024

Abstract

Monocular depth estimation is a challenging task that predicts pixel-wise depth from a single 2D image. Current methods typically model this problem as a regression or classification task. We propose DiffusionDepth, a new approach that reformulates monocular depth estimation as a denoising diffusion process. It learns an iterative denoising process that "denoises" a random depth distribution into a depth map under the guidance of monocular visual conditions. The process is performed in a latent space encoded by a dedicated depth encoder and decoder. Instead of diffusing the ground-truth (GT) depth, the model learns to reverse the process of diffusing its own refined depth into a random depth distribution. This self-diffusion formulation overcomes the difficulty of applying generative models to sparse GT depth scenarios. The proposed approach benefits the task by refining the depth estimate step by step, which is well suited to generating accurate and highly detailed depth maps. Experimental results from both offline and online evaluations on the KITTI and NYU-Depth-V2 datasets indicate that the proposed method achieves state-of-the-art performance in both indoor and outdoor settings while maintaining a reasonable inference time. The code (https://github.com/duanyiqun/DiffusionDepth) is available online.
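The iterative denoising process described in the abstract can be sketched numerically. The toy example below is a minimal sketch under stated assumptions: it reduces the latent depth map to a single scalar, uses standard DDPM/DDIM notation (`betas`, `alpha_bars`, eta = 0), and replaces the paper's learned, visually conditioned denoising network with a hypothetical `toy_denoiser` stand-in. It illustrates the shape of the procedure, not the paper's actual architecture or schedule.

```python
import math
import random

T = 50  # number of diffusion steps (illustrative; not the paper's setting)

# Linear noise schedule and cumulative products, standard DDPM notation.
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alpha_bars = []
prod = 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bars.append(prod)

def q_sample(x0, t, eps):
    """Forward diffusion: mix a clean latent x0 with Gaussian noise eps at step t.
    In the self-diffusion formulation, x0 would be the model's own refined
    depth latent rather than the sparse GT depth."""
    return math.sqrt(alpha_bars[t]) * x0 + math.sqrt(1.0 - alpha_bars[t]) * eps

def toy_denoiser(x_t, t, condition):
    """Hypothetical stand-in for the learned network: predicts the clean latent.
    Here it simply returns the visual condition, mimicking a perfectly
    trained, condition-guided model."""
    return condition

def ddim_step(x_t, t, t_prev, x0_pred):
    """Deterministic DDIM update from step t to t_prev (eta = 0)."""
    eps_pred = (x_t - math.sqrt(alpha_bars[t]) * x0_pred) / math.sqrt(1.0 - alpha_bars[t])
    ab_prev = alpha_bars[t_prev] if t_prev >= 0 else 1.0
    return math.sqrt(ab_prev) * x0_pred + math.sqrt(1.0 - ab_prev) * eps_pred

random.seed(0)
condition = 0.7               # toy scalar standing in for monocular visual guidance
x = random.gauss(0.0, 1.0)    # start from a random depth distribution

# Training would regress toy_denoiser on pairs like (q_sample(x0, t, eps), x0);
# inference iteratively denoises from pure noise toward the conditioned estimate.
for t in range(T - 1, -1, -1):
    x0_pred = toy_denoiser(x, t, condition)
    x = ddim_step(x, t, t - 1, x0_pred)

print(round(x, 3))  # → 0.7
```

With a perfect denoiser the final step collapses exactly onto the conditioning value; in practice the network's prediction improves gradually, which is what gives the method its step-by-step refinement behavior.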


Published In

Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XI
Sep 2024
573 pages
ISBN:978-3-031-73246-1
DOI:10.1007/978-3-031-73247-8
Editors: Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, Gül Varol

Publisher

Springer-Verlag

Berlin, Heidelberg


Author Tags

  1. Depth Estimation
  2. Denoising Diffusion Probabilistic Models
