DOI: 10.1007/978-3-031-73247-8_25

DiffusionDepth: Diffusion Denoising Approach for Monocular Depth Estimation

Published: 01 November 2024

Abstract

Monocular depth estimation is a challenging task that predicts pixel-wise depth from a single 2D image. Current methods typically model this problem as a regression or classification task. We propose DiffusionDepth, a new approach that reformulates monocular depth estimation as a denoising diffusion process. It learns an iterative denoising process that "denoises" a random depth distribution into a depth map under the guidance of monocular visual conditions. The process is performed in a latent space encoded by a dedicated depth encoder and decoder. Instead of diffusing the ground-truth (GT) depth, the model learns to reverse the process of diffusing its own refined depth into a random depth distribution. This self-diffusion formulation overcomes the difficulty of applying generative models to sparse GT depth scenarios. The proposed approach benefits the task by refining the depth estimate step by step, which is well suited to generating accurate and highly detailed depth maps. Experimental results from both offline and online evaluations on the KITTI and NYU-Depth-V2 datasets indicate that the proposed method achieves state-of-the-art performance in both indoor and outdoor settings while maintaining a reasonable inference time. The code (https://github.com/duanyiqun/DiffusionDepth) is available online.
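The iterative denoising process described in the abstract can be sketched numerically. The toy example below is a minimal sketch under stated assumptions: it reduces the latent depth map to a single scalar, uses standard DDPM/DDIM notation (`betas`, `alpha_bars`, eta = 0), and replaces the paper's learned, visually conditioned denoising network with a hypothetical `toy_denoiser` stand-in. It illustrates the shape of the procedure, not the paper's actual architecture or schedule.

```python
import math
import random

T = 50  # number of diffusion steps (illustrative; not the paper's setting)

# Linear noise schedule and cumulative products, standard DDPM notation.
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alpha_bars = []
prod = 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bars.append(prod)

def q_sample(x0, t, eps):
    """Forward diffusion: mix a clean latent x0 with Gaussian noise eps at step t.
    In the self-diffusion formulation, x0 would be the model's own refined
    depth latent rather than the sparse GT depth."""
    return math.sqrt(alpha_bars[t]) * x0 + math.sqrt(1.0 - alpha_bars[t]) * eps

def toy_denoiser(x_t, t, condition):
    """Hypothetical stand-in for the learned network: predicts the clean latent.
    Here it simply returns the visual condition, mimicking a perfectly
    trained, condition-guided model."""
    return condition

def ddim_step(x_t, t, t_prev, x0_pred):
    """Deterministic DDIM update from step t to t_prev (eta = 0)."""
    eps_pred = (x_t - math.sqrt(alpha_bars[t]) * x0_pred) / math.sqrt(1.0 - alpha_bars[t])
    ab_prev = alpha_bars[t_prev] if t_prev >= 0 else 1.0
    return math.sqrt(ab_prev) * x0_pred + math.sqrt(1.0 - ab_prev) * eps_pred

random.seed(0)
condition = 0.7               # toy scalar standing in for monocular visual guidance
x = random.gauss(0.0, 1.0)    # start from a random depth distribution

# Training would regress toy_denoiser on pairs like (q_sample(x0, t, eps), x0);
# inference iteratively denoises from pure noise toward the conditioned estimate.
for t in range(T - 1, -1, -1):
    x0_pred = toy_denoiser(x, t, condition)
    x = ddim_step(x, t, t - 1, x0_pred)

print(round(x, 3))  # → 0.7
```

With a perfect denoiser the final step collapses exactly onto the conditioning value; in practice the network's prediction improves gradually, which is what gives the method its step-by-step refinement behavior.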


Published In

Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XI
Sep 2024
573 pages
ISBN:978-3-031-73246-1
DOI:10.1007/978-3-031-73247-8
Editors: Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, Gül Varol

Publisher

Springer-Verlag

Berlin, Heidelberg


Author Tags

  1. Depth Estimation
  2. Denoising Diffusion Probabilistic Models
