Superb Monocular Depth Estimation Based on Transfer Learning and Surface Normal Guidance
"> Figure 1
<p>The comparison of depth maps were produced by different methods. (<b>a</b>) Raw red-green-blue (RGB) images (<b>b</b>) Ground truth (GT) depth maps [<a href="#B14-sensors-20-04856" class="html-bibr">14</a>], (<b>c</b>) Depth maps from the state-of-the-art (SOTA) practice [<a href="#B7-sensors-20-04856" class="html-bibr">7</a>], (<b>d</b>) Depth maps from our depth prediction network.</p> "> Figure 2
<p>The comparison of surface normal maps: (<b>a</b>) from left to right: RGB images, Ground-Truth (GT), surface normal maps produced by Qi et al. [<a href="#B10-sensors-20-04856" class="html-bibr">10</a>], ours. (<b>b</b>) Color-map definition: red represents left, green represents up, and blue represents outward.</p> "> Figure 3
<p>General estimation framework. (<b>a</b>) Coarse depth estimation network; (<b>b</b>) red-green-blue-depth (RGB-D) surface normal network; (<b>c</b>) refinement network.</p> "> Figure 4
<p>Encoder–decoder coarse depth network.</p> "> Figure 5
<p>Generating coarse surface normal image. (<b>a</b>) Coarse depth (<math display="inline"><semantics> <mrow> <msup> <mi>D</mi> <mo>*</mo> </msup> </mrow> </semantics></math>); (<b>b</b>)<math display="inline"><semantics> <mrow> <mo> </mo> <msup> <mi>N</mi> <mo>*</mo> </msup> </mrow> </semantics></math>.</p> "> Figure 6
<p>Surface normal adjustment network (Dense-net-121 based). (<b>a</b>) The general structure of the RGB-D surface normal network (RSN) network; (<b>b</b>) the architectures of up-projection units, fusion module and convolution blocks.</p> "> Figure 7
<p>Comparison of point clouds from estimated depth maps between Alhashim [<a href="#B6-sensors-20-04856" class="html-bibr">6</a>] and ours. (<b>a</b>) Alhashim [<a href="#B6-sensors-20-04856" class="html-bibr">6</a>], (<b>b</b>) GT, (<b>c</b>) Ours. GT stands for 3D point cloud maps from ground truth images.</p> "> Figure 8
<p>Qualitative results for depth estimation. (<b>a</b>) RGB images, (<b>b</b>) ground truth (GT), (<b>c</b>) Alhashim [<a href="#B2-sensors-20-04856" class="html-bibr">2</a>], (<b>d</b>) Our Coarse; (<b>e</b>) Refined Depth.</p> "> Figure 9
<p>Qualitative results for surface normal estimation. (<b>a</b>) RGB images, (<b>b</b>) ground truth normal maps, (<b>c</b>), Geo-Net [<a href="#B10-sensors-20-04856" class="html-bibr">10</a>], (<b>d</b>) reconstructed from in-painted ground truth depth (<b>e</b>) ours, and (<b>f</b>) reconstructed from refined depth. All images are equally scaled for better visualization.</p> "> Figure 10
<p>Comparison of time consumption: Runtimes of different up-sampling methods including. 2× bilinear interpolation [<a href="#B6-sensors-20-04856" class="html-bibr">6</a>], up and down projection [<a href="#B48-sensors-20-04856" class="html-bibr">48</a>], and up-projection [<a href="#B17-sensors-20-04856" class="html-bibr">17</a>].</p> "> Figure 11
<p>Refined depth images generating from custom images (densenet-161 model).</p> "> Figure A1
<p>Results of different methods on NYU V2 Depth. (<b>a</b>) Red-green-blue (RGB) images, (<b>b</b>) ground truth (GT), (<b>c</b>) Ranftl [<a href="#B29-sensors-20-04856" class="html-bibr">29</a>], (<b>d</b>) Hu [<a href="#B30-sensors-20-04856" class="html-bibr">30</a>], (<b>e</b>) Alhashim [<a href="#B3-sensors-20-04856" class="html-bibr">3</a>], (<b>f</b>) ours.</p> "> Figure A2
<p>Three-dimensional point cloud maps comparison on NYU Depth V2.</p> "> Figure A3
<p>Point cloud maps were reconstructed based on custom images.</p> ">
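Several of the figures above (Figures 5, 7, A2, and A3) involve recovering 3D structure from a depth map: each pixel is back-projected through the camera intrinsics to a 3D point, and per-pixel surface normals can then be taken from the cross product of neighboring tangent vectors. The NumPy sketch below illustrates this standard pinhole back-projection; the intrinsic parameters are placeholders and the function names are ours, not the paper's or the Depth2pointCloud tool's exact implementation.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (H, W) to an (H, W, 3) point cloud with the
    pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

def normals_from_point_cloud(points):
    """Estimate per-pixel normals as the cross product of the horizontal and
    vertical tangent vectors of the point cloud."""
    dx = np.gradient(points, axis=1)   # tangent along image columns
    dy = np.gradient(points, axis=0)   # tangent along image rows
    n = np.cross(dx, dy)
    return n / (np.linalg.norm(n, axis=-1, keepdims=True) + 1e-8)

# Example with placeholder intrinsics (not the calibration used in the paper).
depth = np.random.uniform(0.5, 10.0, size=(480, 640)).astype(np.float32)
cloud = depth_to_point_cloud(depth, fx=582.6, fy=582.7, cx=313.0, cy=238.4)
normals = normals_from_point_cloud(cloud)
```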
Abstract
1. Introduction
1.1. Background
1.2. Ideas
1.3. Approach
1.4. Contributions
2. Related Work
3. Our Method
3.1. Framework Overview
3.2. Coarse Depth Estimation (CDE) Network
3.3. RGB-D Surface Normal (RSN) Network
3.4. Refinement Network
3.5. Loss Function
3.6. Recovering 3D Features from Estimated Depth
4. Experiments
4.1. Dataset
4.2. Implementation Details
4.3. Evaluation Criteria
4.4. Benchmark Performance Comparison
4.5. Computational Performance
4.6. Ablation Study
4.7. Custom Results
5. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
Appendix A
Appendix B
Layer | Output | Operator |
---|---|---|
Input | 640 × 480 × 3 | - |
Convolution1 (CONV. 1) | 320 × 240 × 96 | CONV. 7 × 7 |
DenseBlock1 [26] (DB.1) | 160 × 120 × 96 | Avg. Pooling 2 × 2 |
DB2 | 80 × 60 × 192 | Avg. Pooling 2 × 2 |
DB3 | 40 × 30 × 384 | Avg. Pooling 2 × 2 |
DB4 | 20 × 15 × 1056 | Avg. Pooling 2 × 2 |
CONV. 2 | 20 × 15 × 2208 | CONV. 1 × 1 |
Bottleneck Layer | 20 × 15 × 1024 | Batch Norm + ReLU + CONV. 1 × 1 |
UP-Sample1 (US1) | 40 × 30 × 1024 | Up-sample 2 × 2 |
Concatenating 1 | 40 × 30 × 1408 | Concatenating DB3 Avg. Pooling |
US1-CONV. 1 [2] | 40 × 30 × 512 | CONV. 3 × 3 |
US1-CONV. 2 | 40 × 30 × 512 | CONV. 3 × 3 |
UP-Sample2 (US2) | 80 × 60 × 512 | Up-sample 2 × 2 |
Concatenating2 | 80 × 60 × 704 | Concatenating DB2 Avg. Pooling |
US2-CONV. 1 | 80 × 60 × 256 | CONV. 3 × 3 |
US2-CONV. 2 | 80 × 60 × 256 | CONV. 3 × 3 |
UP-Sample3 (US3) | 160 × 120 × 256 | Up-sample 2 × 2 |
Concatenating3 | 160 × 120 × 352 | Concatenating DB1 Avg. Pooling |
US3-CONV. 1 | 160 × 120 × 128 | CONV. 3 × 3 |
US3-CONV. 2 | 160 × 120 × 128 | CONV. 3 × 3 |
UP-Sample4 (US4) | 320 × 240 × 128 | Up-sample 2 × 2 |
Concatenating4 | 320 × 240 × 224 | Concatenating CONV. 1 Avg. Pooling |
US4-CONV. 1 | 320 × 240 × 64 | CONV. 3 × 3 |
US4-CONV. 2 | 320 × 240 × 64 | CONV. 3 × 3 |
CONV. 3 | 320 × 240 × 1 | CONV. 3 × 3 |
Output | 320 × 240 × 1 | - |
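Each decoder stage in the table above repeats one pattern: up-sample 2×, concatenate the pooled skip features from the matching encoder stage, then apply two 3 × 3 convolutions (US*-CONV. 1/2). The PyTorch sketch below mirrors that pattern with the channel sizes from the table; it is an illustration rather than the authors' code, and it uses bilinear interpolation for the up-sampling step, whereas the paper also compares up-projection and up-and-down projection variants (Figure 10).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpSampleBlock(nn.Module):
    """Up-sample, concatenate an encoder skip feature, then apply two 3x3 convs
    (the US*-CONV. 1 / US*-CONV. 2 pattern in the table above)."""
    def __init__(self, in_channels, skip_channels, out_channels):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels + skip_channels, out_channels, 3, padding=1)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)

    def forward(self, x, skip):
        # Resize to the skip feature's resolution (a 2x step in this decoder).
        x = F.interpolate(x, size=skip.shape[2:], mode="bilinear", align_corners=True)
        x = torch.cat([x, skip], dim=1)
        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

# Channel sizes taken from the table: bottleneck 1024, skips 384/192/96/96.
us1 = UpSampleBlock(1024, 384, 512)    # 20x15  -> 40x30
us2 = UpSampleBlock(512, 192, 256)     # 40x30  -> 80x60
us3 = UpSampleBlock(256, 96, 128)      # 80x60  -> 160x120
us4 = UpSampleBlock(128, 96, 64)       # 160x120 -> 320x240
head = nn.Conv2d(64, 1, 3, padding=1)  # CONV. 3: single-channel depth output
```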
Layer | Output | Operator |
---|---|---|
INPUT | 640 × 480 × 3 | - |
Convolution1 (CONV.1) | 320 × 240 × 32 | CONV. 3 × 3 |
Inverted Residual [3] Block1 (IRB1) | 320 × 240 × 16 | Bottleneck × 1 |
Inverted Residual Block2 (IRB2) | 160 × 120 × 24 | Bottleneck × 6 |
Inverted Residual Block3 (IRB3) | 80 × 60 × 32 | Bottleneck × 6 |
Inverted Residual Block4 (IRB4) | 80 × 60 × 64 | Bottleneck × 6 |
Inverted Residual Block5 (IRB5) | 40 × 30 × 96 | Bottleneck × 6 |
Inverted Residual Block6 (IRB6) | 40 × 30 × 160 | Bottleneck × 6 |
Inverted Residual Block7 (IRB7) | 20 × 15 × 320 | Bottleneck × 6 |
CONV. 2 | 20 × 15 × 1280 | CONV. 1 × 1 |
CONV. 3 | 20 × 15 × 768 | CONV. 1 × 1 |
UP-Sample1 (US1) | 40 × 30 × 768 | Up-sample 2 × 2 |
Concatenating1 | 40 × 30 × 928 | Concatenating IRB6 |
US1-CONV. 1 | 40 × 30 × 384 | CONV. 3 × 3 |
US1-CONV. 2 | 40 × 30 × 384 | CONV. 3 × 3 |
UP-Sample2 (US2) | 80 × 60 × 384 | Up-sample 2 × 2 |
Concatenating2 | 80 × 60 × 448 | Concatenating IRB4 |
US2-CONV. 1 | 80 × 60 × 192 | CONV. 3 × 3 |
US2-CONV. 2 | 80 × 60 × 192 | CONV. 3 × 3 |
UP-Sample3 (US3) | 160 × 120 × 192 | Up-sample 2 × 2 |
Concatenating3 | 160 × 120 × 216 | Concatenating IRB2 |
US3-CONV. 1 | 160 × 120 × 96 | CONV. 3 × 3 |
US3-CONV. 2 | 160 × 120 × 96 | CONV. 3 × 3 |
UP-Sample4 (US4) | 320 × 240 × 96 | Up-sample 2 × 2 |
Concatenating4 | 320 × 240 × 112 | Concatenating IRB1 |
US4-CONV. 1 | 320 × 240 × 64 | CONV. 3 × 3 |
US4-CONV. 2 | 320 × 240 × 64 | CONV. 3 × 3 |
CONV. 4 | 320 × 240 × 1 | CONV. 1 × 1 |
Output | 320 × 240 × 1 | - |
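The "Bottleneck × 6" operators in this lightweight variant are MobileNetV2-style inverted residual blocks: a 1 × 1 expansion convolution (expansion factor 6, or 1 for IRB1), a 3 × 3 depthwise convolution, and a 1 × 1 linear projection, with a residual connection when the input and output shapes match. A minimal sketch of such a block follows; it reflects the MobileNetV2 design rather than the authors' exact implementation (the table's stride placement differs slightly from the stock backbone).

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNetV2 bottleneck: 1x1 expand -> 3x3 depthwise -> 1x1 linear project."""
    def __init__(self, in_ch, out_ch, stride=1, expansion=6):
        super().__init__()
        hidden = in_ch * expansion
        self.use_residual = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),              # expansion
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),                  # depthwise
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),              # linear projection
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out

# Illustrative stage, e.g. IRB2 in the table: 24 output channels at half resolution.
irb2 = nn.Sequential(InvertedResidual(16, 24, stride=2),
                     InvertedResidual(24, 24, stride=1))
```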
References
- Guizilini, V.; Ambrus, R.; Pillai, S.; Gaidon, A. PackNet-SfM: 3D packing for self-supervised monocular depth estimation. arXiv 2019, arXiv:1905.02693.
- Ummenhofer, B.; Zhou, H.; Uhrig, J.; Mayer, N.; Ilg, E.; Dosovitskiy, A.; Brox, T. DeMoN: Depth and motion network for learning monocular stereo. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5622–5631.
- Zhou, H.; Ummenhofer, B.; Brox, T. DeepTAM: Deep tracking and mapping. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 822–838.
- Yu, C.; Liu, Z.; Liu, X.; Xie, F.; Yang, Y.; Wei, Q.; Fei, Q. DS-SLAM: A Semantic Visual SLAM towards Dynamic Environments. In Proceedings of the 2018 IEEE International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1168–1174.
- Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269.
- Alhashim, I.; Wonka, P. High quality monocular depth estimation via transfer learning. arXiv 2018, arXiv:1812.11941.
- Fu, H.; Gong, M.; Wang, C.; Batmanghelich, K.; Tao, D. Deep ordinal regression network for monocular depth estimation. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 18–22 June 2018; pp. 2002–2011.
- Chen, L.; Tang, W.; Wan, T.R.; John, N.W. Self-supervised monocular image depth learning and confidence estimation. arXiv 2018, arXiv:1803.05530.
- Hu, J.J.; Ozay, M.; Zhang, Y.; Okatani, T. Revisiting Single Image Depth Estimation: Toward Higher Resolution Maps with Accurate Object Boundaries. arXiv 2019, arXiv:1803.08673.
- Qi, X.; Liao, R.; Liu, Z.; Urtasun, R.; Jia, J. GeoNet: Geometric Neural Network for Joint Depth and Surface Normal Estimation. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 18–22 June 2018; pp. 283–291.
- Zhang, Z.; Cui, Z.; Xu, C.; Yan, Y.; Sebe, N.; Yang, J. Pattern-affinitive propagation across depth, surface normal and semantic segmentation. In Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 4106–4115.
- Lee, J.-H.; Kim, C.-S. Monocular Depth Estimation Using Relative Depth Maps. In Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 9729–9738.
- Chen, L.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
- Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor segmentation and support inference from RGBD images. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; pp. 746–760.
- Dai, A.; Chang, A.X.; Savva, M.; Halber, M.; Funkhouser, T.; Nießner, M. ScanNet: Richly annotated 3D reconstructions of indoor scenes. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2432–2443.
- Chang, A.; Dai, A.; Funkhouser, T.; Halber, M.; Niessner, M.; Savva, M.; Song, S.; Zeng, A.; Zhang, Y. Matterport3D: Learning from RGB-D data in indoor environments. In Proceedings of the International Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017; pp. 667–676.
- Wofk, D.; Ma, F.C.; Yang, T.J.; Karaman, S.; Sze, V. FastDepth: Fast Monocular Depth Estimation on Embedded Systems. arXiv 2019, arXiv:1903.03273.
- Nekrasov, V.; Shen, C.H.; Reid, I. Light-Weight RefineNet for Real-Time Semantic Segmentation. arXiv 2018, arXiv:1810.03272.
- Zeng, J.; Tong, Y.; Huang, Y.; Yan, Q.; Sun, W.; Chen, J.; Wang, Y. Deep Surface Normal Estimation with Hierarchical RGB-D Fusion. In Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 6146–6155.
- Yin, W.; Liu, Y.; Shen, C.; Yan, Y. Enforcing Geometric Constraints of Virtual Normal for Depth Prediction. In Proceedings of the 2019 IEEE International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 5683–5692.
- Eigen, D.; Fergus, R. Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture. In Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 13–16 December 2015; pp. 2650–2658.
- Wang, X.; Fouhey, D.; Gupta, A. Designing deep networks for surface normal estimation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 8–10 June 2015; pp. 539–547.
- Garg, R.; BG, V.K.; Carneiro, G.; Reid, I. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 740–756.
- Wang, C.; Miguel Buenaposada, J.; Zhu, R.; Lucey, S. Learning depth from monocular videos using direct methods. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 18–22 June 2018; pp. 2022–2030.
- Yang, Z.; Wang, P.; Wang, Y.; Xu, W.; Nevatia, R. Every Pixel Counts: Unsupervised Geometry Learning with Holistic 3D Motion Understanding. arXiv 2018, arXiv:1806.10556.
- Zhou, T.; Brown, M.; Snavely, N.; Lowe, D.G. Unsupervised Learning of Depth and Ego-Motion from Video. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6612–6619.
- Li, R.; Wang, S.; Long, Z.; Gu, D. UnDeepVO: Monocular Visual Odometry through Unsupervised Deep Learning. arXiv 2017, arXiv:1709.06841.
- Liu, F.; Shen, C.; Lin, G. Deep convolutional neural fields for depth estimation from a single image. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5162–5170.
- Lee, J.H.; Han, M.K.; Ko, D.W.; Suh, I.H. From Big to Small: Multi-Scale Local Planar Guidance for Monocular Depth Estimation. arXiv 2019, arXiv:1907.10326.
- Janai, J.; Guney, F.; Ranjan, A.; Black, M.; Geiger, A. Unsupervised Learning of Multi-Frame Optical Flow with Occlusions. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 713–731.
- Misra, I.; Shrivastava, A.; Gupta, A.; Hebert, M. Cross-stitch networks for multitask learning. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3994–4003.
- Kwak, D.-H.; Lee, S.-H. A Novel Method for Estimating Monocular Depth Using Cycle GAN and Segmentation. Sensors 2020, 20, 2567.
- Xu, D.; Ouyang, W.; Wang, X.; Sebe, N. PAD-Net: Multi-tasks Guided Prediction and Distillation Network for Simultaneous Depth Estimation and Scene Parsing. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 18–22 June 2018; pp. 675–684.
- Jiao, J.B.; Cao, Y.; Song, Y.B.; Lau, R. Look Deeper into Depth: Monocular Depth Estimation with Semantic Booster and Attention-Driven Loss. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 55–71.
- Levin, A.; Lischinski, D.; Weiss, Y. Colorization using optimization. ACM Trans. Graph. 2004, 23, 689–694.
- Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Li, F.-F. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), Miami, FL, USA, 20–25 June 2009.
- Fouhey, D.F.; Gupta, A.; Hebert, M. Data-driven 3D primitives for single image understanding. In Proceedings of the 2013 IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 3392–3399.
- Sun, L.H.; Wang, J.S.; Yun, H.; Zhu, Q.; Yin, B.C. Surface Normal Data Guided Depth Recovery with Graph Laplacian Regularization. In Proceedings of the 2019 ACM Multimedia Asia (MMAsia '19), Beijing, China, 16–18 December 2019; pp. 1–6.
- Huang, J.; Lee, A.B.; Mumford, D. Statistics of range images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Hilton Head, SC, USA, 13–15 June 2000; pp. 324–331.
- Li, X. Depth2pointCloud. Available online: https://github.com/ZJULiXiaoyang/depth2pointCloud (accessed on 19 July 2020).
- Laina, I.; Rupprecht, C.; Belagiannis, V.; Tombari, F.; Navab, N. Deeper Depth Prediction with Fully Convolutional Residual Networks. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 239–248.
- Hickson, S.; Raveendran, K.; Fathi, A.; Murphy, K.; Essa, I. Floors are Flat: Leveraging Semantics for Real-Time Surface Normal Prediction. arXiv 2019, arXiv:1906.06792.
- Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z. Automatic differentiation in PyTorch. In Proceedings of the Advances in Neural Information Processing Systems Workshops, Long Beach, CA, USA, 4–9 December 2017; pp. 1–4.
- Kingma, D.P.; Ba, J.L. Adam: A method for stochastic optimization. In Proceedings of the 2015 International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015; pp. 1–41.
- Cubuk, E.; Zoph, B.; Mane, D.; Vasudevan, V.; Le, Q.V. AutoAugment: Learning augmentation policies from data. arXiv 2018, arXiv:1805.09501.
- Eigen, D.; Puhrsch, C.; Fergus, R. Depth Map Prediction from a Single Image Using a Multi-scale Deep Network. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 2366–2374.
- Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520.
- Haris, M.; Shakhnarovich, G.; Ukita, N. Deep back-projection networks for super-resolution. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1664–1673.
- Ranftl, R.; Lasinger, K.; Hafner, D.; Schindler, K.; Koltun, V. Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer. arXiv 2019, arXiv:1907.01341.
Error metrics (AbsRel, RMSE, log10): lower is better. Accuracy metrics (δ thresholds): higher is better.

Method | AbsRel | RMSE | log10 | δ < 1.25 | δ < 1.25² | δ < 1.25³
---|---|---|---|---|---|---
Eigen [21] | 0.158 | 0.641 | - | 0.769 | 0.950 | 0.988
Fu [7] | 0.115 | 0.509 | 0.051 | 0.828 | 0.965 | 0.992
Alhashim [6] (DenseNet-169) | 0.123 | 0.465 | 0.053 | 0.846 | 0.974 | 0.994
Hu [9] (DenseNet-161) | 0.123 | 0.544 | 0.053 | 0.855 | 0.972 | 0.993
Ours (DenseNet-121, coarse) | 0.137 | 0.572 | 0.056 | 0.839 | 0.962 | 0.988
Ours (DenseNet-121, refined) | 0.122 | 0.459 | 0.051 | 0.859 | 0.972 | 0.993
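The columns above are the standard single-image depth metrics popularized by Eigen et al.: with predicted depth $d_i$, ground-truth depth $d_i^{*}$, and $N$ valid pixels,

```latex
\mathrm{AbsRel} = \frac{1}{N}\sum_{i=1}^{N}\frac{\lvert d_i - d_i^{*}\rvert}{d_i^{*}},\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(d_i - d_i^{*}\right)^{2}},\qquad
\log_{10} = \frac{1}{N}\sum_{i=1}^{N}\left\lvert \log_{10} d_i - \log_{10} d_i^{*}\right\rvert .
```

The accuracy under a threshold is the fraction of pixels with $\max\left(d_i/d_i^{*},\, d_i^{*}/d_i\right) < thr$ for $thr \in \{1.25,\ 1.25^{2},\ 1.25^{3}\}$.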
Angular error (Mean, Median, RMSE, in degrees): lower is better. Accuracy (fraction of pixels with angular error below each threshold): higher is better.

Method | Mean | Median | RMSE | < 11.25° | < 22.5° | < 30°
---|---|---|---|---|---|---
3DP (MW) [37] | 36.3 | 19.2 | 46.6 | 39.2 | 52.9 | 57.8
Wang [22] | 26.9 | 14.8 | - | 42.0 | 61.2 | 68.2
Qi [10] | 19.0 | 11.8 | 26.9 | 48.4 | 71.5 | 79.5
Ours (RGB-D fusion) | 20.6 | 11.0 | 25.6 | 47.9 | 73.2 | 81.8
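For surface normals, the per-pixel error is the angle between the predicted and ground-truth normal vectors; the table reports its mean, median, and RMSE in degrees, and the accuracy columns give the fraction of pixels whose angular error falls below each threshold (the standard 11.25°, 22.5°, and 30° thresholds are assumed here):

```latex
\theta_i = \arccos\!\left(\frac{\langle \mathbf{n}_i,\ \mathbf{n}_i^{*}\rangle}
{\lVert \mathbf{n}_i\rVert\,\lVert \mathbf{n}_i^{*}\rVert}\right),\qquad
\mathrm{acc}(t) = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\!\left[\theta_i < t\right],
\quad t\in\{11.25^{\circ},\ 22.5^{\circ},\ 30^{\circ}\}.
```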
Method | RMSE | Frames | Epochs | Training Time (h) | Iterations | Inference Time (s) | Params
---|---|---|---|---|---|---|---
Fu [7] | 0.509 | 120K | - | - | 3M | - | 110M
Alhashim [6] | 0.465 | 50K | 20 | 20 | 1M | 0.265 | 42.6M
Hu [9] (SENet) | 0.530 | 50K | 20 | - | 1M | 0.352 | -
Ours | 0.459 | 30K | 20 | 18 | 600K | 0.217 | 32.4M
Error metrics: lower is better. Accuracy metrics: higher is better.

Method | Params | AbsRel | RMSE | log10 | δ < 1.25 | δ < 1.25² | δ < 1.25³
---|---|---|---|---|---|---|---
DN-121-Refined | 32.4 M | 0.122 | 0.459 | 0.051 | 0.859 | 0.972 | 0.993
DenseDepth [2] (DN-169) | 42.6 M | 0.123 | 0.465 | 0.053 | 0.846 | 0.974 | 0.994
DN-161-Refined | 67.2 M | 0.116 | 0.446 | 0.049 | 0.867 | 0.976 | 0.994
Method | Params | AbsRel | RMSE | log10 | δ < 1.25 | δ < 1.25² | δ < 1.25³
---|---|---|---|---|---|---|---
Ours (121-Refined) | 32.4 M | 0.122 | 0.459 | 0.051 | 0.859 | 0.972 | 0.993
Ours (V2-Refined) | 4.7 M | 0.196 | 0.519 | 0.146 | 0.811 | 0.951 | 0.979