DOI: 10.1007/978-3-031-25063-7_19
Article

Hybrid Transformer Based Feature Fusion for Self-Supervised Monocular Depth Estimation

Published: 16 February 2023

Abstract

With an unprecedented increase in the number of agents and systems that aim to navigate the real world using visual cues, and the rising impetus for 3D vision models, the importance of depth estimation is hard to overstate. While supervised methods remain the gold standard in the domain, the copious amount of paired stereo data required to train such models makes them impractical. Most state-of-the-art (SOTA) self-supervised and unsupervised methods employ a ResNet-based encoder to predict disparity maps from a given input image; these maps are then used alongside a camera pose estimator to predict depth without direct supervision. The fully convolutional nature of ResNets restricts them to capturing local, per-pixel information, which is suboptimal for depth prediction. Our key insight for removing this bottleneck is to use Vision Transformers, whose self-attention captures the global contextual information present in an input image. Our model fuses per-pixel local information learned by two fully convolutional depth encoders with global contextual information learned by a transformer encoder at different scales. It does so using a mask-guided multi-stream convolution in the feature space, achieving state-of-the-art performance on most standard benchmarks.
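To make the fusion step concrete, below is a minimal PyTorch sketch of a mask-guided, multi-stream fusion block that blends local (convolutional) features with global (transformer) features at a single decoder scale. It is an illustration of the general technique the abstract names, under stated assumptions: the module name, layer choices, and tensor shapes are hypothetical and do not reproduce the authors' implementation.

    import torch
    import torch.nn as nn

    class MaskGuidedFusion(nn.Module):
        """Illustrative (hypothetical) fusion block: blends per-pixel local
        CNN features with global transformer features via a learned soft mask."""

        def __init__(self, channels: int):
            super().__init__()
            # One convolution stream per feature source.
            self.local_stream = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.global_stream = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            # Mask predictor: per-pixel blending weights in [0, 1].
            self.mask_head = nn.Sequential(
                nn.Conv2d(2 * channels, channels, kernel_size=1),
                nn.Sigmoid(),
            )

        def forward(self, local_feat: torch.Tensor, global_feat: torch.Tensor) -> torch.Tensor:
            # Predict a soft mask from the concatenated inputs, then use it to
            # weight the two processed streams before summing them.
            mask = self.mask_head(torch.cat([local_feat, global_feat], dim=1))
            return mask * self.local_stream(local_feat) + (1.0 - mask) * self.global_stream(global_feat)

    # Usage at a single scale: both feature maps must share shape (B, C, H, W).
    fusion = MaskGuidedFusion(channels=256)
    cnn_feat = torch.randn(1, 256, 24, 80)  # e.g. a ResNet stage output
    vit_feat = torch.randn(1, 256, 24, 80)  # e.g. reassembled ViT tokens
    fused = fusion(cnn_feat, vit_feat)      # -> torch.Size([1, 256, 24, 80])

In the paper's setting, such a block would be applied at each scale of the depth decoder, letting the network decide per pixel whether local texture or global scene context should dominate the disparity prediction.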



Published In

Computer Vision – ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part II
October 2022, 788 pages
ISBN: 978-3-031-25062-0
DOI: 10.1007/978-3-031-25063-7

Publisher

Springer-Verlag, Berlin, Heidelberg
