Radar-Camera Fusion Network for Depth Estimation in Structured Driving Scenes
Figure 1. Different parts of the image are predicted by different decoders, which is expected to exploit latent semantic information. "SIC Block" denotes the sparse invariant convolution block.
Figure 2. Different stages of fusion. (a) Early fusion; (b) late fusion.
Figure 3. The architecture of the proposed method. We use a double-encoder, triple-decoder structure. Each decoder branch focuses on a specific category and predicts a depth map. The extra decoder surrounded by a dotted box, which fuses the three depth maps into one, is introduced for one of the fusion strategies we used.
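To make the double-encoder, triple-decoder layout of Figure 3 concrete, the following is a schematic PyTorch sketch (not the authors' code): all submodules are placeholders, and the concatenation-based merging of image and radar features is an assumption.

```python
import torch
import torch.nn as nn


class TripleDecoderDepthNet(nn.Module):
    """Schematic double-encoder / triple-decoder layout: image and radar
    features are extracted separately, merged, and decoded by three
    branches, each predicting a depth map for one scene category; an
    optional extra decoder fuses the three maps (dotted box in Figure 3).
    All submodules are placeholders."""

    def __init__(self, image_encoder, radar_encoder, decoders, fusion_decoder=None):
        super().__init__()
        self.image_encoder = image_encoder
        self.radar_encoder = radar_encoder
        self.decoders = nn.ModuleList(decoders)   # three category-specific decoders
        self.fusion_decoder = fusion_decoder      # optional fusion branch

    def forward(self, image, radar_depth):
        # assumption: the encoder outputs share a spatial size and are concatenated
        feats = torch.cat([self.image_encoder(image),
                           self.radar_encoder(radar_depth)], dim=1)
        depth_maps = [decoder(feats) for decoder in self.decoders]
        if self.fusion_decoder is not None:
            fused = self.fusion_decoder(torch.cat(depth_maps, dim=1))
            return fused, depth_maps
        return depth_maps
```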
Figure 4. Encoder for feature extraction.
Figure 5. Examples of data used in the experiment. (a) RGB image; (b) millimeter-wave radar data (enhanced by 5×); (c) depth labels; (d) semantic segmentation labels.
Figure 6. Sparse invariant convolution procedure. The symbol "∗" denotes a two-dimensional convolution, and the symbol "⊙" denotes element-wise multiplication at corresponding pixel positions.
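The procedure in Figure 6 follows sparsity-invariant convolution (Uhrig et al., "Sparsity invariant CNNs", in the reference list): the input is masked, convolved, renormalized by the number of valid contributing pixels, and the validity mask is propagated by max-pooling. Below is an illustrative PyTorch sketch; the class and parameter names are assumed, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseInvariantConv(nn.Module):
    """Sparsity-invariant convolution: features are masked ("⊙"), convolved
    ("∗"), and renormalized by the number of valid inputs; the validity mask
    is propagated with max-pooling."""

    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=pad, bias=False)
        self.bias = nn.Parameter(torch.zeros(out_ch))
        self.pool = nn.MaxPool2d(kernel_size, stride=1, padding=pad)
        self.kernel_size = kernel_size
        self.eps = 1e-8

    def forward(self, x, mask):
        # mask: (B, 1, H, W), 1 where a radar depth measurement exists, 0 elsewhere
        feat = self.conv(x * mask)                      # "⊙" then "∗"
        ones = torch.ones(1, 1, self.kernel_size, self.kernel_size,
                          device=x.device, dtype=x.dtype)
        valid = F.conv2d(mask, ones, padding=self.kernel_size // 2)
        feat = feat / (valid + self.eps) + self.bias.view(1, -1, 1, 1)
        new_mask = self.pool(mask)   # valid downstream if any input in the window was valid
        return feat, new_mask
```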
Figure 7. Five sparse invariant convolution layers are stacked in the radar branch for preliminary feature extraction.
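Continuing the sketch above, the five-layer radar branch of Figure 7 could be stacked as follows; the channel widths are illustrative and not taken from the paper.

```python
import torch.nn as nn


class RadarSICBranch(nn.Module):
    """Five stacked sparse invariant convolutions for preliminary radar
    feature extraction (reuses SparseInvariantConv from the sketch above;
    channel widths are illustrative)."""

    def __init__(self, channels=(1, 16, 16, 16, 16, 16)):
        super().__init__()
        self.layers = nn.ModuleList(
            [SparseInvariantConv(c_in, c_out)
             for c_in, c_out in zip(channels[:-1], channels[1:])]
        )

    def forward(self, radar_depth, mask):
        x = radar_depth
        for layer in self.layers:
            x, mask = layer(x, mask)
        return x, mask
```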
Figure 8. The upsampling module used in the decoder; it upsamples its input by 2×.
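The exact layout of this block is defined by Figure 8; a common realization, used here only as an assumed placeholder, is bilinear interpolation followed by a 3 × 3 convolution.

```python
import torch.nn as nn
import torch.nn.functional as F


class UpsampleBlock(nn.Module):
    """2x upsampling block: bilinear interpolation followed by a 3x3
    convolution (one common realization; the paper's block may differ)."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        return self.act(self.conv(x))
```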
Figure 9. Fusion methods for the depth maps. (a) Direct summation of the depth maps; (b) weighted summation using confidence maps aligned with the depth maps; (c) selection of the corresponding regions using a generated semantic segmentation map.
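The three strategies in Figure 9 can be written compactly as below. This is an illustrative sketch only: the function name and the softmax normalization of the confidence maps are assumptions, not the paper's exact operations.

```python
import torch


def fuse_depth_maps(depths, method="seg", confidences=None, seg_map=None):
    """Fuse the three branch depth maps (B, 3, H, W) into one map (B, 1, H, W).

    depths:      stacked predictions from the three decoders
    confidences: per-branch confidence maps, same shape as depths (for "conf")
    seg_map:     per-pixel branch index in {0, 1, 2}, long tensor of shape (B, H, W) (for "seg")
    """
    if method == "add":
        # (a) plain summation of the branch predictions
        return depths.sum(dim=1, keepdim=True)
    if method == "conf":
        # (b) confidence-weighted summation, normalized per pixel
        weights = torch.softmax(confidences, dim=1)
        return (weights * depths).sum(dim=1, keepdim=True)
    if method == "seg":
        # (c) at every pixel, keep the depth predicted by the branch chosen
        #     by the generated semantic segmentation map
        return torch.gather(depths, dim=1, index=seg_map.unsqueeze(1))
    raise ValueError(f"unknown fusion method: {method}")
```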
Figure 10. Depth estimation results of the single-decoder model and our proposed method with different fusion methods on the nuScenes dataset. The red circles indicate areas that require special attention. (a) RGB image; (b) ground truth; (c) single-decoder results; (d–f) results of our proposed model using the add, conf, and seg fusion methods, respectively.
Abstract
1. Introduction
- A CNN-based triple-decoder architecture is introduced into the network so that each decoder addresses a different area of the image, making better use of latent semantic information in autonomous driving scenes;
- We apply a variant of the L1 loss function during training to make the network focus more on the main visual objectives, which better matches human driving habits (an illustrative sketch of one possible weighting scheme follows this list);
- We evaluate our proposed depth estimation network on the nuScenes dataset, showing that our approach can significantly reduce estimation error, especially in areas of greater interest to drivers.
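As referenced in the second contribution, one way such a region-focused L1 variant could be written is sketched below. This is an illustrative formulation only; the weighting form and the names `region_weight` and `alpha` are assumptions, not the paper's exact loss.

```python
import torch


def region_weighted_l1(pred, target, region_weight, valid_mask, alpha=1.0):
    """L1 depth loss with extra weight on pixels in the driver's main field
    of view (e.g., the road region). Illustrative only; the paper's exact
    loss may differ.

    pred, target:  (B, 1, H, W) predicted and ground-truth depth
    region_weight: (B, 1, H, W) in [0, 1], 1 inside the emphasized region
    valid_mask:    (B, 1, H, W), 1 where ground-truth depth exists
    alpha:         scalar controlling how strongly the region is emphasized
    """
    weights = 1.0 + alpha * region_weight        # emphasize the main region
    err = weights * (pred - target).abs() * valid_mask
    return err.sum() / valid_mask.sum().clamp(min=1)
```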
2. Related Works
2.1. Fusion of Radar and Camera Applications
2.2. Monocular Depth Estimation
2.3. Depth Completion
2.4. Depth Estimation with Semantic Information
3. Methodology
3.1. Overview Architecture
3.2. Feature Extraction
3.3. Depth Decoder
3.4. Fusion of Depth Maps
3.5. Loss Function
4. Experiments
4.1. Experimental Setup
4.2. Comparing Results
4.3. Optimal Parameter of Loss Function
4.4. Fusion of Depth Maps
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Ju, Z.; Zhang, H.; Li, X.; Chen, X.; Han, J.; Yang, M. A survey on attack detection and resilience for connected and automated vehicles: From vehicle dynamics and control perspective. IEEE Trans. Intell. Veh. 2022, 7, 815–837. [Google Scholar]
- Peng, X.; Zhu, X.; Wang, T.; Ma, Y. SIDE: Center-based stereo 3D detector with structure-aware instance depth estimation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2022; pp. 119–128. [Google Scholar]
- Li, Y.; Ge, Z.; Yu, G.; Yang, J.; Wang, Z.; Shi, Y.; Sun, J.; Li, Z. Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 13–14 February 2023; Volume 37, pp. 1477–1485. [Google Scholar]
- Alaba, S.Y.; Ball, J.E. Deep Learning-Based Image 3-D Object Detection for Autonomous Driving. IEEE Sens. J. 2023, 23, 3378–3394. [Google Scholar]
- Wei, R.; Li, B.; Mo, H.; Zhong, F.; Long, Y.; Dou, Q.; Liu, Y.H.; Sun, D. Distilled Visual and Robot Kinematics Embeddings for Metric Depth Estimation in Monocular Scene Reconstruction. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; pp. 8072–8077. [Google Scholar]
- Sayed, M.; Gibson, J.; Watson, J.; Prisacariu, V.; Firman, M.; Godard, C. SimpleRecon: 3D reconstruction without 3D convolutions. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2022; pp. 1–19. [Google Scholar]
- Xu, R.; Dong, W.; Sharma, A.; Kaess, M. Learned depth estimation of 3d imaging radar for indoor mapping. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; pp. 13260–13267. [Google Scholar]
- Hong, F.T.; Zhang, L.; Shen, L.; Xu, D. Depth-aware generative adversarial network for talking head video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3397–3406. [Google Scholar]
- Lee, J.H.; Heo, M.; Kim, K.R.; Kim, C.S. Single-image depth estimation based on fourier domain analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 330–339. [Google Scholar]
- Ramamonjisoa, M.; Du, Y.; Lepetit, V. Predicting sharp and accurate occlusion boundaries in monocular depth estimation using displacement fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 14648–14657. [Google Scholar]
- Qi, X.; Liao, R.; Liu, Z.; Urtasun, R.; Jia, J. Geonet: Geometric neural network for joint depth and surface normal estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 283–291. [Google Scholar]
- Lu, K.; Barnes, N.; Anwar, S.; Zheng, L. From depth what can you see? Depth completion via auxiliary image reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11306–11315. [Google Scholar]
- Van Gansbeke, W.; Neven, D.; De Brabandere, B.; Van Gool, L. Sparse and noisy lidar completion with rgb guidance and uncertainty. In Proceedings of the 2019 16th International Conference on Machine Vision Applications (MVA), Tokyo, Japan, 27–31 May 2019; pp. 1–6. [Google Scholar]
- Fu, C.; Dong, C.; Mertz, C.; Dolan, J.M. Depth completion via inductive fusion of planar lidar and monocular camera. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 10843–10848. [Google Scholar]
- Vandana, G.; Pardhasaradhi, B.; Srihari, P. Intruder detection and tracking using 77 ghz fmcw radar and camera data. In Proceedings of the 2022 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT), Bangalore, India, 8–10 July 2022; pp. 1–6. [Google Scholar]
- Ram, S.S. Fusion of inverse synthetic aperture radar and camera images for automotive target tracking. IEEE J. Sel. Top. Signal Process. 2022, 17, 431–444. [Google Scholar] [CrossRef]
- Hazra, S.; Feng, H.; Kiprit, G.N.; Stephan, M.; Servadei, L.; Wille, R.; Weigel, R.; Santra, A. Cross-modal learning of graph representations using radar point cloud for long-range gesture recognition. In Proceedings of the 2022 IEEE 12th Sensor Array and Multichannel Signal Processing Workshop (SAM), Trondheim, Norway, 20–23 June 2022; pp. 350–354. [Google Scholar]
- Shokouhmand, A.; Eckstrom, S.; Gholami, B.; Tavassolian, N. Camera-augmented non-contact vital sign monitoring in real time. IEEE Sens. J. 2022, 22, 11965–11978. [Google Scholar] [CrossRef]
- Sengupta, A.; Cao, S. mmpose-nlp: A natural language processing approach to precise skeletal pose estimation using mmwave radars. IEEE Trans. Neural Netw. Learn. Syst. 2022. [Google Scholar] [CrossRef] [PubMed]
- Schroth, C.A.; Eckrich, C.; Kakouche, I.; Fabian, S.; von Stryk, O.; Zoubir, A.M.; Muma, M. Emergency Response Person Localization and Vital Sign Estimation Using a Semi-Autonomous Robot Mounted SFCW Radar. arXiv 2023, arXiv:2305.15795. [Google Scholar]
- Hussain, M.I.; Azam, S.; Rafique, M.A.; Sheri, A.M.; Jeon, M. Drivable region estimation for self-driving vehicles using radar. IEEE Trans. Veh. Technol. 2022, 71, 5971–5982. [Google Scholar] [CrossRef]
- Wu, B.X.; Lin, J.J.; Kuo, H.K.; Chen, P.Y.; Guo, J.I. Radar and Camera Fusion for Vacant Parking Space Detection. In Proceedings of the 2022 IEEE 4th International Conference on Artificial Intelligence Circuits and Systems (AICAS), Incheon, Republic of Korea, 13–15 June 2022; pp. 242–245. [Google Scholar]
- Kubo, K.; Ito, T. Driver’s Sleepiness Estimation Using Millimeter Wave Radar and Camera. In Proceedings of the 2022 IEEE CPMT Symposium Japan (ICSJ), Kyoto, Japan, 9–11 November 2022; pp. 98–99. [Google Scholar]
- de Araujo, P.R.M.; Elhabiby, M.; Givigi, S.; Noureldin, A. A Novel Method for Land Vehicle Positioning: Invariant Kalman Filters and Deep-Learning-Based Radar Speed Estimation. IEEE Trans. Intell. Veh. 2023, 1–12. [Google Scholar] [CrossRef]
- Liu, B.; Gould, S.; Koller, D. Single image depth estimation from predicted semantic labels. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 1253–1260. [Google Scholar]
- Ladicky, L.; Shi, J.; Pollefeys, M. Pulling things out of perspective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 89–96. [Google Scholar]
- Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. In Proceedings of the Advances in Neural Information Processing Systems 27 (NIPS 2014), Montreal, QC, USA, 8–13 December 2014. [Google Scholar]
- Li, B.; Shen, C.; Dai, Y.; Van Den Hengel, A.; He, M. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical crfs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1119–1127. [Google Scholar]
- Laina, I.; Rupprecht, C.; Belagiannis, V.; Tombari, F.; Navab, N. Deeper depth prediction with fully convolutional residual networks. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 239–248. [Google Scholar]
- Hu, J.; Ozay, M.; Zhang, Y.; Okatani, T. Revisiting single image depth estimation: Toward higher resolution maps with accurate object boundaries. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 7–11 January 2019; pp. 1043–1051. [Google Scholar]
- Chen, Y.; Zhao, H.; Hu, Z.; Peng, J. Attention-based context aggregation network for monocular depth estimation. Int. J. Mach. Learn. Cybern. 2021, 12, 1583–1596. [Google Scholar] [CrossRef]
- Xu, D.; Wang, W.; Tang, H.; Liu, H.; Sebe, N.; Ricci, E. Structured attention guided convolutional neural fields for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3917–3925. [Google Scholar]
- Chen, T.; An, S.; Zhang, Y.; Ma, C.; Wang, H.; Guo, X.; Zheng, W. Improving monocular depth estimation by leveraging structural awareness and complementary datasets. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 90–108. [Google Scholar]
- Cao, Y.; Wu, Z.; Shen, C. Estimating depth from monocular images as classification using deep fully convolutional residual networks. IEEE Trans. Circuits Syst. Video Technol. 2017, 28, 3174–3182. [Google Scholar] [CrossRef]
- Fu, H.; Gong, M.; Wang, C.; Batmanghelich, K.; Tao, D. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2002–2011. [Google Scholar]
- Uhrig, J.; Schneider, N.; Schneider, L.; Franke, U.; Brox, T.; Geiger, A. Sparsity invariant cnns. In Proceedings of the 2017 International Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017; pp. 11–20. [Google Scholar]
- Jaritz, M.; De Charette, R.; Wirbel, E.; Perrotton, X.; Nashashibi, F. Sparse and dense data with cnns: Depth completion and semantic segmentation. In Proceedings of the 2018 International Conference on 3D Vision (3DV), Verona, Italy, 5–8 September 2018; pp. 52–60. [Google Scholar]
- Hu, M.; Wang, S.; Li, B.; Ning, S.; Fan, L.; Gong, X. Penet: Towards precise and efficient image guided depth completion. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 13656–13662. [Google Scholar]
- Cheng, X.; Wang, P.; Yang, R. Learning depth with convolutional spatial propagation network. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2361–2379. [Google Scholar] [CrossRef] [PubMed]
- Zhang, Z.; Cui, Z.; Xu, C.; Jie, Z.; Li, X.; Yang, J. Joint task-recursive learning for RGB-D scene understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2608–2623. [Google Scholar] [CrossRef] [PubMed]
- Zhu, S.; Brazil, G.; Liu, X. The edge of depth: Explicit constraints between segmentation and depth. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13116–13125. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
- Ma, F.; Karaman, S. Sparse-to-dense: Depth prediction from sparse depth samples and a single image. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 4796–4803. [Google Scholar]
- Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. Nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11621–11631. [Google Scholar]
- Ma, F.; Cavalheiro, G.V.; Karaman, S. Self-supervised sparse-to-dense: Self-supervised depth completion from lidar and monocular camera. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 3288–3295. [Google Scholar]
- Lin, J.T.; Dai, D.; Van Gool, L. Depth estimation from monocular images and sparse radar data. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 10233–10240. [Google Scholar]
Comparison with existing methods on the nuScenes dataset (the three unlabeled accuracy columns are interpreted as the standard δ < 1.25, δ < 1.25², δ < 1.25³ thresholds):

| Method | MAE | RMSE | REL | δ < 1.25 | δ < 1.25² | δ < 1.25³ | Params (MB) | FPS |
|---|---|---|---|---|---|---|---|---|
| Ma et al. [46] | 3.430 | 7.195 | 0.164 | 0.809 | 0.916 | 0.959 | 26.107 | 3.690 |
| Hu et al. [38] | 3.630 | 6.882 | 0.187 | 0.779 | 0.916 | 0.963 | 131.919 | 2.966 |
| Lin et al. [47] | 2.640 | 5.889 | 0.118 | 0.874 | 0.950 | 0.976 | 29.422 | 22.625 |
| Ours with Seg | 2.424 | 5.516 | 0.112 | 0.887 | 0.956 | 0.979 | 31.950 | 14.182 |
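For reference, the error and accuracy measures reported in these tables are the standard depth-estimation metrics. The following is an illustrative PyTorch sketch of how they are typically computed over pixels with valid ground truth (not the authors' evaluation code).

```python
import torch


def depth_metrics(pred: torch.Tensor, gt: torch.Tensor) -> dict:
    """Standard depth-estimation metrics over pixels with valid ground truth."""
    mask = gt > 0                          # depth labels are sparse; skip empty pixels
    pred, gt = pred[mask], gt[mask]
    abs_err = (pred - gt).abs()
    ratio = torch.max(pred / gt, gt / pred)
    return {
        "MAE":  abs_err.mean().item(),
        "RMSE": (abs_err ** 2).mean().sqrt().item(),
        "REL":  (abs_err / gt).mean().item(),          # mean absolute relative error
        "d1":   (ratio < 1.25).float().mean().item(),  # δ < 1.25
        "d2":   (ratio < 1.25 ** 2).float().mean().item(),
        "d3":   (ratio < 1.25 ** 3).float().mean().item(),
    }
```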
MAE and RMSE by scene category (road, tree, sky) for different values of the loss-function parameter (Section 4.3):

| Parameter | Road MAE | Road RMSE | Tree MAE | Tree RMSE | Sky MAE | Sky RMSE |
|---|---|---|---|---|---|---|
| 0.0 | 2.351 | 4.978 | 4.889 | 8.262 | 6.931 | 8.749 |
| 0.2 | 1.053 | 3.062 | 4.515 | 7.914 | 6.717 | 8.597 |
| 0.4 | 1.021 | 3.023 | 4.559 | 7.974 | 6.691 | 8.609 |
| 0.6 | 0.984 | 2.990 | 4.655 | 8.056 | 6.899 | 8.719 |
| 0.8 | 0.968 | 2.962 | 4.524 | 7.959 | 6.331 | 8.149 |
| 1.0 | 0.932 | 2.922 | 4.480 | 7.895 | 6.956 | 9.074 |
| 1.2 | 0.931 | 2.902 | 4.637 | 8.057 | 7.170 | 9.243 |
| 1.4 | 0.905 | 2.853 | 4.457 | 7.845 | 6.575 | 8.449 |
| 1.6 | 0.905 | 2.856 | 4.746 | 8.228 | 7.158 | 9.032 |
| 1.8 | 1.015 | 2.943 | 5.396 | 8.843 | 9.019 | 10.924 |
| 2.0 | 0.953 | 2.921 | 9.950 | 15.089 | 18.233 | 20.668 |
Comparison of the fusion methods on overall metrics (the three unlabeled accuracy columns are interpreted as the standard δ thresholds):

| Method | MAE | RMSE | REL | δ < 1.25 | δ < 1.25² | δ < 1.25³ |
|---|---|---|---|---|---|---|
| Single | 2.634 | 5.909 | 0.119 | 0.874 | 0.950 | 0.976 |
| Add | 2.612 | 5.857 | 0.119 | 0.876 | 0.950 | 0.976 |
| Conf | 2.606 | 5.875 | 0.120 | 0.876 | 0.950 | 0.976 |
| Seg | 2.424 | 5.516 | 0.112 | 0.887 | 0.956 | 0.979 |
Comparison of the fusion methods by distance range:

| Method | 0–10 m MAE | 0–10 m RMSE | 10–30 m MAE | 10–30 m RMSE | 30–50 m MAE | 30–50 m RMSE | 50–100 m MAE | 50–100 m RMSE |
|---|---|---|---|---|---|---|---|---|
| Single | 0.582 | 1.581 | 2.343 | 4.296 | 5.996 | 8.164 | 12.340 | 16.604 |
| Add | 0.583 | 1.604 | 2.336 | 4.329 | 6.006 | 8.188 | 11.895 | 16.217 |
| Conf | 0.597 | 1.657 | 2.357 | 4.423 | 5.678 | 7.893 | 11.700 | 15.923 |
| Seg | 0.559 | 1.560 | 2.203 | 4.111 | 5.595 | 7.743 | 10.910 | 15.195 |
Li, S.; Yan, J.; Chen, H.; Zheng, K. Radar-Camera Fusion Network for Depth Estimation in Structured Driving Scenes. Sensors 2023, 23, 7560. https://doi.org/10.3390/s23177560