DMFusion: LiDAR-camera fusion framework with depth merging and temporal aggregation

Applied Intelligence

Abstract

Multimodal 3D object detection is an active research topic in the field of autonomous driving. Most existing methods use both camera and LiDAR modalities but fuse their features through simple and insufficient mechanisms. Additionally, these approaches lack reliable positional and temporal information because they rely on single-frame camera data. In this paper, a novel end-to-end framework for 3D object detection is proposed to solve these problems through spatial and temporal fusion. The spatial information of bird’s-eye view (BEV) features is enhanced by integrating depth features from point clouds during the conversion of image features into 3D space. Moreover, positional and temporal information is augmented by aggregating multi-frame features. The framework, named DMFusion, consists of the following components: (i) a novel depth fusion view transform module (referred to as DFLSS), (ii) a simple and easily adjustable temporal fusion module based on 3D convolution (referred to as 3DMTF), and (iii) a LiDAR-temporal fusion module based on a channel attention mechanism. On the nuScenes benchmark, DMFusion improves mAP by 1.42% and NDS by 1.26% compared with the baseline model, which demonstrates the effectiveness of the proposed method. The code will be released at https://github.com/lilkeker/DMFusion.
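To make component (iii) more concrete, the following is a minimal sketch of channel-attention fusion of two BEV feature maps, written in plain Python. It is not the authors' implementation: it assumes a squeeze-and-excitation style gate with a single linear layer, and the function names (`global_avg_pool`, `channel_attention_fuse`) and the hand-supplied `weight` and `bias` parameters are hypothetical. In a trained model those parameters would be learned.

```python
import math


def global_avg_pool(channels):
    """Squeeze step: average each H x W channel map into one scalar."""
    return [
        sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        for ch in channels
    ]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention_fuse(lidar_bev, camera_bev, weight, bias):
    """Concatenate two BEV feature maps along channels and reweight each channel.

    lidar_bev and camera_bev are lists of channels, each channel an H x W
    nested list. weight is a C x C matrix and bias a length-C vector, where
    C is the total channel count. Both are assumptions of this sketch.
    """
    fused = lidar_bev + camera_bev      # channel-wise concatenation
    pooled = global_avg_pool(fused)     # one descriptor per channel
    # Excitation step: one gate in (0, 1) per channel.
    gates = [
        sigmoid(sum(w * p for w, p in zip(row, pooled)) + b)
        for row, b in zip(weight, bias)
    ]
    # Scale every value in a channel by that channel's gate.
    reweighted = [
        [[g * v for v in row] for row in ch]
        for g, ch in zip(gates, fused)
    ]
    return reweighted, gates
```

With all-zero `weight` and `bias`, every gate is 0.5 and each channel is simply halved; non-zero parameters let the gate emphasise whichever modality carries the more informative channels for a given scene.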

Data availability and access

The data used in this study is sourced from the nuScenes dataset. The nuScenes dataset is publicly available and can be accessed through its official website at https://www.nuscenes.org/nuscenes. Researchers and interested parties can obtain the data by following the access instructions provided on the dataset’s website. The code that supports the findings of this study is available from the corresponding author upon reasonable request.

Funding

This research was supported by the Baima Lake Laboratory Joint Funds of the Zhejiang Provincial Natural Science Foundation of China under Grant No. LBMHD24F030002 and the National Natural Science Foundation of China under Grant 62373329.

Author information

Contributions

Xinyi Yu: Conceived the research idea and conducted experiments. Ke Lu: Contributed to the conceptualization of the study and participated in experimental work. Yang Yang: Played a significant role in the research; specific contributions include conducting statistical analysis and developing theoretical models. Linlin Ou: Contributed to manuscript writing, organization, and also participated in experiments.

Corresponding author

Correspondence to Linlin Ou.

Ethics declarations

Competing Interests

The authors declare that they have no commercial or associative interest that represents a conflict of interest in connection with the submitted work.

Ethical statement

Informed consent was obtained from all human participants involved in the study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yu, X., Lu, K., Yang, Y. et al. DMFusion: LiDAR-camera fusion framework with depth merging and temporal aggregation. Appl Intell 54, 9412–9428 (2024). https://doi.org/10.1007/s10489-024-05627-3

  • DOI: https://doi.org/10.1007/s10489-024-05627-3
