Abstract
The past decade has witnessed the rapid development of autonomous driving systems. However, it remains a daunting task to achieve full autonomy, especially when it comes to understanding the ever-changing, complex driving scenes. To alleviate the difficulty of perception, self-driving vehicles are usually equipped with a suite of sensors (e.g., cameras, LiDARs), hoping to capture the scenes from overlapping perspectives to minimize blind spots. Fusing these data streams and exploiting their complementary properties is thus rapidly becoming the current trend. Nonetheless, combining data captured by sensors with drastically different ranging/imaging mechanisms is not a trivial task; many factors need to be considered and optimized. If not handled carefully, data from one sensor may act as noise to data from another, and fusing them may yield even poorer results. Thus far, there have been no in-depth guidelines for designing multi-modal fusion based 3D perception algorithms. To fill this void and motivate further investigation, this survey conducts a thorough study of dozens of recent deep learning based multi-modal 3D detection networks (with a special emphasis on LiDAR-camera fusion), focusing on their fusion stage (i.e., when to fuse), fusion inputs (i.e., what to fuse), and fusion granularity (i.e., how to fuse). These important design choices play a critical role in determining the performance of a fusion algorithm. In this survey, we first introduce the background of popular sensors used for self-driving, their data properties, and the corresponding object detection algorithms. Next, we discuss existing datasets that can be used for evaluating multi-modal 3D object detection algorithms. Then we present a review of multi-modal fusion based 3D detection networks, taking a close look at their fusion stage, fusion inputs, and fusion granularity, and how these design choices evolve with time and technology.
After the review, we discuss open challenges as well as possible solutions. We hope that this survey can help researchers get familiar with the field and embark on investigations in the area of multi-modal 3D object detection.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.References
Ahmad, W. A., Wessel, J., Ng, H. J., & Kissinger, D. (2020). IoT-ready millimeter-wave radar sensors. In IEEE global conference on artificial intelligence and Internet of Things (GCAIoT) (pp. 1–5).
Andriluka, M., Roth, S., & Schiele, B. (2010). Monocular 3d pose estimation and tracking by detection. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 623–630).
Arnold, E., Al-Jarrah, O. Y., Dianati, M., Fallah, S., Oxtoby, D., & Mouzakitis, A. (2019). A survey on 3d object detection methods for autonomous driving applications. IEEE Transactions on Intelligent Transportation Systems (TITS), 20(10), 3782–3795.
Asvadi, A., Garrote, L., Premebida, C., Peixoto, P., & Nunes, U. (2017). Multimodal vehicle detection: Fusing 3d-lidar and color camera data. Pattern Recognition Letters,115, 20–29.
Asvadi, A., Garrote, L., Premebida, C., Peixoto, P., & Nunes, U. J. (2018). Multimodal vehicle detection: Fusing 3d-lidar and color camera data. Pattern Recognition Letters, 115, 20–29.
Bai, X., Hu, Z., Zhu, X., Huang, Q., Chen, Y., Fu, H., & Tai, C. L. (2022). Transfusion: Robust lidar-camera fusion for 3d object detection with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 1090–1099).
Beltrán, J., Guindel, C., Moreno, F. M., Cruzado, D., García, F., & De La Escalera, A. (2018). Birdnet: A 3d object detection framework from lidar information. In 2018 21st international conference on intelligent transportation systems (ITSC) (pp. 3517–3523).
Caesar, H., Bankiti, V., Lang, A. H., Vora, S., Liong, V. E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., & Beijbom, O. (2020). nuscenes: A multimodal dataset for autonomous driving. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 11618–11628).
Caine, B., Roelofs, R., Vasudevan, V., Ngiam, J., Chai, Y., Chen, Z., & Shlens, J. (2021). Pseudo-labeling for scalable 3d object detection. CoRR abs arXiv:2103.02093
Caltagirone, L., Bellone, M., Svensson, L., & Wahde, M. (2019). Lidar-camera fusion for road detection using fully convolutional neural networks. Robotics and Autonomous Systems, 111, 125–131.
Carr, P., Sheikh, Y., & Matthews, I. (2012). Monocular object detection using 3d geometric primitives. In A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, & C. Schmid (Eds.), European conference on computer vision (ECCV) (pp. 864–878).
Chadwick, S., Maddern, W., & Newman, P. (2019). Distant vehicle detection using radar and vision. In IEEE international conference on robotics and automation (ICRA) (pp. 8311–8317).
Chang, M. F., Lambert, J., Sangkloy, P., Singh, J., Bak, S., Hartnett, A., Wang, D., Carr, P., Lucey, S., Ramanan, D., & Hays, J. (2019). Argoverse: 3d tracking and forecasting with rich maps. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 8740–8749).
Charles, R. Q., Su, H., Kaichun, M., & Guibas, L. J. (2017). Pointnet: Deep learning on point sets for 3d classification and segmentation. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 77–85).
Chen, X., Kundu, K., Zhang, Z., Ma, H., Fidler, S., & Urtasun, R. (2016). Monocular 3d object detection for autonomous driving. In 2016 IEEE conference on computer vision and pattern recognition (CVPR), Las Vegas, NV, USA, June 27–30, 2016 (pp. 2147–2156). IEEE Computer Society. https://doi.org/10.1109/CVPR.2016.236
Chen, Z., Li, Z., Zhang, S., Fang, L., Jiang, Q., & Zhao, F. (2022b). Autoalignv2: Deformable feature aggregation for dynamic multi-modal 3d object detection. CoRR. arXiv:2207.10316
Chen, Z., Li, Z., Zhang, S., Fang, L., Jiang, Q., Zhao, F., Zhou, B., & Zhao, H. (2022c). AutoAlign: Pixel-instance feature aggregation for multi-modal 3d object detection. In IJCAI.
Chen, Y., Liu, J., Qi, X., Zhang, X., Sun, J., & Jia, J. (2022a). Scaling up kernels in 3d CNNs. arXiv preprint arXiv:2206.10555
Chen, X., Ma, H., Wan, J., Li, B., & Xia, T. (2017). Multi-view 3d object detection network for autonomous driving. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1907–1915).
Chen, X., Kundu, K., Zhu, Y., Ma, H., Fidler, S., & Urtasun, R. (2018). 3d object proposals using stereo imagery for accurate object class detection. IEEE Transactions on Pattern Recognition and Machine Intelligence (TPAMI), 40(5), 1259–1272.
Chen, L., Lin, S., Lu, X., Cao, D., Wu, H., Guo, C., Liu, C., & Wang, F. Y. (2021). Deep neural network based vehicle and pedestrian detection for autonomous driving: A survey. IEEE Transactions on Intelligent Transportation Systems (TITS), 22(6), 3234–3246.
Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2018). DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Recognition and Machine Intelligence (TPAMI), 40(4), 834–848.
Chen, L., Zou, Q., Pan, Z., Lai, D., & Cao, D. (2019). Surrounding vehicle detection using an FPGA panoramic camera and deep CNNs. IEEE Transactions on Intelligent Transportation Systems, 21(12), 5110–5122.
Chu, X., Deng, J., Li, Y., Yuan, Z., Zhang, Y., Ji, J., & Zhang, Y. (2021). Neighbor-vote: Improving monocular 3d object detection through neighbor distance voting. In ACM international conference on multimedia (ACM MM), ACM (pp. 5239–5247).
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3213–3223).
Cui, Y., Chen, R., Chu, W., Chen, L., Tian, D., Li, Y., & Cao, D. (2021). Deep learning for image and point cloud fusion in autonomous driving: A review. IEEE Transactions on Intelligent Transportation Systems (TITS), 23, 1–18.
de Paula Veronese, L., Auat-Cheein, F., Mutz, F., Oliveira-Santos, T., Guivant, J. E., de Aguiar, E., Badue, C. & De Souza, A. F. (2020). Evaluating the limits of a lidar for an autonomous driving localization. IEEE Transactions on Intelligent Transportation Systems (TITS), 22(3), 1449–1458.
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 248–255).
Deng, J., Shi, S., Li, P., Zhou, W., Zhang, Y., & Li, H. (2020). Voxel R-CNN: Towards high performance voxel-based 3d object detection. arXiv:2012.15712
Deng, J., Zhou, W., Zhang, Y., & Li, H. (2021). From multi-view to hollow-3d: Hallucinated hollow-3d R-CNN for 3d object detection. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 31(12), 4722–4734.
Denninger, M., Sundermeyer, M., Winkelbauer, D., Zidan, Y., Olefir, D., Elbadrawy, M., Lodhi, A., & Katam, H. (2019). BlenderProc. CoRR. arXiv:1911.01911.
Deschaud, J. E. (2021). KITTI-CARLA: A KITTI-like dataset generated by CARLA simulator. arXiv preprint arXiv:2109.00892
Ding, Z., Hu, Y., Ge, R., Huang, L., Chen, S., Wang, Y., & Liao, J. (2020). 1st place solution for Waymo open dataset challenge: 3d detection and domain adaptation. CoRR abs arXiv:2006.15505
Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., & Koltun, V. (2017). CARLA: An open urban driving simulator. In Proceedings of the annual conference on robot learning (pp. 1–16)
Engelberg, T., & Niem, W. (2009). Method for classifying an object using a stereo camera. U.S. Patent App. 10/589,641.
Enzweiler, M., & Gavrila, D. M. (2009). Monocular pedestrian detection: Survey and experiments. IEEE Transactions on Pattern Recognition and Machine Intelligence (TPAMI), 31, 2179–2195.
Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.
Fan, L., Pang, Z., Zhang, T., Wang, Y. X., Zhao, H., Wang, F., Wang, N., & Zhang, Z. (2022). Embracing single stride 3d object detector with sparse transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 8458–8468).
Fan, L., Xiong, X., Wang, F., Wang, N., & Zhang, Z. (2021). RangeDet: In defense of range view for lidar-based 3d object detection. CoRR abs arXiv:2103.10039
Fayyad, J., Jaradat, M., Gruyer, D., & Najjaran, H. (2020). Deep learning sensor fusion for autonomous vehicle perception and localization: A review. Sensors, 20, 4220.
Feng, D., Haase-Schütz, C., Rosenbaum, L., Hertlein, H., Gläser, C., Timm, F., Wiesbeck, W., & Dietmayer, K. (2021). Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges. IEEE Transactions on Intelligent Transportation Systems (TITS), 22(3), 1341–1360.
Gählert, N., Jourdan, N., Cordts, M., Franke, U., & Denzler, J. (2020). Cityscapes 3d: Dataset and benchmark for 9 DoF vehicle detection. CoRR. arXiv:2006.07864.
Gaidon, A., Wang, Q., Cabon, Y., & Vig, E. (2016). Virtual worlds as proxy for multi-object tracking analysis. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 4340–4349).
Geiger, A., Lenz, P., & Urtasun, R. (2012). Are we ready for autonomous driving? The KITTI vision benchmark suite. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3354–3361).
Geiger, A., Lenz, P., Stiller, C., & Urtasun, R. (2013). Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research (IJRR), 32(11), 1231–1237.
Geiger, D., & Yuille, A. L. (1991). A common framework for image segmentation. International Journal on Computer Vision (IJCV), 6(3), 227–243.
Girshick, R. (2015). Fast R-CNN. In IEEE international conference on computer vision (ICCV) (pp. 1440–1448).
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 580–587).
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2020). Generative adversarial networks. Communications of the ACM, 63(11), 139–144.
Guan, T., Wang, J., Lan, S., Chandra, R., Wu, Z., Davis, L., & Manocha, D. (2022). M3DETR: Multi-representation, multi-scale, mutual-relation 3d object detection with transformers. In Proceedings of the IEEE/CVF winter conference on applications of computer vision (pp. 772–782).
Guizilini, V., Li, J., Ambruş, R., & Gaidon, A. (2021). Geometric unsupervised domain adaptation for semantic segmentation. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 8537–8547).
Guo, X., Shi, S., Wang, X., & Li, H. (2021). LIGA-Stereo: Learning lidar geometry aware representations for stereo-based 3d detector. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3153–3163).
Guo, J., Kurup, U., & Shah, M. (2019). Is it safe to drive? An overview of factors, metrics, and datasets for driveability assessment in autonomous driving. IEEE Transactions on Intelligent Transportation Systems, 21(8), 3135–3151.
He, K., Gkioxari, G., Dollár, P., & Girshick, R. B. (2017). Mask R-CNN. In IEEE international conference on computer vision (ICCV) (pp. 2980–2988).
He, C., Zeng, H., Huang, J., Hua, X. S., & Zhang, L. (2020). Structure aware single-stage 3d object detection from point cloud. In 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR), Seattle, WA, USA, June 13–19,2020 (pp. 11870–11879). Computer Vision Foundation/IEEE.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 770–778).
He, T., & Soatto, S. (2019). Mono3d++: Monocular 3d vehicle detection with two-scale 3d hypotheses and task priors. Association for the Advancement of Artificial Intelligence (AAAI), 33, 8409–8416.
Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531
Hodaň, T., Vineet, V., Gal, R., Shalev, E., Hanzelka, J., Connell, T., Urbina, P., Sinha, S. N., & Guenter, B. (2019). Photorealistic image synthesis for object instance detection. In 2019 IEEE international conference on image processing (ICIP), IEEE (pp. 66–70).
Hu, Y., Ding, Z., Ge, R., Shao, W., Huang, L., Li, K., & Liu, Q. (2021). AFDetV2: Rethinking the necessity of the second stage for object detection from point clouds. arXiv preprint arXiv:2112.09205
Hu, P., Ziglar, J., Held, D., & Ramanan, D. (2020). What you see is what you get: Exploiting visibility for 3d object detection. In IEEE conference on computer vision and pattern recognition (CVPR), computer vision foundation/IEEE (pp. 10998–11006).
Huang, J., & Huang, G. (2022). BEVDet4D: Exploit temporal cues in multi-camera 3d object detection. arXiv preprint arXiv:2203.17054
Huang, J., Huang, G., Zhu, Z., & Du, D. (2021). BEVDet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv preprint arXiv:2112.11790
Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017a). Densely connected convolutional networks. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2261–2269).
Huang, P., Cheng, M., Chen, Y., Luo, H., Wang, C., & Li, J. (2017). Traffic sign occlusion detection using mobile laser scanning point clouds. IEEE Transactions on Intelligent Transportation Systems, 18(9), 2364–2376.
Huang, T., Liu, Z., Chen, X., & Bai, X. (2020). EPNet: Enhancing point features with image semantics for 3d object detection. European Conference on Computer Vision (ECCV), 12360, 35–52.
Huang, X., Wang, P., Cheng, X., Zhou, D., Geng, Q., & Yang, R. (2019). The apolloscape open dataset for autonomous driving and its application. IEEE Transactions on Pattern Recognition and Machine Intelligence (TPAMI), 42(10), 2702–2719.
Ioannidou, A., Chatzilari, E., Nikolopoulos, S., & Kompatsiaris, I. (2017). Deep learning advances in computer vision with 3d data: A survey. ACM Computing Survey, 50(2), 20:1-20:38.
Jiang, M., Wu, Y., & Lu, C. (2018). PointSIFT: A sift-like network module for 3d point cloud semantic segmentation. CoRR abs arXiv:1807.00652
Jiao, Y., Jie, Z., Chen, S., Chen, J., Wei, X., Ma, L., & Jiang, Y. G. (2022). MSMDfusion: Fusing lidar and camera at multiple scales with multi-depth seeds for 3d object detection. arXiv preprint arXiv:2209.03102
Kar, A., Prakash, A., Liu, M. Y., Cameracci, E., Yuan, J., Rusiniak, M., Acuna, D., Torralba, A., & Fidler, S. (2019). Meta-Sim: Learning to generate synthetic datasets. In IEEE international conference on computer vision (ICCV) (pp. 4550–4559).
Kellner, D., Klappstein, J., & Dietmayer, K. (2012). Grid-based DBSCAN for clustering extended objects in radar data. In IEEE intelligent vehicles symposium (IV) (pp. 365–370).
Kesten, R., Usman, M., Houston, J., Pandya, T., Nadhamuni, K., Ferreira, A., Yuan, M., Low, B., Jain, A., Ondruska, P., Omari, S., Shah, S., Kulkarni, A., Kazakova, A., Tao, C., Platinsky, L., Jiang, W., & Shet, V. (2019). Level 5 perception dataset 2020. https://level-5.global/level5/data/
Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), Doha, Qatar, October 25–29, 2014 (pp. 1746–1751). ACL. https://doi.org/10.3115/v1/d14-1181
Kim, K., & Woo, W. (2005a). A multi-view camera tracking for modeling of indoor environment. Berlin.
Kim, K., & Woo, W. (2005b). A multi-view camera tracking for modeling of indoor environment. In K. Aizawa, Y. Nakamura & S. Satoh (Eds.), Advances in multimedia information processing—PCM 2004 (pp. 288–297).
Kim, Y., Choi, J.W., & Kum, D. (2020). GRIF Net: Gated region of interest fusion network for robust 3d object detection from radar point cloud and monocular image. In IROS (pp. 10857–10864).
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems (NeurIPS) (vol. 25).
Ku, J., Mozifian, M., Lee, J., Harakeh, A., & Waslander, S. L. (2018). Joint 3d proposal generation and object detection from view aggregation. In IEEE international conference on intelligent robots and systems (IROS) (pp. 1–8).
Lang, A. H., Vora, S., Caesar, H., Zhou, L., Yang, J., & Beijbom, O. (2019). PointPillars: Fast encoders for object detection from point clouds. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 12697–12705).
Lee, S. (2020). Deep learning on radar centric 3d object detection. CoRR abs arXiv:2003.00851
Lee, C. H., Lim, Y. C., Kwon, S., & Lee, J. H. (2011). Stereo vision-based vehicle detection using a road feature and disparity histogram. Optical Engineering, 50(2), 027004–027004.
Levinson, J., & Thrun, S. (2013). Automatic online calibration of cameras and lasers. In Robotics: Science and systems (vol. 2, p. 7).
Li, P., Chen, X., & Shen, S. (2019). Stereo R-CNN based 3d object detection for autonomous driving. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 7644–7652).
Li, Y., Yu, A. W., Meng, T., Caine, B., Ngiam, J., Peng, D., Shen, J., Lu, Y., Zhou, D., Le, Q. V., & Yuille, A. (2022). DeepFusion: Lidar-camera deep fusion for multi-modal 3d object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 17182–17191).
Liang, M., Yang, B., Chen, Y., Hu, R., & Urtasun, R. (2019). Multi-task multi-sensor fusion for 3d object detection. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 7337–7345).
Liang, M., Yang, B., Wang, S., & Urtasun, R. (2018). Deep continuous fusion for multi-sensor 3d object detection. In European conference on computer vision (ECCV) (pp. 663–678).
Liang, Z., Zhang, M., Zhang, Z., Zhao, X., & Pu, S. (2020). RangeRCNN: Towards fast and accurate 3d object detection with range image representation. CoRR abs arXiv:2009.00206
Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017a). Feature pyramid networks for object detection. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 936–944).
Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017b). Focal loss for dense object detection. IEEE Transactions on Pattern Recognition and Machine Intelligence (TPAMI), P.P.(99), 2999–3007
Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In European conference on computer vision (ECCV) (pp. 740–755).
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C. (2016). SSD: Single shot multibox detector. In B. Leibe, J. Matas, N. Sebe, & M. Welling (Eds.), European conference on computer vision (ECCV) (pp. 21–37).
Liu, H., Simonyan, K., & Yang, Y. (2018). DARTS: Differentiable architecture search. CoRR. arXiv:1806.09055
Liu, Z., Tang, H., Amini, A., Yang, X., Mao, H., Rus, D., & Han, S. (2022c). BEVFusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. arXiv preprint arXiv:2205.13542
Liu, Y., Wang, T., Zhang, X., & Sun, J. (2022a). PETR: Position embedding transformation for multi-view 3d object detection. arXiv preprint arXiv:2203.05625
Liu, Z., Wu, Z., & Tóth, R. (2020). SMOKE: Single-stage monocular 3d object detection via keypoint estimation. In IEEE conference on computer vision and pattern recognition workshops (CVPRW) (pp. 4289–4298).
Liu, Y., Yan, J., Jia, F., Li, S., Gao, Q., Wang, T., Zhang, X., & Sun, J. (2022b). PETRv2: A unified framework for 3d perception from multi-camera images. arXiv preprint arXiv:2206.01256
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Recognition and Machine Intelligence (TPAMI), 39(4), 640–651.
Lu, H., Chen, X., Zhang, G., Zhou, Q., Ma, Y., & Zhao, Y. (2019). SCANet: Spatial-channel attention network for 3d object detection. In IEEE international conference on acoustics, speech and, S.P. (ICASSP) (pp. 1992–1996).
Ma, X., Wang, Z., Li, H., Zhang, P., Ouyang, W., & Fan, X. (2019). Accurate monocular 3d object detection via color-embedded 3d reconstruction for autonomous driving. In IEEE international conference on computer vision (ICCV) (pp. 6851–6860).
Mahmoud, A., Hu, J. S., & Waslander, S. L. (2022). Dense voxel fusion for 3d object detection. arXiv preprint arXiv:2203.00871
Major, B., Fontijne, D., Ansari, A., Sukhavasi, R. T., Gowaiker, R., Hamilton, M., Lee, S., & Grzechnik, S. K., Subramanian, S. (2019). Vehicle detection with automotive radar using deep learning on range-azimuth-doppler tensors. In IEEE international conference on computer vision workshop (ICCVW) (pp. 924–932).
Manivasagam, S., Wang, S., Wong, K., Zeng, W., Sazanovich, M., Tan, S., Yang, B., Ma, W. C., & Urtasun, R. (2020). LiDARsim: Realistic lidar simulation by leveraging the real world. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 11167–11176).
Mao, J., Xue, Y., Niu, M., Bai, H., Feng, J., Liang, X., Xu, H., & Xu, C. (2021). Voxel transformer for 3d object detection. In 2021 IEEE/CVF international conference on computer vision (ICCV), Montreal, QC, Canada, October 10–17, 2021 (pp. 3144–3153). IEEE. https://doi.org/10.1109/ICCV48922.2021.00315.
Marchand, R., & Chaumette, F. (1999). An autonomous active vision system for complete and accurate 3d scene reconstruction. International Journal on Computer Vision (IJCV), 32(3), 171–194.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A., Veness, J., Bellemare, M., Graves, A., Riedmiller, M., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529–33.
Mousavian, A., Anguelov, D., Flynn, J., & Košecká, J. (2017). 3d bounding box estimation using deep learning and geometry. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 5632–5640).
Nabati, R., & Qi, H. (2019). RRPN: Radar region proposal network for object detection in autonomous vehicles. In IEEE international conference on image processing (ICIP) (pp. 3093–3097).
Nabati, R., & Qi, H. (2021). CenterFusion: Center-based radar and camera fusion for 3d object detection. In IEEE winter conference on applications of computer vision (WACV) (pp. 1527–1536).
Nießner, M., Zollhöfer, M., Izadi, S., & Stamminger, M. (2013). Real-time 3d reconstruction at scale using voxel hashing. ACM Transactions on Graphics (TOG), 32(6), 1–11.
Pan, X., Xia, Z., Song, S., Li, L.E., & Huang, G. (2021). 3d object detection with pointformer. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 7463–7472).
Pandey, G., McBride, J. R., Savarese, S., & Eustice, R. M. (2012). Automatic targetless extrinsic calibration of a 3d lidar and camera by maximizing mutual information. In Association for the advancement of artificial intelligence (AAAI) (pp. 2053–2059).
Pang, S., Morris, D., & Radha, H. (2020). CLOCs: Camera-lidar object candidates fusion for 3d object detection. In IEEE international conference on intelligent robots and systems (IROS) (pp. 10386–10393).
Park, D., Ambrus, R., Guizilini, V., Li, J., & Gaidon, A. (2021). Is pseudo-lidar needed for monocular 3d object detection? In IEEE international conference on computer vision (ICCV) (pp. 3142–3152).
Park, J. Y., Chu, C. W., Kim, H. W., Lim, S. J., Park, J. C., & Koo, B. K. (2009). Multi-view camera color calibration method using color checker chart. US Patent 12/334,095
Patil, A., Malla, S., Gang, H., & Chen, Y. T. (2019). The H3D dataset for full-surround 3D multi-object detection and tracking in crowded urban scenes. In IEEE international conference on robotics and automation (ICRA) (pp. 9552–9557).
Patole, S. M., Torlak, M., Wang, D., & Ali, M. (2017). Automotive radars: A review of signal processing techniques. IEEE Signal Processing Magazine, 34(2), 22–35.
Pham, Q. H., Sevestre, P., Pahwa, R. S., Zhan, H., Pang, C. H., Chen, Y., Mustafa, A., Chandrasekhar, V., & Lin, J. (2020). A* 3d dataset: Towards autonomous driving in challenging environments. In IEEE international conference on robotics and automation (ICRA) (pp. 2267–2273).
Philion, J., & Fidler, S. (2020). Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In European conference on computer vision (pp. 194–210). Springer.
Pon, A. D., Ku, J., Li, C., & Waslander, S. L. (2020). Object-centric stereo matching for 3d object detection. In IEEE international conference on robotics and automation (ICRA) (pp. 8383–8389).
Prakash, A., Boochoon, S., Brophy, M., Acuna, D., Cameracci, E., State, G., Shapira, O., & Birchfield, S. (2019). Structured domain randomization: Bridging the reality gap by context-aware synthetic data. In IEEE international conference on robotics and automation (ICRA) (pp. 7249–7255).
Qi, C. R., Litany, O., He, K., & Guibas, L. (2019). Deep Hough voting for 3d object detection in point clouds. In International conference on computer vision (ICCV) (pp. 9276–9285).
Qi, C. R., Liu, W., Wu, C., Su, H., & Guibas, L. J. (2018). Frustum PointNets for 3d object detection from RGB-D data. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 918–927).
Qi, C.R., Yi, L., Su, H., & Guibas, L. J. (2017). PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems (NeurIPS) (vol. 30).
Qian, K., Zhu, S., Zhang, X., & Li, L. E. (2021). Robust multimodal vehicle detection in foggy weather using complementary lidar and radar signals. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 444–453).
Qin, Z., Wang, J., & Lu, Y. (2019b). Triangulation learning network: From monocular to stereo 3d object detection. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 7615–7623).
Qin, Z., Wang, J., & Lu, Y. (2019a). Monogrnet: A geometric reasoning network for monocular 3d object localization. Association for the Advancement of Artificial Intelligence (AAAI), 33, 8851–8858.
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 779–788).
Ren, S., He, K., Girshick, R., & Sun, J. (2017). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Recognition and Machine Intelligence (TPAMI), 39(6), 1137–1149.
Repairer Driven News (2018). Velodyne: Leading LIDAR price halved, new high-res product to improve self-driving cars. https://www.repairerdrivennews.com/2018/01/02/velodyne-leading-lidar-price-halved-new-high-res-product-to-improve-self-driving-cars/
Richter, S. R., Vineet, V., Roth, S., & Koltun, V. (2016). Playing for data: Ground truth from computer games. In B. Leibe, J. Matas, N. Sebe, & M. Welling (Eds.), European conference on computer vision (ECCV) (pp. 102–118).
Richter, S. R., Al Haija, H. A., & Koltun, V. (2022). Enhancing photorealism enhancement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2), 1700–1715.
Riegler, G., Ulusoy, A. O., & Geiger, A. (2017). OctNet: Learning deep 3d representations at high resolutions. In IEEE conference on computer vision and pattern recognition (CVPR) IEEE Computer Society (pp. 6620–6629).
Roddick, T., & Cipolla, R. (2020). Predicting semantic map representations from images using pyramid occupancy networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11138–11147).
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In: Medical image computing and computer-assisted intervention (MICCAI), (vol. 9351, pp. 234–241).
Ros, G., Sellart, L., Materzynska, J., Vazquez, D., & Lopez, A. M. (2016). The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3234–3243).
Schlosser, J., Chow, C. K., & Kira, Z. (2016). Fusing lidar and images for pedestrian detection using convolutional neural networks. In IEEE international conference on robotics and automation (ICRA) (pp. 2198–2205).
Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85–117.
Schneider, N., Piewak, F., Stiller, C., & Franke, U. (2017). RegNet: Multimodal sensor registration using deep neural networks. In IEEE intelligent vehicles symposium (IV) (pp. 1803–1810).
Sheeny, M., Pellegrin, E. D., Mukherjee, S., Ahrabian, A., Wang, S., & Wallace, A. M. (2021). RADIATE: A radar dataset for automotive perception. In IEEE international conference on robotics and automation (ICRA), Xi’an, China, May 30–June 5, 2021 (pp. 1–7). IEEE. https://doi.org/10.1109/ICRA48506.2021.9562089
Shi, W., & Rajkumar, R. (2020). Point-GNN: Graph neural network for 3d object detection in a point cloud. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1711–1719).
Shi, S., Guo, C., Jiang, L., Wang, Z., Shi, J., Wang, X., & Li, H. (2020a). PV-RCNN: Point-voxel feature set abstraction for 3d object detection. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 10526–10535).
Shi, S., Wang, X., & Li, H. (2019). PointRCNN: 3d object proposal generation and detection from point cloud. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 770–779).
Shi, S., Wang, Z., Shi, J., Wang, X., & Li, H. (2020b). From points to parts: 3d object detection from point cloud with part-aware and part-aggregation network. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 43(8), 2647–2664.
Shin, K., Kwon, Y. P., & Tomizuka, M. (2019). RoarNet: A robust 3d object detection based on region approximation refinement. In IEEE intelligent vehicles symposium (IV) (pp. 2510–2515).
Silver, D., Huang, A., Maddison, C., Guez, A., Sifre, L., Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., & Dieleman, S. (2016). Mastering the game of go with deep neural networks and tree search. Nature, 529, 484–489.
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. In 3rd international conference on learning representations (ICLR), San Diego, CA, USA, May 7–9, 2015, conference track proceedings. arXiv:1409.1556
Sindagi, V. A., Zhou, Y., & Tuzel, O. (2019). MVX-Net: Multimodal voxelnet for 3d object detection. In IEEE international conference on robotics and automation (ICRA) (pp. 7276–7282).
Strecha, C., von Hansen, W., Van Gool, L., Fua, P., & Thoennessen, U. (2008). On benchmarking camera calibration and multi-view stereo for high resolution imagery. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1–8).
Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo, J., Zhou, Y., Chai, Y., Caine, B., Vasudevan, V., Han, W., Ngiam, J., Zhao, H., Timofeev, A., Ettinger, S., Krivokon, M., Gao, A., Joshi, A., Zhang, Y., Shlens, J., Chen, Z., & Anguelov, D. (2020a). Scalability in perception for autonomous driving: Waymo open dataset. In 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR), Seattle, WA, USA, June 13–19, 2020 (pp. 2443–2451). Computer Vision Foundation/IEEE. https://doi.org/10.1109/CVPR42600.2020.00252
Sun, Y., Zuo, W., Yun, P., Wang, H., & Liu, M. (2020b). FuseSeg: Semantic segmentation of urban scenes based on RGB and thermal data fusion. IEEE Transactions on Automation Science and Engineering, PP(99), 1–12.
Tang, H., Liu, Z., Zhao, S., Lin, Y., Lin, J., Wang, H., & Han, S. (2020). Searching efficient 3d architectures with sparse point-voxel convolution. In European conference on computer vision (ECCV) (pp. 685–702).
Urmson, C., Anhalt, J., Bagnell, D., Baker, C., Bittner, R., Clark, M., Dolan, J., Duggins, D., Galatali, T., Geyer, C., & Gittleman, M. (2008). Autonomous driving in urban environments: Boss and the urban challenge. Journal of Field Robotics, 25(8), 425–466.
Urmson, C., Baker, C., Dolan, J., Rybski, P., Salesky, B., Whittaker, W. R., Ferguson, D., & Darms, M. (2009). Autonomous driving in traffic: Boss and the urban challenge. AI Magazine, 30(2), 17–28.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS), 30, 6000–6010.
Vora, S., Lang, A. H., Helou, B., & Beijbom, O. (2020). PointPainting: Sequential fusion for 3d object detection. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 4603–4611).
Wallace, A. M., Halimi, A., & Buller, G. S. (2020). Full waveform lidar for adverse weather conditions. IEEE Transactions on Vehicular Technology (TVT), 69(7), 7064–7077.
Wandinger, U. (2005). Introduction to lidar. Brooks/Cole Pub. Co.
Wang, Z., & Jia, K. (2019a). Frustum ConvNet: Sliding frustums to aggregate local point-wise features for amodal 3d object detection. In IEEE international conference on intelligent robots and systems (IROS) (pp. 1742–1749).
Wang, Z., & Jia, K. (2019b). Frustum ConvNet: Sliding frustums to aggregate local point-wise features for amodal 3d object detection. In IEEE international conference on intelligent robots and systems (IROS) (pp. 1742–1749).
Wang, Y., Chao, W. L., Garg, D., Hariharan, B., Campbell, M., & Weinberger, K. Q. (2019). Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 8437–8445).
Wang, X., Girshick, R. B., Gupta, A., & He, K. (2018). Non-local neural networks. In IEEE conference on computer vision and pattern recognition (CVPR), Computer Vision Foundation/IEEE Computer Society (pp. 7794–7803).
Wang, C., Ma, C., Zhu, M., & Yang, X. (2021). PointAugmenting: Cross-modal augmentation for 3d object detection. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 11794–11803).
Wang, S., Suo, S., Ma, W., Pokrovsky, A., & Urtasun, R. (2018). Deep parametric continuous convolutional neural networks. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2589–2597).
Wang, G., Tian, B., Zhang, Y., Chen, L., Cao, D., & Wu, J. (2020). Multi-view adaptive fusion network for 3D object detection. arXiv preprint arXiv:2011.00652
Wang, M., & Deng, W. (2018). Deep visual domain adaptation: A survey. Neurocomputing, 312, 135–153.
Wang, J., & Zhou, L. (2019). Traffic light recognition with high dynamic range imaging and deep learning. IEEE Transactions on Intelligent Transportation Systems, 20(4), 1341–1352.
Weng, X., Man, Y., Cheng, D., Park, J., O'Toole, M., & Kitani, K. (2020). All-in-one drive: A large-scale comprehensive perception dataset with high-density long-range point clouds. arXiv preprint.
Wilson, B., Qi, W., Agarwal, T., Lambert, J., Singh, J., Khandelwal, S., Pan, B., Kumar, R., Hartnett, A., Pontes, J. K., Ramanan, D., Carr, P., & Hays, J. (2021). Argoverse 2: Next generation datasets for self-driving perception and forecasting. In Proceedings of the neural information processing systems track on datasets and benchmarks (NeurIPS Datasets and Benchmarks 2021).
Wu, X., Peng, L., Yang, H., Xie, L., Huang, C., Deng, C., Liu, H., & Cai, D. (2022). Sparse fuse dense: Towards high quality 3d detection with depth completion. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 5418–5427).
Xie, J., Kiefel, M., Sun, M. T., & Geiger, A. (2016). Semantic instance annotation of street scenes by 3d to 2d label transfer. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3688–3697).
Xie, L., Xiang, C., Yu, Z., Xu, G., Yang, Z., Cai, D., & He, X. (2020). PI-RCNN: An efficient multi-sensor 3d object detector with point-based attentive cont-conv fusion module. Association for the Advancement of Artificial Intelligence (AAAI), 34, 12460–12467.
Xu, D., Anguelov, D., & Jain, A. (2018). PointFusion: Deep sensor fusion for 3d bounding box estimation. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 244–253).
Xu, Q., Zhong, Y., & Neumann, U. (2021). Behind the curtain: Learning occluded shapes for 3d object detection. In Thirty-sixth AAAI conference on artificial intelligence (AAAI), thirty-fourth conference on innovative applications of artificial intelligence (IAAI), the twelfth symposium on educational advances in artificial intelligence (EAAI), virtual event, February 22–March 1, 2022 (pp. 2893–2901). AAAI Press.
Yang, Z., Chen, J., Miao, Z., Li, W., Zhu, X., & Zhang, L. (2022b). DeepInteraction: 3d object detection via modality interaction. arXiv preprint arXiv:2208.11112
Yang, W., Li, Q., Liu, W., Yu, Y., Ma, Y., He, S., & Pan, J. (2021). Projecting your view attentively: Monocular road scene layout estimation via cross-view transformation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 15536–15545).
Yang, H., Liu, Z., Wu, X., Wang, W., Qian, W., He, X., & Cai, D. (2022a). Graph R-CNN: Towards accurate 3d object detection with semantic-decorated local graph. In Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part VIII. Lecture Notes in Computer Science (vol. 13668, pp. 662–679). Springer. https://doi.org/10.1007/978-3-031-20074-8_38
Yang, B., Luo, W., & Urtasun, R. (2018a). PIXOR: Real-time 3d object detection from point clouds. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 7652–7660).
Yang, B., Luo, W., & Urtasun, R. (2018b). PIXOR: Real-time 3d object detection from point clouds. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 7652–7660).
Yang, Z., Sun, Y., Liu, S., & Jia, J. (2020). 3DSSD: Point-based 3d single stage object detector. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 11037–11045).
Yang, Z., Sun, Y., Liu, S., Shen, X., & Jia, J. (2018). IPOD: Intensive point-based object detector for point cloud. CoRR. arXiv:1812.05276
Yang, Z., Sun, Y., Liu, S., Shen, X., & Jia, J. (2019). STD: Sparse-to-dense 3d object detector for point cloud. In IEEE international conference on computer vision (ICCV) (pp. 1951–1960).
Yang, B., Guo, R., Liang, M., Casas, S., & Urtasun, R. (2020). RadarNet: Exploiting radar for robust perception of dynamic objects. European Conference on Computer Vision (ECCV), 12363, 496–512.
Yan, Y., Mao, Y., & Li, B. (2018). SECOND: Sparsely embedded convolutional detection. Sensors, 18(10), 3337.
Yin, T., Zhou, X., & Krähenbühl, P. (2021). Center-based 3d object detection and tracking. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 11784–11793).
Yoo, J., Ahn, N., & Sohn, K. (2020a). Rethinking data augmentation for image super-resolution: A comprehensive analysis and a new strategy. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 8372–8381).
Yoo, J. H., Kim, Y., Kim, J., & Choi, J. W. (2020b). 3D-CVF: Generating joint camera and lidar features using cross-view spatial feature fusion for 3d object detection. In European conference on computer vision (ECCV) (pp. 720–736).
Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). How transferable are features in deep neural networks? In Advances in neural information processing systems (NeurIPS) (vol. 27).
You, Y., Wang, Y., Chao, W., Garg, D., Pleiss, G., Hariharan, B., Campbell, M. E., & Weinberger, K. Q. (2020). Pseudo-lidar++: Accurate depth for 3d object detection in autonomous driving. In 8th International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net
Zewge, N. S., Kim, Y., Kim, J., & Kim, J. H. (2019). Millimeter-wave radar and RGB-D camera sensor fusion for real-time people detection and tracking. In 2019 7th international conference on robot intelligence technology and applications (RiTA) (pp. 93–98).
Zhang, Y., Carballo, A., Yang, H., & Takeda, K. (2021b). Autonomous driving in adverse weather conditions: A survey. arXiv preprint arXiv:2112.08936
Zhang, Y., Carballo, A., Yang, H., & Takeda, K. (2021c). Autonomous driving in adverse weather conditions: A survey. CoRR. arXiv:2112.08936
Zhang, W., Wang, Z., & Loy, C. C. (2020a). Multi-modality cut and paste for 3d object detection. arXiv:2012.12741
Zhang, H., Yang, D., Yurtsever, E., Redmill, K. A., & Özgüner, Ü. (2021a). Faraway-Frustum: Dealing with lidar sparsity for 3d object detection using fusion. In 24th IEEE international intelligent transportation systems conference (ITSC), Indianapolis, IN, USA, September 19–22, 2021 (pp. 2646–2652). IEEE. https://doi.org/10.1109/ITSC48978.2021.9564990
Zhang, Y., Zhang, S., Zhang, Y., Ji, J., Duan, Y., Huang, Y., Peng, J., & Zhang, Y. (2020). Multi-modality fusion perception and computing in autonomous driving. Journal of Computer Research and Development, 57(9), 1781.
Zhao, X., Liu, Z., Hu, R., & Huang, K. (2019). 3d object detection using scale invariant and feature reweighting networks. In Association for the advancement of artificial intelligence (AAAI) (pp. 9267–9274).
Zhou, B., & Krähenbühl, P. (2022). Cross-view transformers for real-time map-view semantic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 13760–13769).
Zhou, Y., & Tuzel, O. (2018). VoxelNet: End-to-end learning for point cloud based 3d object detection. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 4490–4499).
Zhou, Y., Wan, G., Hou, S., Yu, L., Wang, G., Rui, X., & Song, S. (2020). DA4AD: End-to-end deep attention-based visual localization for autonomous driving. In European conference on computer vision (ECCV) (pp. 271–289).
Zhu, H., Deng, J., Zhang, Y., Ji, J., Mao, Q., Li, H., & Zhang, Y. (2022). VPFNet: Improving 3d object detection with virtual point based lidar and stereo data fusion. IEEE Transactions on Multimedia (TMM). https://doi.org/10.1109/TMM.2022.3189778
Acknowledgements
This work was supported by the Anhui Province Development and Reform Commission 2020 New Energy Vehicle Industry Innovation Development Project.
Funding
The funding was provided by National Key Research and Development Program of China (Grant No. 2018AAA0100500).
Communicated by Slobodan Ilic.
Cite this article
Wang, Y., Mao, Q., Zhu, H. et al. Multi-Modal 3D Object Detection in Autonomous Driving: A Survey. Int J Comput Vis 131, 2122–2152 (2023). https://doi.org/10.1007/s11263-023-01784-z