CRAFT: camera-radar 3D object detection with spatio-contextual fusion transformer

Authors:

Dongsuk Kum

AAAI'23/IAAI'23/EAAI'23: Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence

Article No.: 129, Pages 1160 - 1168

https://doi.org/10.1609/aaai.v37i1.25198

Published: 07 February 2023

Abstract

Camera and radar sensors have significant advantages in cost, reliability, and maintenance compared to LiDAR. Existing fusion methods often fuse the outputs of single modalities at the result level, a strategy known as late fusion. Late fusion can reuse off-the-shelf single-sensor detection algorithms, but it cannot fully exploit the complementary properties of the sensors, so its performance remains limited despite the huge potential of camera-radar fusion. Here we propose a novel proposal-level early fusion approach that effectively exploits both spatial and contextual properties of camera and radar for 3D object detection. Our fusion framework first associates image proposals with radar points in the polar coordinate system to efficiently handle the discrepancy between the coordinate systems and spatial properties of the two sensors. With this as a first stage, consecutive cross-attention-based feature fusion layers then adaptively exchange spatio-contextual information between camera and radar, leading to a robust and attentive fusion. Our camera-radar fusion approach achieves a state-of-the-art 41.1% mAP and 52.3% NDS on the nuScenes test set, which is 8.7 and 10.8 points higher than the camera-only baseline, while yielding performance competitive with LiDAR-based methods.
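
To make the idea of associating image proposals with radar points in polar coordinates more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical 2D sensor frame (x forward, y left, in metres), a proposal reduced to an estimated ground-plane centre, and a fixed azimuth and range window. The function names `to_polar` and `associate` and both tolerance values are illustrative assumptions that do not come from the paper.

```python
import math
from typing import List, Tuple

# (x forward, y left) in metres; a hypothetical sensor frame for illustration.
Point = Tuple[float, float]


def to_polar(p: Point) -> Tuple[float, float]:
    """Return (range in metres, azimuth in radians) of a Cartesian point."""
    x, y = p
    return math.hypot(x, y), math.atan2(y, x)


def associate(proposal_center: Point,
              radar_points: List[Point],
              azimuth_tol: float = math.radians(5.0),
              range_tol: float = 3.0) -> List[int]:
    """Indices of radar points inside a polar window around a proposal.

    The window is a fixed band in azimuth and range. Both tolerances
    are assumed values chosen for the example, not taken from the paper.
    """
    r0, a0 = to_polar(proposal_center)
    matched = []
    for i, p in enumerate(radar_points):
        r, a = to_polar(p)
        # Wrap the azimuth difference into [-pi, pi) before comparing.
        da = (a - a0 + math.pi) % (2.0 * math.pi) - math.pi
        if abs(da) <= azimuth_tol and abs(r - r0) <= range_tol:
            matched.append(i)
    return matched


radar = [(20.0, 0.5), (21.5, -0.8), (20.0, 8.0), (40.0, 0.0)]
print(associate((20.0, 0.0), radar))  # prints [0, 1]
```

In this toy example the third point is rejected on azimuth and the fourth on range. A window defined in range and azimuth widens in Cartesian terms as distance grows, which is one plausible reason a polar formulation suits a sensor that measures range and angle; the abstract itself states only that the association is performed in polar coordinates.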

Cited By

Xu S, Jiang S, Li F, Liu L, Song Z, Yang B, Yang Z, Cai J, Kankanhalli M, Prabhakaran B, Boll S, Subramanian R, Zheng L, Singh V, Cesar P, Xie L, Xu D (2024). SparseInteraction: Sparse Semantic Guidance for Radar and Camera 3D Object Detection. Proceedings of the 32nd ACM International Conference on Multimedia, 9224-9233. Online publication date: 28-Oct-2024.
https://dl.acm.org/doi/10.1145/3664647.3681565
Liu Y, Wang F, Wang N, Zhang Z, Oh A, Naumann T, Globerson A, Saenko K, Hardt M, Levine S (2023). Echoes beyond points. Proceedings of the 37th International Conference on Neural Information Processing Systems, 53964-53982. Online publication date: 10-Dec-2023.
https://dl.acm.org/doi/10.5555/3666122.3668469

Information & Contributors

Information

Published In

AAAI'23/IAAI'23/EAAI'23: Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence

February 2023

16496 pages

ISBN:978-1-57735-880-0

Copyright © 2023 Association for the Advancement of Artificial Intelligence.

Sponsors

Association for the Advancement of Artificial Intelligence

Publisher

AAAI Press

Qualifiers

Research-article
Research
Refereed limited
