ORION: A Holistic End-to-End Autonomous Driving Framework
by Vision-Language Instructed Action Generation
Haoyu Fu1*, Diankun Zhang2*, Zongchuang Zhao1*, Jianfeng Cui2, Dingkang Liang1†,
Chong Zhang2, Dingyuan Zhang1, Hongwei Xie2†, Bing Wang2, Xiang Bai1
1 Huazhong University of Science & Technology, 2 Xiaomi EV
(*) Equal contribution. (†) Project leader.
End-to-end (E2E) autonomous driving methods still struggle to make correct decisions in interactive closed-loop evaluation due to limited causal reasoning capability. Current methods attempt to leverage the powerful understanding and reasoning abilities of Vision-Language Models (VLMs) to resolve this dilemma. However, few VLM-based E2E methods perform well in closed-loop evaluation, owing to the gap between the semantic reasoning space and the purely numerical trajectory output of the action space. To tackle this issue, we propose ORION, a hOlistic E2E autonomous dRiving framework by vIsion-language instructed actiON generation. ORION uniquely combines a QT-Former to aggregate long-term history context, a Large Language Model (LLM) for driving-scenario reasoning, and a generative planner for precise trajectory prediction. ORION further aligns the reasoning space and the action space to enable unified E2E optimization for both visual question-answering (VQA) and planning tasks. Our method achieves an impressive closed-loop performance of 77.74 Driving Score (DS) and 54.62% Success Rate (SR) on the challenging Bench2Drive benchmark, outperforming state-of-the-art (SOTA) methods by a large margin of 14.28 DS and 19.61% SR.
[2025/04/10]
ORION inference code and checkpoint release.
[2025/03/26]
ArXiv paper release.
- ORION Inference Framework
- Open-loop Evaluation
- Closed-loop Evaluation
- ORION Checkpoint
- Chat-B2D Dataset
- ORION Training Framework
git clone https://github.com/xiaomi-mlab/Orion.git
cd Orion
conda create -n orion python=3.8 -y
conda activate orion
pip install torch==2.4.1+cu118 torchvision==0.19.1+cu118 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu118
pip install -v -e .
pip install -r requirements.txt
You can refer to here to prepare the Bench2Drive dataset.
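If you keep the prepared data inside the repository, one possible layout is sketched below. The directory names are assumptions (not verified against the configs), so follow the linked guide for the authoritative structure.

```bash
# Hypothetical layout sketch: symlink the prepared Bench2Drive data into the repo root.
# The target directory name is an assumption; the linked guide is authoritative.
cd /path/to/Orion
mkdir -p data
ln -s /path/to/bench2drive ./data/bench2drive
```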
ORION uses the pretrained 2D LLM weights and the vision encoder + projector weights provided by OmniDrive.
cd /path/to/OmniDrive
mkdir ckpts
The vision encoder + projector weights are extracted from ckpts/pretrain_qformer/, which is pretrained on LLaVA data.
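One way to make these weights visible from the ORION repository is sketched below. The ORION-side ckpts/ location and the symlink layout are assumptions; adapt the paths to wherever your configs expect the pretrained weights.

```bash
# Minimal sketch (paths are assumptions, not an official layout):
# expose the OmniDrive pretrain_qformer weights inside the ORION repo.
cd /path/to/Orion
mkdir -p ckpts
ln -s /path/to/OmniDrive/ckpts/pretrain_qformer ./ckpts/pretrain_qformer
```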
You can perform an open-loop evaluation of ORION with the following command:
./adzoo/orion/orion_dist_eval.sh adzoo/orion/configs/orion_stage3.py [--PATH_CHECKPOINTS] 1
You can also perform chain-of-thought (CoT) inference with ORION using the following command (this might be quite slow):
./adzoo/orion/orion_dist_eval.sh adzoo/orion/configs/orion_stage3_cot.py [--PATH_CHECKPOINTS] 1
We recommend running ORION inference on an NVIDIA A100 or another GPU with more than 32 GB of memory (FP32 inference is the default).
ORION can also run FP16 inference with almost the same performance; for FP16, we recommend a GPU with more than 17 GB of memory:
./adzoo/orion/orion_dist_eval.sh adzoo/orion/configs/orion_stage3_fp16.py [--PATH_CHECKPOINTS] 1
You can refer to here to clone the Bench2Drive evaluation tools and prepare CARLA for them.
Follow here to use the Bench2Drive evaluation tools.
Note that you should first verify the correctness of the team agent; to do so, set GPU_RANK, TEAM_AGENT, and TEAM_CONFIG in the evaluation scripts.
For closed-loop evaluation, you can set:
TEAM_CONFIG=adzoo/orion/configs/orion_stage3_agent.py+[CHECKPOINT_PATH]
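For reference, the relevant variables in the Bench2Drive evaluation script might look like the sketch below. GPU_RANK and the TEAM_AGENT path are placeholder assumptions (use the agent file provided with the evaluation tools), and the checkpoint path is whichever checkpoint you downloaded.

```bash
# Sketch of the variables to set in the Bench2Drive eval script.
# GPU_RANK and TEAM_AGENT are placeholder assumptions; keep your own checkpoint path.
GPU_RANK=0
TEAM_AGENT=/path/to/orion_team_agent.py
TEAM_CONFIG=adzoo/orion/configs/orion_stage3_agent.py+[CHECKPOINT_PATH]
```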
The results of UniAD and VAD are taken from the official Bench2DriveZoo results.
Method | L2 (m) @ 2s | Driving Score | Success Rate (%) | Config | Download | Eval Json |
---|---|---|---|---|---|---|
UniAD-Tiny | 0.80 | 40.73 | 13.18 | config | Hugging Face/Baidu Cloud | Json |
UniAD-Base | 0.73 | 45.81 | 16.36 | config | Hugging Face/Baidu Cloud | Json |
VAD | 0.91 | 42.35 | 15.00 | config | Hugging Face/Baidu Cloud | Json |
ORION | 0.68 | 77.74 | 54.62 | config | Hugging Face | Json |
We provide visualization videos and a qualitative analysis of ORION, compared with TCP-traj, UniAD-Base, and VAD-Base, here.
If this work is helpful for your research, please consider citing:
@article{fu2025orion,
title={ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation},
author={Haoyu Fu and Diankun Zhang and Zongchuang Zhao and Jianfeng Cui and Dingkang Liang and Chong Zhang and Dingyuan Zhang and Hongwei Xie and Bing Wang and Xiang Bai},
journal={arXiv preprint arXiv:2503.19755},
year={2025}
}