SORT3D: Spatial Object-centric Reasoning Toolbox for Zero-Shot 3D Grounding Using Large Language Models
We propose SORT3D, an LLM-based object-centric grounding and indoor navigation system that combines a spatial reasoning toolbox with state-of-the-art 2D VLMs for perception. The toolbox interprets both direct and indirect statements about spatial relations, using an LLM for high-level reasoning and guiding an autonomous robot through the environment. It achieves state-of-the-art zero-shot performance on spatial reasoning benchmarks. To the best of our knowledge, this is the first implementation of a general spatial relation toolbox for autonomous vision-language navigation that is fully integrated into real-robot systems.
wheelchair_VLA_with_rviz.mp4
This repository is set up to run both offline grounding evaluation on the ReferIt3D and VLA-3D benchmarks and online navigation, on real robots as well as in the provided simulated environments. We also provide a dataset of Scannet object crops and captions generated using our pipeline.
- [2025-03] We release SORT3D for offline grounding and online object-centric navigation.
- Repository Structure
- Data
- System Requirements
- SORT3D-Bench: Setup
- SORT3D-Bench: Usage
- SORT3D-Nav: Setup
- 0) Cloning Repo and Recommended Installation Method
- 1) Docker Installation (Recommended)
- 2) Pulling and Preparing Docker Image
- 3a) Building ROS Humble System with Wheelchair Simulator
- 3b) Building ROS Noetic System with Wheelchair Simulator (Ubuntu 22.04)
- 3c) Building ROS Humble System with Mecanum Simulator
- (Optional) Installing ROS Humble System Dependencies Without Docker
- (Optional) Installing ROS Noetic System Dependencies Without Docker
- SORT3D-Nav: Usage
- Troubleshooting
- Citation
SORT3D has two major versions:
- SORT3D-Bench: The version of SORT3D used to run the ReferIt3D and the IRef-VLA benchmarks.
- SORT3D-Nav: The version of SORT3D used to run navigation on our robot platforms, built on top of our base autonomy stack. SORT3D is deployed on two research platforms:
- Our wheelchair-base robot (wheelchair), for which we have both ROS Noetic and ROS Humble versions.
- Our mecanum-wheeled robot (mecanum), for which we have a ROS Humble version.
Platform | ROS Version | Branch | Simulation Available | Live Demo Available (Using ROS Bag) | Ground Truth Semantics Available | Semantic Mapping Module Available |
---|---|---|---|---|---|---|
Benchmark | - | humble-wheelchair | ☑️ | - | ☑️ | - |
Wheelchair | Noetic | noetic-wheelchair | ☑️ | ☑️ | ☑️ | ☑️ |
Wheelchair | Humble | humble-wheelchair | ☑️ | ☑️ | ☑️ | ☑️ |
Mecanum | Humble | humble-mecanum | ☑️ | ☑️ | ☑️ | ☑️ |
To run SORT3D-Bench, ensure the following three datasets are downloaded and unzipped:

- Object Captions Dataset: For our benchmark, we have pregenerated 2D object crops and captions using our captioning system and Qwen2.5-VL. To download, first install minio and tqdm:

  pip install minio tqdm

  Then run

  python data/download_crops_dataset.py --download_path data

  The data will be downloaded as a zip file in `data/`. Unzip the file directly into `data/`; the path to the unzipped folder should be `data/captions`.

- IRef-VLA Scannet: We use the processed point clouds from IRef-VLA for our benchmark. Follow the instructions in the repo and download only the Scannet subset of the data:

  python download_dataset.py --download_path data/IRef-VLA --subset scannet

  Afterwards, unzip Scannet.zip into `data/IRef-VLA`. The folder structure should be `data/IRef-VLA/Scannet`.

- ReferIt3D: We provide the subsets of ReferIt3D used for the benchmark in `data/referit3d`.
Extract the IRef-VLA and the captions data into the same folder. The final folder structure should look like so:
data/
    IRef-VLA/
        Scannet/
            scene0000_00
                instance_crops
                scene0000_00_free_space_pc_result.ply
                scene0000_00_...
            scene0000_01
                instance_crops
                scene0000_01_free_space_pc_result.ply
                scene0000_01_...
            ...
    referit3d/
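As an optional sanity check before running the benchmark (a minimal sketch, not part of the released tooling), you can verify that the expected folders are in place:

```python
from pathlib import Path

# Expected locations based on the instructions above.
data = Path("data")
checks = {
    "Object captions": data / "captions",
    "IRef-VLA Scannet": data / "IRef-VLA" / "Scannet",
    "ReferIt3D subsets": data / "referit3d",
}

for name, path in checks.items():
    status = "OK" if path.exists() else "MISSING"
    print(f"{name:<20} {path}  [{status}]")
```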
We provide ROS bag files for both the wheelchair and mecanum platforms. To download, install minio and tqdm:
pip install minio tqdm
Then run
python data/download_rosbag.py --download_path bagfiles --platform [wheelchair|mecanum]
Make sure to pick the correct platform. Each ROS bag will be downloaded as a zip file in `bagfiles/`. Unzip the bag files into your directory of choice before replaying them. The wheelchair bag file is currently available; the mecanum-wheeled robot bag file will be released together with the mecanum version of SORT3D-Nav.
SORT3D-Nav has been deployed on an Nvidia RTX 4090 with 24GB of VRAM to run the live captioning model on the wheelchair, and on an Nvidia RTX 4090 with 16GB of VRAM to run the live captioning model on the mecanum-wheeled robot. The system requires a minimum of:
- 10GB of VRAM to run the semantic mapping module along with live captioning.
- 7GB of VRAM to run using ground truth semantics with live captioning.
If you have more VRAM, you may increase `captioner_batch_size` in the run scripts for faster captioning throughput (or decrease it if you have less VRAM).
The language planner additionally requires an internet connection (e.g., over WiFi) on the robot to reach the Mistral servers. This system has been tested on Ubuntu 20.04, 22.04, and 24.04, running the Ubuntu 22.04 Docker image we provide.
First, make sure you are checked out into the `humble-wheelchair` branch:
git checkout humble-wheelchair
We provide a conda environment containing all the dependencies required for SORT3D-Bench, which does not require ROS. Create the conda environment like so:
conda env create -f environment.yml -n sort3d
A `requirements.txt` is also provided, mirroring the pip requirements in `environment.yml`. The Docker image also contains all the requirements for SORT3D-Bench preinstalled; you may follow sections 1-2 in Setup: SORT3D-Nav to install Docker and set up the image.
Build the Docker image:
docker build --network=host -t sort3d:latest -f docker/Dockerfile_benchmark .
Run the container:
docker run --gpus all -it --rm -v [CODE_PATH]:/home/sort3d/SORT3D sort3d:latest
Follow the instructions in Dataset For SORT3D-Bench to ensure the dataset is correctly set up.
SORT3D uses Mistral Large 2 by default. Create a free research API key, then set the `MISTRAL_API_KEY` environment variable:
export MISTRAL_API_KEY="YOUR API KEY HERE"
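As a quick sanity check before launching a run (a minimal sketch, not part of the benchmark code), you can confirm the key is visible to Python in the current shell:

```python
import os

# Fails fast if the key was not exported in the current shell.
key = os.environ.get("MISTRAL_API_KEY")
if not key:
    raise SystemExit("MISTRAL_API_KEY is not set; export it before running the benchmark.")
print(f"Mistral API key found ({len(key)} characters).")
```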
You may then run the benchmark on either Nr3D or Sr3D:
cd ai_module/src/language_planner/language_planner
conda activate sort3d # you can skip this if using the Docker image
python3 language_planner_benchmark.py --dataset [nr3d|sr3d] --log_dir [LOGFOLDER]
Choose `nr3d` or `sr3d` as the `--dataset` argument to run the benchmark on our subsets of Nr3D and Sr3D respectively. The benchmark results are logged in `ai_module/src/language_planner/language_planner/logs/exp###` by default (where ### starts at 000 and is automatically incremented with each run). The script logs all correct answers and LLM reasoning in `correct.json`, and all incorrect answers in `incorrect.json`.
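To summarize a finished run, you can compute accuracy directly from these two log files (a minimal sketch, assuming each file holds one entry per evaluated query):

```python
import json
from pathlib import Path

# Point this at the experiment folder you want to summarize, e.g. logs/exp000.
log_dir = Path("ai_module/src/language_planner/language_planner/logs/exp000")

correct = json.loads((log_dir / "correct.json").read_text())
incorrect = json.loads((log_dir / "incorrect.json").read_text())

total = len(correct) + len(incorrect)
print(f"{len(correct)}/{total} correct ({100 * len(correct) / total:.1f}% accuracy)")
```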
The script takes a set of optional arguments. The fully supported ones for this release are tabulated below:
Argument | Supported Values | Description |
---|---|---|
`--exp_name` | Any string | Give the current experiment an optional name. Default is exp###, where ### is an automatically assigned number. |
`--model` | `mistral`, `gpt-4o` | Use a different LLM for grounding. Default is Mistral, and we have tested GPT-4o in our paper; other models included in our code may be buggy. For OpenAI, provide the API key in the OPENAI_API_KEY environment variable. |
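For example, to run our Nr3D subset with GPT-4o under a named experiment (the experiment name here is just an illustration, and OPENAI_API_KEY must be set in your environment):
python3 language_planner_benchmark.py --dataset nr3d --model gpt-4o --exp_name gpt4o_nr3d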
Begin by cloning the repo with its submodules in your home directory:
cd ~
git clone https://github.com/nzantout/SORT3D.git --recursive
We provide a CUDA-enabled Ubuntu 22.04 Docker image with both ROS Noetic (built from source) and ROS Humble preinstalled. This is the recommended way to run SORT3D, as ROS and all dependencies come preinstalled in the image. Follow sections 1 through 3 to install Docker on your computer, pull the image, and download simulation files. The user home directory, `/home/$USER`, is mounted as a volume in the container, allowing access to the repo from inside the container as long as the repo has been cloned within the home directory. We also provide optional instructions for installing the system on a base Ubuntu 22.04 system with either ROS Humble or ROS Noetic.
Install Docker and grant user permission.
curl https://get.docker.com | sh && sudo systemctl --now enable docker
sudo usermod -aG docker ${USER}
Make sure to restart the computer, then install Nvidia Container Toolkit (Nvidia GPU Driver should be installed already).
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor \
-o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install nvidia-container-toolkit
Configure Docker runtime and restart Docker daemon.
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Test if the installation is successful. You should see something like below.
docker run --gpus all --rm nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
Sat Dec 16 17:27:17 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| 24% 50C P0 40W / 200W | 918MiB / 8192MiB | 3% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Allow remote X connections.
xhost +
Pull the Docker image and build the container:
cd docker
docker compose -f compose_gpu.yml up --build -d
To run without rebuilding:
docker compose -f compose_gpu.yml up -d
You may then access the running container.
docker exec -it ubuntu22_ros bash
Make sure you are checked out into the `humble-wheelchair` branch:
git checkout humble-wheelchair
The instructions for building the base system are excerpted from its original repo. Start by making sure ROS Humble is sourced:
source /opt/ros/humble/setup.bash
Then build the base autonomy system in `simulator/wheelchair_unity`:
cd simulator/wheelchair_unity
colcon build --symlink-install --cmake-args -DCMAKE_BUILD_TYPE=Release
Download any of our Unity environment models (the models are configured for ROS2, not compatible with ROS1) and unzip the files to the 'src/vehicle_simulator/mesh/unity' folder. The environment model files should look like below. Note that the 'AssetList.csv' file is generated upon start of the system.
mesh/
unity/
environment/
Model_Data/ (multiple files in the folder)
Model.x86_64
UnityPlayer.so
AssetList.csv (generated at runtime)
Dimensions.csv
Categories.csv
map.ply
object_list.txt
traversable_area.ply
map.jpg
render.jpg
Build SORT3D-Nav in `ai_module`:
cd ../../ai_module
colcon build --symlink-install
Afterwards, install the following dependencies in the semantic mapping module. If you are using our provided Docker image, all the other dependencies in the repositories are preinstalled, and you only need to install these. Otherwise, if you want to use the module outside the image, follow the instructions in the repo README.
cd ../semantic_mapper/external
pip install Grounded-SAM-2/grounding_dino
pip install Grounded-SAM-2
pip install byte_track cython_bbox
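Before launching the full system, it can help to confirm that PyTorch sees your GPU from inside the container (a quick optional sanity check, not a required step):

```python
import torch

# The captioning and semantic mapping modules need CUDA; this should print True.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```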
Make sure you are checked out into the `noetic-wheelchair` branch:
git checkout noetic-wheelchair
The instructions for building the base system are excerpted from its original repo. Since SORT3D requires Python > 3.9, ROS Noetic cannot be used on its default Ubuntu 20.04 and must instead be built from source on Ubuntu 22.04. Instructions for building ROS Noetic from source on Ubuntu 22.04 are in this section, and ROS Noetic comes prebuilt in the provided Docker image. The base autonomy system requires extra ROS dependencies which we have modified to compile on Ubuntu 22.04, found in `simulator/noetic_ubuntu22_extra_deps`. These dependencies must be built first, then their workspace overlaid by sourcing it before building the simulator workspace:
source /opt/ros/noetic/setup.bash
cd simulator/noetic_ubuntu22_extra_deps
catkin_make
source devel/setup.bash
cd ../wheelchair_unity
catkin_make
Download any of our Unity environment models (the models are configured for ROS1, not compatible with ROS2) and unzip the files to the 'src/vehicle_simulator/mesh/unity' folder. The environment model files should look like below. Note that the 'AssetList.csv' file is generated upon start of the system.
mesh/
unity/
environment/
Model_Data/ (multiple files in the folder)
Model.x86_64
UnityPlayer.so
AssetList.csv (generated at runtime)
Dimensions.csv
Categories.csv
map.ply
object_list.txt
traversable_area.ply
map.jpg
render.jpg
Build SORT3D-Nav in `ai_module`:
cd ../../ai_module
catkin_make
Afterwards, install the following dependencies in the semantic mapping module. If you are using our provided Docker image, all the other dependencies in the repositories are preinstalled, and you only need to install these. Otherwise, if you want to use the module outside the image, follow the instructions in the repo README.
cd ../semantic_mapper/external
pip install Grounded-SAM-2/grounding_dino
pip install Grounded-SAM-2
pip install byte_track cython_bbox
Make sure you are checked out into the `humble-mecanum` branch:
git checkout humble-mecanum
The instructions for building the base system are excerpted from its original repo. Start by making sure ROS Humble is sourced:
source /opt/ros/humble/setup.bash
Then build the base autonomy system in `simulator/mecanum_unity`, skipping the SLAM module and Mid-360 lidar driver (these two packages are not needed for simulation):
cd simulator/mecanum_unity
colcon build --symlink-install --cmake-args -DCMAKE_BUILD_TYPE=Release --packages-skip arise_slam_mid360 arise_slam_mid360_msgs livox_ros_driver2
Download a Unity environment model for the Mecanum wheel platform and unzip the files to the 'src/base_autonomy/vehicle_simulator/mesh/unity' folder. The environment model files should look like below.
mesh/
unity/
environment/
Model_Data/ (multiple files in the folder)
Model.x86_64
UnityPlayer.so
AssetList.csv (generated at runtime)
Dimensions.csv
Categories.csv
map.ply
object_list.txt
traversable_area.ply
map.jpg
render.jpg
Build SORT3D-Nav in `ai_module`:
cd ../../ai_module
colcon build --symlink-install
Afterwards, install the following dependencies in the semantic mapping module. If you are using our provided Docker image, all the other dependencies in the repositories are preinstalled, and you only need to install these. Otherwise, if you want to use the module outside the image, follow the instructions in the repo README.
cd ../semantic_mapper/external
pip install Grounded-SAM-2/grounding_dino
pip install Grounded-SAM-2
pip install byte_track cython_bbox
This section contains instructions to install ROS Humble and SORT3D-Nav system dependencies on a base Ubuntu 22.04 system. Please report any issues to the issue tracker.
- Begin by installing ros-humble-desktop, following the ROS wiki page.
- Install CUDA Toolkit 12.x following the instructions on the official website. This system has been tested with CUDA 12.1, but should work with higher CUDA versions.
- Install ROS Humble dependencies for the base autonomy system:
sudo apt update
sudo apt install libusb-dev ros-humble-perception-pcl ros-humble-sensor-msgs-py ros-humble-tf-transformations ros-humble-joy python3-colcon-common-extensions python-is-python3
pip install transforms3d pyyaml
- Install the pip dependencies for SORT3D-Nav. Make sure you are in this repo's top level directory:
pip install -r requirements.txt
- Follow Section 3a or Section 3c to set up the system.
This section contains instructions to build ROS Noetic from source and SORT3D-Nav system dependencies on a base Ubuntu 22.04 system. Please report any issues to the issue tracker.
- As ROS Noetic does not support Ubuntu 22.04, it must be built from source. Follow the instructions in this Reddit post, mirrored in this repository.
- Install CUDA Toolkit 12.x following the instructions on the official website. This system has been tested with CUDA 12.1, but should work with higher CUDA versions.
- Install ROS Noetic dependencies for the base autonomy system:
sudo apt update
sudo apt install libusb-dev python-yaml python-is-python3
- Install the pip dependencies for SORT3D-Nav. Make sure you are in this repo's top level directory:
pip install -r requirements.txt
- Follow Section 3b to set up the system.
The instructions for running the simulated system using ground truth semantics are the same regardless of which branch you are using. Check out the branch you wish to run.
SORT3D uses Mistral Large 2 by default. Create a free research API key, then replace the placeholder in `scripts/run_full_system_gt_semantics.sh` with your API key:
export MISTRAL_API_KEY="YOUR API KEY HERE"
You may do the same with `scripts/run_sort3d_navigation_gt_semantics.sh` if you want to run SORT3D separately from the base autonomy system. Make sure all the scripts are executable:
chmod -R +x scripts
Then, in one terminal, run
scripts/run_full_system_gt_semantics.sh
Wait until the system starts up; you should see the RViz and Unity windows open. In your terminal, the captioning and language planner nodes will be logging to standard output.
In another terminal, run the query publisher node, which reads queries from standard input:
scripts/run_query_publisher.sh
The output of the query publisher node should look like so:
[INFO] [1743652602.832061201] [language_publisher]: LanguagePublisher node has been started. Type your query below.
Enter a query to publish:
You may then type a natural language navigation statement, like "go near the red chair", and watch the system navigate:
sort3d_gt_semantics_demo.mp4
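On the ROS 2 (Humble) branches, you can also send queries programmatically instead of typing them into the query publisher. The sketch below is illustrative only; the topic name and message type are assumptions, so check `scripts/run_query_publisher.sh` and the language planner node for the names actually used:

```python
import time

import rclpy
from std_msgs.msg import String

# Hypothetical topic name for illustration; the real topic may differ.
QUERY_TOPIC = "/navigation_query"

def main():
    rclpy.init()
    node = rclpy.create_node("query_publisher_example")
    pub = node.create_publisher(String, QUERY_TOPIC, 10)
    time.sleep(1.0)  # give DDS discovery a moment before publishing
    msg = String()
    msg.data = "go near the red chair"
    pub.publish(msg)
    node.get_logger().info(f"Published query: {msg.data}")
    node.destroy_node()
    rclpy.shutdown()

if __name__ == "__main__":
    main()
```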
The instructions for running the simulated system with the semantic mapping module are the same regardless of which branch you are using. Check out the branch you wish to run.
Create a free research API key for Mistral Large 2, then replace the placeholder in `scripts/run_full_system_semantic_mapping.sh` with your API key:
export MISTRAL_API_KEY="YOUR API KEY HERE"
You may do the same with `scripts/run_sort3d_navigation_semantic_mapping.sh` if you want to run SORT3D separately from the base autonomy system. Make sure all the scripts are executable:
chmod -R +x scripts
Then, in one terminal, run
scripts/run_full_system_semantic_mapping.sh
Wait until the system starts up; you should see the RViz and Unity windows open. In your terminal, the semantic mapping and language planner nodes will be logging to standard output.
In another terminal, run the query publisher node, which reads queries from standard input:
scripts/run_query_publisher.sh
The output of the query publisher node should look like so:
[INFO] [1743652602.832061201] [language_publisher]: LanguagePublisher node has been started. Type your query below.
Enter a query to publish:
Afterwards, drive the robot around to create a semantic map of the scene, either by using the virtual joystick or by clicking the "Waypoint with Heading" button and supplying waypoints:
semantic_mapping.mp4
You may then type a natural language navigation statement, like "go to the potted plant furthest from you", and watch the system navigate:
sort3d_semantic_mapping_demo.mp4
We provide ROS bags of various indoor environments to demonstrate SORT3D-Nav in real environments. Follow the instructions above to download a ROS bag for either the mecanum-wheeled robot or the wheelchair-base robot. Again, make sure you have created a free research API key for Mistral Large 2, then replace the placeholder in `scripts/run_sort3d_navigation_semantic_mapping.sh` with your API key:
export MISTRAL_API_KEY="YOUR API KEY HERE"
Start by running the script for SORT3D-Nav using semantic mapping (script is the same regardless of which branch you are using):
scripts/run_sort3d_navigation_semantic_mapping.sh
Run the RViz viewer in a second terminal:
scripts/run_rviz_viewer.sh
In a third terminal, play the ROS bag you downloaded. If you are using ROS 1:
rosbag play [ros_bag].bag
If you are using ROS 2:
ros2 bag play [ros_bag].db3
Follow the instructions on screen to pause/unpause the bag file. Run the bag for a while to build the map; you can watch it being generated in the RViz window:
rosbag_demo.mp4
To see the target bounding boxes for a query being generated, you may pause the ROS bag, then run the query publisher in a fourth terminal and provide a query:
scripts/run_query_publisher.sh
Please report any issues you face in the issue tracker, and we'll add them here.
If you use our work, please cite:
@misc{zantout2025sort3dspatialobjectcentricreasoning,
title={SORT3D: Spatial Object-centric Reasoning Toolbox for Zero-Shot 3D Grounding Using Large Language Models},
author={Nader Zantout and Haochen Zhang and Pujith Kachana and Jinkai Qiu and Ji Zhang and Wenshan Wang},
year={2025},
eprint={2504.18684},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2504.18684},
}