
SORT3D: Spatial Object-centric Reasoning Toolbox for Zero-Shot 3D Grounding Using Large Language Models

   

We propose SORT3D, an LLM-based object-centric grounding and indoor navigation system employing a spatial reasoning toolbox and state-of-the-art 2D VLMs for perception. The toolbox is capable of interpreting both direct and indirect statements about spatial relations, using an LLM for high-level reasoning and guiding the autonomous robot to navigate through the environment. It has demonstrated the best zero-shot performance on spatial reasoning benchmarks. To the best of our knowledge, this is the first implementation of a general spatial relation toolbox for autonomous vision-language navigation that is fully integrated into real-robot systems.

 

[Figure: SORT3D system diagram]

 

[Video: wheelchair_VLA_with_rviz.mp4]

This repository is set up to run both grounding evaluation on the ReferIt3D and IRef-VLA benchmarks and online navigation, on real robots as well as in the provided simulated environments. We also provide a dataset of Scannet object crops and captions generated using our pipeline.

Updates

  • [2025-03] We release SORT3D for offline grounding and online object-centric navigation.


Repository Structure

SORT3D has two major versions:

  1. SORT3D-Bench: The version of SORT3D used to run the ReferIt3D and the IRef-VLA benchmarks.
  2. SORT3D-Nav: The version of SORT3D used to run navigation on our robot platforms, built on top of our base autonomy stack. SORT3D is deployed on two research platforms:
    1. Our wheelchair-based robot (wheelchair), for which we provide both ROS Noetic and ROS Humble versions.
    2. Our mecanum-wheeled robot (mecanum), for which we have a ROS Humble version.
       

  This repository contains a separate branch for each platform and ROS version that SORT3D-Nav is deployed on. The SORT3D-Bench script is included in the `humble-wheelchair` branch. Each version of SORT3D-Nav is accompanied by a Unity-based simulator and a ROS bag recording of the office areas in which the live demonstrations were recorded. Additionally, we provide launch scripts for SORT3D-Nav using both ground-truth semantic segmentations and our live semantic mapping module. The table below summarizes the currently available systems and their respective branches:

| Platform | ROS Version | Branch | Simulation Available | Live Demo Available (Using ROS Bag) | Ground Truth Semantics Available | Semantic Mapping Module Available |
|----------|-------------|--------|----------------------|-------------------------------------|----------------------------------|-----------------------------------|
| Benchmark | - | humble-wheelchair | ☑️ | - | ☑️ | - |
| Wheelchair | Noetic | noetic-wheelchair | ☑️ | ☑️ | ☑️ | ☑️ |
| Wheelchair | Humble | humble-wheelchair | ☑️ | ☑️ | ☑️ | ☑️ |
| Mecanum | Humble | humble-mecanum | ☑️ | ☑️ | ☑️ | ☑️ |

Data

Dataset For SORT3D-Bench

To run SORT3D-Bench, ensure the following three datasets are downloaded and unzipped:

  1. Object Captions Dataset: For our benchmark, we have pregenerated 2D object crops and captions using our captioning system and Qwen2.5-VL. To download, first install minio and tqdm:

    pip install minio tqdm

    Then run

    python data/download_crops_dataset.py --download_path data

    The data will be downloaded as a zip file in data/. Unzip the file directly into data/; the path to the unzipped folder should be data/captions.

  2. IRef-VLA Scannet: We use the processed pointclouds in IRef-VLA for our benchmark. Follow the instructions in the repo and download only the Scannet subset of the data:

    python download_dataset.py --download_path data/IRef-VLA --subset scannet

    Afterwards, unzip Scannet.zip into data/IRef-VLA. The folder structure should be data/IRef-VLA/Scannet.

  3. ReferIt3D: We provide the subsets of ReferIt3D used for the benchmark in data/referit3d.

Extract the IRef-VLA and the captions data into the same folder; a sketch of the full download-and-extract sequence follows the tree below. The final folder structure should look like this:

data/
    IRef-VLA/
        Scannet/
            scene0000_00
                 instance_crops
                 scene0000_00_free_space_pc_result.ply
                 scene0000_00_...
            scene0000_01
                 instance_crops
                 scene0000_01_free_space_pc_result.ply
                 scene0000_01_...
            ...
    referit3d/
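
For reference, a minimal end-to-end sketch of the download-and-extract sequence above. The captions archive name below is an assumption (use whatever data/download_crops_dataset.py actually produces), and download_dataset.py is the script from the IRef-VLA repository:

pip install minio tqdm

# 1) Object captions -> data/captions (archive name below is assumed)
python data/download_crops_dataset.py --download_path data
unzip data/captions.zip -d data

# 2) IRef-VLA Scannet subset -> data/IRef-VLA/Scannet (download_dataset.py comes from the IRef-VLA repo)
python download_dataset.py --download_path data/IRef-VLA --subset scannet
unzip data/IRef-VLA/Scannet.zip -d data/IRef-VLA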

ROS Bag Files for SORT3D-Nav

We provide ROS bag files for both the wheelchair and mecanum platforms. To download, install minio and tqdm:

pip install minio tqdm

Then run

python data/download_rosbag.py --download_path bagfiles --platform [wheelchair|mecanum]

while making sure to pick the correct platform. Each ROS bag is downloaded as a zip file in bagfiles/. Unzip the bag files into a directory of your choice before replaying them. The wheelchair bag file is currently available; the mecanum-wheeled robot bag file will follow with the release of the mecanum version of SORT3D-Nav.
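
As a concrete sketch for the wheelchair bag (the archive name and destination below are placeholders; substitute whatever the download script produces):

pip install minio tqdm
python data/download_rosbag.py --download_path bagfiles --platform wheelchair
unzip bagfiles/wheelchair_rosbag.zip -d ~/rosbags   # placeholder archive name and destination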

System Requirements

Hardware Requirements

SORT3D-Nav has been deployed on an Nvidia RTX 4090 with 24GB of VRAM to run the live captioning model on the wheelchair, and on an Nvidia RTX 4090 with 16GB of VRAM to run the live captioning model on the mecanum-wheeled robot. The system requires a minimum of:

  • 10GB of VRAM to run the semantic mapping module along with live captioning.
  • 7GB of VRAM to run using ground truth semantics with live captioning.

If you have more VRAM, you may increase captioner_batch_size in the run scripts for higher captioning throughput (or decrease it if you have less VRAM).
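
For example, a quick way to locate the parameter before editing it (a sketch; where exactly it is defined may differ between branches and run scripts):

# List every run script that sets the captioning batch size, then edit the value in place.
grep -rn "captioner_batch_size" scripts/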

The language planner additionally requires a WiFi connection on the robot in order to reach the Mistral servers. The system has been tested on Ubuntu 20.04, 22.04, and 24.04 hosts, running inside the Ubuntu 22.04 Docker image we provide.

SORT3D-Bench: Setup

1.1) Conda Environment

First, make sure you are checked out into humble-wheelchair:

git checkout humble-wheelchair

We provide a conda environment containing all the dependencies required for SORT3D-Bench, which does not require ROS. Create the conda environment as follows:

conda env create -f environment.yml -n sort3d

A requirements.txt mirroring the pip requirements in environment.yml is also provided. The Docker image also comes with all SORT3D-Bench requirements preinstalled; you may follow sections 1-2 in SORT3D-Nav: Setup to install Docker and set up the image.
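
If you prefer a plain virtual environment over conda, a minimal sketch (assuming a Python > 3.9 interpreter is already installed):

python3 -m venv .venv                  # create an isolated environment
source .venv/bin/activate
pip install -r requirements.txt        # mirrors the pip section of environment.yml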

1.2) Docker (Alternative)

Build the Docker image:

docker build --network=host -t sort3d:latest -f docker/Dockerfile_benchmark .

Run the container:

docker run --gpus all -it --rm -v [CODE_PATH]:/home/sort3d/SORT3D sort3d:latest

2) Dataset Setup

Follow the instructions in Dataset For SORT3D-Bench to ensure the dataset is correctly set up.

SORT3D-Bench: Usage

SORT3D uses Mistral Large 2 by default. Create a free research API key, then set the environment variable MISTRAL_API_KEY:

export MISTRAL_API_KEY="YOUR API KEY HERE"

You may then run the benchmark on either Nr3D or Sr3D:

cd ai_module/src/language_planner/language_planner
conda activate sort3d # skip this if you are using the Docker image
python3 language_planner_benchmark.py --dataset [nr3d|sr3d] --log_dir [LOGFOLDER]

Choose nr3d or sr3d as the --dataset argument to run the benchmark on our subsets of Nr3D and Sr3D respectively. The benchmark results are logged in ai_module/src/language_planner/language_planner/logs/exp### by default (where ### starts at 000 and is automatically incremented with each run). The script logs all correct answers and LLM reasoning in correct.json, and all incorrect answers in incorrect.json.
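
Once a run finishes, you can get a rough accuracy figure from the two log files. The snippet below is a sketch that assumes exp000 is your run and that each file is a JSON array of entries; the actual log structure may differ.

# Requires jq; assumes each log file is a JSON array.
LOG=ai_module/src/language_planner/language_planner/logs/exp000
correct=$(jq length "$LOG/correct.json")
incorrect=$(jq length "$LOG/incorrect.json")
echo "accuracy: $correct / $((correct + incorrect))"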

The script takes a set of optional arguments. The fully supported ones for this release are tabulated below:

| Argument | Supported Values | Description |
|----------|------------------|-------------|
| --exp_name | Any string | Give the current experiment an optional name. Default is exp###, where ### is an automatically assigned number. |
| --model | mistral, gpt-4o | Use a different LLM for grounding. Default is Mistral, and we have tested GPT-4o in our paper; other models included in our code may be buggy. For OpenAI, provide the API key in the OPENAI_API_KEY environment variable. |
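
For example, a GPT-4o run on our Nr3D subset might look like the following (the experiment name here is arbitrary):

export OPENAI_API_KEY="YOUR API KEY HERE"
python3 language_planner_benchmark.py --dataset nr3d --model gpt-4o --exp_name gpt4o_nr3d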

SORT3D-Nav: Setup

0) Cloning Repo and Recommended Installation Method

Begin by cloning the repo with its submodules in your home directory:

cd ~
git clone https://github.com/nzantout/SORT3D.git --recursive

We provide a CUDA-enabled Ubuntu 22.04 Docker image with both ROS Noetic (built from source) and ROS Humble preinstalled. This is the recommended way to run SORT3D, since ROS and all dependencies come preinstalled in the image. Follow sections 1 through 3 to install Docker on your computer, pull the image, and download the simulation files. The user home directory, /home/$USER, is mounted as a volume in the container, so the repo is accessible from inside the container as long as it was cloned within your home directory. We also provide optional instructions for installing the system on a base Ubuntu 22.04 system for both ROS Humble and ROS Noetic.

1) Docker Installation (Recommended)

Install Docker and grant user permission.

curl https://get.docker.com | sh && sudo systemctl --now enable docker
sudo usermod -aG docker ${USER}

Make sure to restart the computer, then install the Nvidia Container Toolkit (the Nvidia GPU driver should already be installed).

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor \
  -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install nvidia-container-toolkit

Configure Docker runtime and restart Docker daemon.

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Test that the installation was successful; you should see output similar to the following.

docker run --gpus all --rm nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
Sat Dec 16 17:27:17 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 24%   50C    P0    40W / 200W |    918MiB /  8192MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

2) Pulling and Preparing Docker Image

Allow remote X connections.

xhost +

Pull the Docker image and build the container:

cd docker
docker compose -f compose_gpu.yml up --build -d

To run without rebuilding:

docker compose -f compose_gpu.yml up -d

You may then access the running container.

docker exec -it ubuntu22_ros bash

3a) Building ROS Humble System with Wheelchair Simulator

Make sure you are checked out into humble-wheelchair:

git checkout humble-wheelchair

The instructions for building the base system are excerpted from its original repo. Start by making sure ROS Humble is sourced:

source /opt/ros/humble/setup.bash

Then build the base autonomy system in simulator/wheelchair_unity:

cd simulator/wheelchair_unity
colcon build --symlink-install --cmake-args -DCMAKE_BUILD_TYPE=Release

Download any of our Unity environment models (the models are configured for ROS2, not compatible with ROS1) and unzip the files to the 'src/vehicle_simulator/mesh/unity' folder; a sketch of this step follows the tree below. The environment model files should be laid out as below. Note that the 'AssetList.csv' file is generated when the system starts.

mesh/
    unity/
        environment/
            Model_Data/ (multiple files in the folder)
            Model.x86_64
            UnityPlayer.so
            AssetList.csv (generated at runtime)
            Dimensions.csv
            Categories.csv
        map.ply
        object_list.txt
        traversable_area.ply
        map.jpg
        render.jpg
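
A minimal sketch of the extraction step above, assuming the downloaded archive is named environment.zip and that you are inside simulator/wheelchair_unity:

# Placeholder archive name: replace with the environment model you downloaded.
mkdir -p src/vehicle_simulator/mesh/unity
unzip environment.zip -d src/vehicle_simulator/mesh/unity/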

Build SORT3D-Nav in ai_module:

cd ../../ai_module
colcon build --symlink-install

Afterwards, install the following dependencies in the semantic mapping module. If you are using our provided Docker image, all the other dependencies in the repositories are preinstalled, and you only need to install these. Otherwise, if you want to use the module outside the image, follow the instructions in the repo README.

cd ../semantic_mapper/external
pip install Grounded-SAM-2/grounding_dino
pip install Grounded-SAM-2
pip install byte_track cython_bbox

3b) Building ROS Noetic System with Wheelchair Simulator (Ubuntu 22.04)

Make sure you are checked out into noetic-wheelchair:

git checkout noetic-wheelchair

The instructions for building the base system are excerpted from its original repo. Since SORT3D requires Python > 3.9, ROS Noetic cannot be used on its default Ubuntu 20.04 and must instead be built from source on Ubuntu 22.04. Instructions for building ROS Noetic from source on Ubuntu 22.04 are given in the optional ROS Noetic installation section below; ROS Noetic also comes prebuilt in the provided Docker image. The base autonomy system requires extra ROS dependencies, which we have modified to compile on Ubuntu 22.04 and placed in simulator/noetic_ubuntu22_extra_deps. These dependencies must be built first and their workspace overlaid (by sourcing it) before building the simulator workspace:

source /opt/ros/noetic/setup.bash
cd simulator/noetic_ubuntu22_extra_deps
catkin_make
source devel/setup.bash
cd ../wheelchair_unity
catkin_make

Download any of our Unity environment models (the models are configured for ROS1, not compatible with ROS2) and unzip the files to the 'src/vehicle_simulator/mesh/unity' folder. The environment model files should look like below. Note that the 'AssetList.csv' file is generated upon start of the system.

mesh/
    unity/
        environment/
            Model_Data/ (multiple files in the folder)
            Model.x86_64
            UnityPlayer.so
            AssetList.csv (generated at runtime)
            Dimensions.csv
            Categories.csv
        map.ply
        object_list.txt
        traversable_area.ply
        map.jpg
        render.jpg

Build SORT3D-Nav in ai_module:

cd ../../ai_module
catkin_make

Afterwards, install the following dependencies in the semantic mapping module. If you are using our provided Docker image, all the other dependencies in the repositories are preinstalled, and you only need to install these. Otherwise, if you want to use the module outside the image, follow the instructions in the repo README.

cd ../semantic_mapper/external
pip install Grounded-SAM-2/grounding_dino
pip install Grounded-SAM-2
pip install byte_track cython_bbox

3c) Building ROS Humble System with Mecanum Simulator

Make sure you are checked out into humble-mecanum:

git checkout humble-mecanum

The instructions for building the base system are excerpted from its original repo. Start by making sure ROS Humble is sourced:

source /opt/ros/humble/setup.bash

Then build the base autonomy system in simulator/mecanum_unity, skipping the SLAM module and Mid-360 lidar driver (the two packages are not needed for simulation):

cd simulator/mecanum_unity
colcon build --symlink-install --cmake-args -DCMAKE_BUILD_TYPE=Release --packages-skip arise_slam_mid360 arise_slam_mid360_msgs livox_ros_driver2

Download a Unity environment model for the Mecanum wheel platform and unzip the files to the 'src/base_autonomy/vehicle_simulator/mesh/unity' folder. The environment model files should look like below.

mesh/
    unity/
        environment/
            Model_Data/ (multiple files in the folder)
            Model.x86_64
            UnityPlayer.so
            AssetList.csv (generated at runtime)
            Dimensions.csv
            Categories.csv
        map.ply
        object_list.txt
        traversable_area.ply
        map.jpg
        render.jpg

Build SORT3D-Nav in ai_module:

cd ../../ai_module
colcon build --symlink-install

Afterwards, install the following dependencies in the semantic mapping module. If you are using our provided Docker image, all the other dependencies in the repositories are preinstalled, and you only need to install these. Otherwise, if you want to use the module outside the image, follow the instructions in the repo README.

cd ../semantic_mapper/external
pip install Grounded-SAM-2/grounding_dino
pip install Grounded-SAM-2
pip install byte_track cython_bbox

(Optional) Installing ROS Humble System Dependencies without Docker

This section contains instructions to install ROS Humble and SORT3D-Nav system dependencies on a base Ubuntu 22.04 system. Please report any issues to the issue tracker.

  1. Begin by installing ros-humble-desktop, following the ROS wiki page.
  2. Install CUDA Toolkit 12.x following the instructions on the official website. This system has been tested with CUDA 12.1, but should work with higher CUDA versions.
  3. Install ROS Humble dependencies for the base autonomy system:
    sudo apt update
    sudo apt install libusb-dev ros-humble-perception-pcl ros-humble-sensor-msgs-py ros-humble-tf-transformations ros-humble-joy python3-colcon-common-extensions python-is-python3 
    pip install transforms3d pyyaml
  4. Install the pip dependencies for SORT3D-Nav. Make sure you are in this repo's top level directory:
    pip install -r requirements.txt
  5. Follow Section 3a or Section 3c to set up the system.

(Optional) Installing ROS Noetic System Dependencies without Docker

This section contains instructions to build ROS Noetic from source and SORT3D-Nav system dependencies on a base Ubuntu 22.04 system. Please report any issues to the issue tracker.

  1. As ROS Noetic does not support Ubuntu 22.04, it must be built from source. Follow the instructions in this Reddit post, mirrored in this repository.
  2. Install CUDA Toolkit 12.x following the instructions on the official website. This system has been tested with CUDA 12.1, but should work with higher CUDA versions.
  3. Install ROS Noetic dependencies for the base autonomy system:
    sudo apt update
    sudo apt install libusb-dev python-yaml python-is-python3
    
  4. Install the pip dependencies for SORT3D-Nav. Make sure you are in this repo's top level directory:
    pip install -r requirements.txt
  5. Follow Section 3b to set up the system.

SORT3D-Nav: Usage

Simulation with Ground Truth Semantics

The instructions for running the simulated system using ground truth semantics are the same regardless of which branch you are using. Check out the branch you wish to run.

SORT3D uses Mistral Large 2 by default. Create a free research API key, then replace the placeholder in scripts/run_full_system_gt_semantics.sh with your API key:

export MISTRAL_API_KEY="YOUR API KEY HERE"
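
If you would rather not edit the script by hand, a small sketch that patches the placeholder in place (this assumes the placeholder string in the script matches the line shown above):

# Prompt for the key without echoing it, then substitute it into the run script.
read -r -s -p "Mistral API key: " MISTRAL_KEY && echo
sed -i "s/YOUR API KEY HERE/${MISTRAL_KEY}/" scripts/run_full_system_gt_semantics.sh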

You may do the same with scripts/run_sort3d_navigation_gt_semantics.sh if you want to run SORT3D separately from the base autonomy system. Make sure all the scripts are executable:

chmod -R +x scripts 

Then, in one terminal, run

scripts/run_full_system_gt_semantics.sh

Wait until the system starts up. You should see the RViz and Unity windows open:

[Screenshots: RViz window, Unity window]

In your terminal, the captioning and language planner nodes will log to standard output.

In another terminal, run the query publisher node to take in from standard input:

scripts/run_query_publisher.sh

The output of the query publisher node should look like so:

[INFO] [1743652602.832061201] [language_publisher]: LanguagePublisher node has been started. Type your query below.
Enter a query to publish: 

You may then type a natural language navigation statement, like "go near the red chair", and watch the system navigate:

[Video: sort3d_gt_semantics_demo.mp4]

Simulation with Semantic Mapping Module

The instructions for running the simulated system with the semantic mapping module are the same regardless of which branch you are using. Check out the branch you wish to run.

Create a free research API key for Mistral Large 2, then replace the placeholder in scripts/run_full_system_semantic_mapping.sh with your API key:

export MISTRAL_API_KEY="YOUR API KEY HERE"

You may do the same with scripts/run_sort3d_navigation_semantic_mapping.sh if you want to run SORT3D separately from the base autonomy system. Make sure all the scripts are executable:

chmod -R +x scripts 

Then, in one terminal, run

scripts/run_full_system_semantic_mapping.sh

Wait until the system starts up. You should see the RViz and Unity windows open:

[Screenshots: RViz window, Unity window]

In your terminal, the semantic mapping and language planner nodes will log to standard output.

In another terminal, run the query publisher node to take in from standard input:

scripts/run_query_publisher.sh

The output of the query publisher node should look like so:

[INFO] [1743652602.832061201] [language_publisher]: LanguagePublisher node has been started. Type your query below.
Enter a query to publish: 

Afterwards, drive the robot around to create a semantic map of the scene, either using the virtual joystick or by clicking the "Waypoint with Heading" button and supplying waypoints:

[Video: semantic_mapping.mp4]

You may then type a natural language navigation statement, like "go to the potted plant furthest from you", and watch the system navigate:

[Video: sort3d_semantic_mapping_demo.mp4]

ROS Bag

We provide ROS bags recorded in various indoor environments to demonstrate SORT3D-Nav on real-world data. Follow the instructions above to download a ROS bag for either the mecanum-wheeled robot or the wheelchair-based robot. Again, make sure you have created a free research API key for Mistral Large 2, then replace the placeholder in scripts/run_sort3d_navigation_semantic_mapping.sh with your API key:

export MISTRAL_API_KEY="YOUR API KEY HERE"

Start by running the script for SORT3D-Nav with semantic mapping (the script is the same regardless of which branch you are using):

scripts/run_sort3d_navigation_semantic_mapping.sh

Run the Rviz viewer in a second terminal:

scripts/run_rviz_viewer.sh

In a third terminal, play the ROS bag you downloaded. If you are using ROS 1:

rosbag play [ros_bag].bag

If you are using ROS 2:

ros2 bag play [ros_bag].db3

Follow the on-screen instructions to pause/unpause the bag file. Play the bag for a while to build up the map; you can watch it being generated in the RViz window:

[Video: rosbag_demo.mp4]

To see the target bounding boxes for a query being generated, you may pause the ROS bag, then run the query publisher in a fourth terminal and provide a query:

scripts/run_query_publisher.sh

Troubleshooting

Please report any issues you face in the issue tracker, and we'll add them here.

Citation

If you use our work, please cite:

@misc{zantout2025sort3dspatialobjectcentricreasoning,
      title={SORT3D: Spatial Object-centric Reasoning Toolbox for Zero-Shot 3D Grounding Using Large Language Models}, 
      author={Nader Zantout and Haochen Zhang and Pujith Kachana and Jinkai Qiu and Ji Zhang and Wenshan Wang},
      year={2025},
      eprint={2504.18684},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.18684}, 
}
