
Lost & Found: Tracking Changes from Egocentric Observations in 3D Dynamic Scene Graphs

¹ETH Zürich, ²Microsoft, ³Uni Bonn    *Equal supervision

teaser

Abstract

Recent approaches have successfully focused on the segmentation of static reconstructions, thereby equipping downstream applications with semantic 3D understanding. However, the world in which we live is dynamic, characterized by numerous interactions between the environment and humans or robotic agents. Static semantic maps are unable to capture this information, and the naive solution of rescanning the environment after every change is both costly and ineffective at tracking, e.g., objects being stored away in drawers. With Lost & Found we present an approach that addresses this limitation. Based solely on egocentric recordings with corresponding hand position and camera pose estimates, we are able to track the 6DoF poses of the moving object within the detected interaction interval. These changes are applied online to a transformable scene graph that captures object-level relations. Compared to state-of-the-art object pose trackers, our approach is more reliable in handling the challenging egocentric viewpoint and the lack of depth information. It outperforms the second-best approach by 34% and 56% for translational and orientational error, respectively, and produces visibly smoother 6DoF object trajectories. In addition, we illustrate how the acquired interaction information in the dynamic scene graph can be employed in the context of robotic applications that would otherwise be unfeasible: we show how our method allows a mobile manipulator to be commanded through teach & repeat, and how information about prior interaction allows a mobile manipulator to retrieve an object hidden in a drawer.

[Project Webpage] [Paper] [Teaser Video]

News 📰

  • April 22nd: Our paper has been accepted as a 4-page abstract to the Workshop on Computer Vision for Mixed Reality that is held in conjunction with CVPR 2025! More information here.
  • March 5th: We published the evaluation dataset. Have a look at Zenodo to reproduce our results or run your own pipeline.
  • February 4th: Our paper has been accepted to IEEE Robotics and Automation Letters (RA-L)! Check it out here.

Environment Setup 📝

  • Set up the conda environment
# create conda environment
conda create -n lost_found -c conda-forge python=3.10.12

# activate conda environment
conda activate lost_found

# install PyTorch for your respective architecture, tested with CUDA 11.7:
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.7 -c pytorch -c nvidia

# install and build dependencies for co-tracker
cd thirdparty/cotracker && pip install -e . && cd ../..

# install and build dependencies for hand-object-detector
cd thirdparty/detector/lib && python setup.py build develop && cd ../../..

# Install remaining dependencies in main repository
pip install -r requirements.txt

If problems arise with the thirdparty modules on your machine, have a look at the respective git repositories for more detailed installation guides: co-tracker and hand-object-detector.
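
After installation, a quick sanity check can confirm that PyTorch sees the GPU and that the CoTracker package resolves. This is a minimal sketch; the only assumption is that the editable install above registers the cotracker package (the hand-object detector builds in place and is used from thirdparty/ later).

# Quick environment check (run inside the activated lost_found environment).
import importlib.util

import torch

print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("cotracker importable:", importlib.util.find_spec("cotracker") is not None)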

Downloads 💧

  1. Download the pre-trained model weights for the hand-object detector (available here) and place them under the folder thirdparty/detector/models/res101_handobj_100K/pascal_voc:

    mkdir -p thirdparty/detector/models/res101_handobj_100K/pascal_voc
    cd thirdparty/detector/models/res101_handobj_100K/pascal_voc
    gdown https://drive.google.com/uc?id=1H2tWsZkS7tDF8q1-jdjx6V9XrK25EDbE
    cd ../../../../..
  2. Download the pre-trained CoTracker2 weights for the online version (available here) and place them under the folder thirdparty/cotracker/checkpoint (a quick check of both checkpoint locations is sketched after this list):

    mkdir thirdparty/cotracker/checkpoint
    cd thirdparty/cotracker/checkpoint
    wget https://huggingface.co/facebook/cotracker/resolve/main/cotracker2.pth
    cd ../../..
  3. [Optional] Download the demo data for a shoe scene and extract it under the folder demo_data/, together with the respective 3D scan for the demo under the folder Scan/.

  4. [Optional] Download the full Evaluation Dataset. Extract the dataset and the 3D scan into the Data folder.

  5. [Optional] There is an easy Docker setup available for the YOLO drawer-detection algorithm. Simply pull the Docker image from the hub (docker pull craiden/yolodrawer:v1.0), start the container (docker run -p 5004:5004 --gpus all -it craiden/yolodrawer:v1.0), and run the module (python3 app.py). You need to activate the respective flag for drawer detection in the preprocess_scan function and in the build of the respective scene graph, as mentioned in the demo section below. The functional elements for light switches are included in this repository as well; for the setup of that detection module, we refer to this work, which also nicely demonstrates how robotic agents profit from the proposed scene graph structure in the case of light switches.
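
To verify that the checkpoints from steps 1 and 2 ended up in the expected places, a small check like the following can help. The paths follow the download commands above; the exact .pth file names are not assumed.

# Optional: confirm that the downloaded checkpoints are where the code expects them.
from pathlib import Path

checkpoints = {
    "hand-object detector": Path("thirdparty/detector/models/res101_handobj_100K/pascal_voc"),
    "CoTracker2": Path("thirdparty/cotracker/checkpoint"),
}
for name, folder in checkpoints.items():
    weights = sorted(folder.glob("*.pth"))
    print(f"{name}: {', '.join(w.name for w in weights) if weights else 'no .pth file found'}")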

Run demo

If you have not downloaded the demo data and the detection modules yet, do so as described in the section above. The file run_demo.py contains an example that steps through the different options when creating a scene graph for that tracking sequence. Fill in the respective directories for the variables SCAN_DIR and ARIA_DATA at the beginning of the file.

In preprocess_scan, we have the option to run an additional drawer- or light-switch-detection algorithm on the scan. If we have done so, we can integrate those detections into the scene graph within its build function.

When creating the scene graph, we can set a minimum confidence threshold for objects that should be added to the graph, as well as a list of objects that we would like to mark as immovable throughout the tracking. The remove_category function proves useful when you want to get rid of certain object categories for better visualization. To actually visualize the graph, it is sufficient to call the corresponding visualize() function; the flags centroid, connections and labels toggle the visibility of these elements within the scene graph. For tracking, one can choose to create a video of the sequence by providing a corresponding path. A condensed sketch of this flow follows the command below.

python run_demo.py 
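
For orientation, here is a condensed sketch of the flow described above. The functions preprocess_scan, build, remove_category and visualize and the flags centroid, connections and labels are named in this README; the import path, class name, method names such as track, and the exact keyword arguments are assumptions and may differ from the actual run_demo.py.

# Condensed sketch of the demo flow; names marked below are hypothetical.
from scene_graph import SceneGraph, preprocess_scan   # hypothetical import path

SCAN_DIR = "Scan/"          # 3D scan of the scene
ARIA_DATA = "demo_data/"    # egocentric demo recording

# optionally run the drawer- / light-switch detectors on the scan first
preprocess_scan(SCAN_DIR, drawer_detection=False, light_switch_detection=False)

# minimum confidence for added objects and a list of categories treated as immovable
scene_graph = SceneGraph(min_confidence=0.2, immovable=["table", "shelf"])
scene_graph.build(SCAN_DIR, drawers=False, light_switches=False)
scene_graph.remove_category("curtain")   # drop a category for a cleaner visualization

# centroid, connections and labels toggle what is drawn in the scene graph
scene_graph.visualize(centroid=True, connections=True, labels=True)

# providing a path additionally renders a video of the tracked sequence
scene_graph.track(ARIA_DATA, video_path="tracking.mp4")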

Evaluation/Dataset

In this section, we report the results of our paper. To reproduce the results, download the dataset as mentioned above and place it in an appropriate location. With the two commands below, you can generate the 6DoF trajectories with corresponding timesteps for (i) the Head Pose baseline and (ii) Lost & Found:

# (i): Head Pose
python run_dataset.py --scan_dir Data/Final_Scan --data_dir Data/Final_Dataset --headpose --save_pose

# (ii): Lost & Found
python run_dataset.py --scan_dir Data/Final_Scan --data_dir Data/Final_Dataset --save_pose
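
The trajectories written with --save_pose can then be compared against ground truth. As a minimal sketch of the two reported metrics (mean translational and orientational error), assuming both trajectories are available as timestep-aligned 4x4 homogeneous poses; loading the files written by --save_pose is left out, since their exact layout is not documented here:

# Minimal sketch of the two error metrics over timestep-aligned 4x4 poses.
import numpy as np

def pose_errors(pred: np.ndarray, gt: np.ndarray):
    """pred, gt: (N, 4, 4) homogeneous object poses for the same timesteps."""
    t_err = np.linalg.norm(pred[:, :3, 3] - gt[:, :3, 3], axis=1)          # translation [m]
    rel = np.transpose(gt[:, :3, :3], (0, 2, 1)) @ pred[:, :3, :3]         # R_gt^T @ R_pred
    cos = np.clip((np.trace(rel, axis1=1, axis2=2) - 1.0) / 2.0, -1.0, 1.0)
    r_err = np.degrees(np.arccos(cos))                                     # rotation [deg]
    return float(t_err.mean()), float(r_err.mean())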

We used custom implementations of BundleTrack, BundleSDF and FoundationPose for the baseline comparison, in the sense that we introduced Metric3Dv2 for depth and SAM2 for mask generation. Please refer to the respective code bases for more detailed information.

We state the main findings of our approach compared to the baselines below. For more information, please refer to the paper.

Results table

Run Pipeline on your own data

This setup requires access to Aria glasses as part of the Aria Research Kit (ARK).

3D Scan

To run the pipeline on your own data, you need an initial 3D scan. In the scope of this project, we use the 3D Scanner App on iOS to capture the scene. Make sure the fiducial marker is visible during the scan for initialization. Once the scan is complete, click on Share and export two things:

  1. All Data contains a 3D mesh of the scene as well as all RGB images along with the camera poses for reconstruction.
  2. Point Cloud/PLY with the High Density setting enabled (keep the default of Z axis up disabled!).

Rename the point cloud to mesh.ply, place it within the extracted All Data folder, and give the folder a meaningful name. The folder should have the following structure:

Scan/
├── annotations.json
├── export.obj
├── export_refined.obj
├── frame_00000.jpg
├── frame_00000.json
├── ...
├── info.json
├── mesh.ply
├── textured_output.jpg
├── textured_output.mtl
├── textured_output.obj
├── thumb_00000.jpg                
└── world_map.arkit
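
A quick way to confirm the folder matches this layout before continuing (file names follow the tree above; the number of frames varies per scan):

# Optional: check the exported scan folder against the expected layout above.
from pathlib import Path

def check_scan(scan_dir: str = "Scan/") -> None:
    scan = Path(scan_dir)
    for name in ["mesh.ply", "export.obj", "info.json", "annotations.json"]:
        print(f"{name}: {'ok' if (scan / name).exists() else 'MISSING'}")
    frames = sorted(scan.glob("frame_*.jpg"))
    poses = sorted(scan.glob("frame_*.json"))
    print(f"{len(frames)} RGB frames, {len(poses)} camera-pose files")

check_scan()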

Mask3D

An initial semantic instance segmentation of the scene is required. As explained in the paper, we stick to Mask3D for this purpose. The setup is involved; hence, we cleanly separate it from the main project and use a provided Docker container for easy installation (credits to Rupal Saxena). This is reasonable, as the instance segmentation needs to be performed only once per scene. For a setup without Docker, refer to the original repository.

# clone this forked repository on your machine to a location $MASK3D_DIR
git clone https://github.com/behretj/Mask3D.git

# Download the model weights to the checkpoints folder
cd Mask3D/checkpoints && wget "https://zenodo.org/records/10422707/files/mask3d_scannet200_demo.ckpt" && cd ..

# pull the docker image
docker pull rupalsaxena/mask3d_docker:latest

# Run the docker container and mount the folder of your Mask3D repository
docker run --gpus all -it -v $HOME:/home -w $MASK3D_DIR/Mask3D rupalsaxena/mask3d_docker:latest

# Inside the docker container, run the following command with the workspace set to the 3D Scan directory ($SCAN) from the previous section
python mask3d.py --seed 42 --workspace $SCAN

# You might face permission issues later on, so allowing general access to the folder at this point is useful:
chmod -R 777 $SCAN

The scan directory now contains an extra labeled point cloud and corresponding predictions in addition to the prior content:

Scan/
├── pred_mask/
│   ├── 000.txt
│   ├── ...
├── ...         
├── mesh_labeled.ply
└── predictions.txt
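
To inspect these outputs, something like the following works. Reading the point cloud requires open3d (pip install open3d), and the per-line layout of predictions.txt is assumed to be mask path, label id, confidence (the common ScanNet-style export); adjust the parsing if your file differs.

# Peek at the Mask3D outputs (format of predictions.txt is an assumption).
from pathlib import Path

import open3d as o3d

scan = Path("Scan/")
pcd = o3d.io.read_point_cloud(str(scan / "mesh_labeled.ply"))
print(f"labeled point cloud with {len(pcd.points)} points")

for line in (scan / "predictions.txt").read_text().splitlines():
    parts = line.split()
    if len(parts) >= 3:
        mask_file, label_id, confidence = parts[:3]
        print(f"{mask_file}: label {label_id}, confidence {confidence}")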

Aria Recording

Data should be collected in recording profile 15 with the Aria glasses, capturing RGB video at 30 FPS and a resolution of 1408x1408 pixels. As with the initial 3D scan, the fiducial marker needs to be visible once during the recording. Each recording produces two files; rename both of them with the same meaningful name. All recordings from one scene should be placed in the same folder for easier processing.

Data/
├── {recording_1}.vrs
├── {recording_1}.vrs.json
├── ...         
├── {recording_n}.vrs
└── {recording_n}.vrs.json

Calling preprocessing.sh Data/ and entering your user name and password when prompted requests the Machine Perception Services offered within the Meta API. Make sure that you run the script within the environment you set up for this project. This generates folders with sparse point clouds, device poses, and estimated hand positions.
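
Before running the script, a quick check that every recording comes with its metadata file can save a failed request (the .vrs.json name simply appends .json to the recording name, as in the layout above):

# Optional: verify that each .vrs recording has a matching .vrs.json metadata file.
from pathlib import Path

data = Path("Data/")
for vrs in sorted(data.glob("*.vrs")):
    meta = vrs.parent / (vrs.name + ".json")   # recording_1.vrs -> recording_1.vrs.json
    print(f"{vrs.name}: {'ok' if meta.exists() else 'missing .vrs.json'}")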

Object tracking and Pose estimation

After setting the correct paths at the top of the file, you can start the tracking process: python run_custom.py. The different visualization options as well as the YOLO extension module work in the same way as explained in the demo section.

License

This project is licensed under the MIT License.

Please note that this repository may include code or components that are licensed under different open-source licenses. The respective licenses and copyrights of these components apply.

We do not offer any warranty or assume liability for the correctness, legality, or suitability of third-party components.

Citing Lost & Found

@ARTICLE{LostFoundBehrens,
    author={Behrens, Tjark and Zurbrügg, René and Pollefeys, Marc and Bauer, Zuria and Blum, Hermann},
    journal={IEEE Robotics and Automation Letters}, 
    title={{Lost & Found}: Tracking Changes from Egocentric Observations in 3D Dynamic Scene Graphs}, 
    year={2025},
    volume={},
    number={},
    pages={1-8},
    keywords={Visual Tracking;Mapping;Semantic Scene Understanding},
    doi={10.1109/LRA.2025.3544518}
}
