Yash Bhalgat1* Vadim Tschernezki1,2* Iro Laina1 João F. Henriques1 Andrea Vedaldi1 Andrew Zisserman1
1 Visual Geometry Group, University of Oxford 2 NAVER LABS Europe
* Equal contribution
This repository contains the official implementation of 3D-Aware Instance Segmentation and Tracking in Egocentric Videos.
Our method leverages 3D awareness for robust instance segmentation and tracking in egocentric videos. It maintains consistent object identities through occlusions and out-of-view periods by integrating scene geometry with instance-level tracking. The figure above shows: (a) input egocentric video frames, (b) DEVA's 2D tracking, which loses object identity after occlusion, and (c) our method maintaining consistent tracking through these challenging scenarios.
Before running the code, you'll need to install several external dependencies.
We recommend creating a conda/mamba environment and then installing the remaining dependencies with the provided requirements.txt.
mamba create -n egoseg3d python=3.8
mamba activate egoseg3d
mamba install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
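As an optional sanity check (not part of the original setup), you can confirm that PyTorch was installed with CUDA support:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"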
Then continue with the installation of the following custom dependencies.
- Depth Anything: required for depth estimation. Clone https://github.com/LiheYoung/Depth-Anything, check out commit 1e1c8d373ae6383ef6490a5c2eb5ef29fd085993, and copy scripts/preprocessing/depth_anything_EPIC.py to the root of the cloned repository (see the sketch after this list).
- Tracking-Anything-with-DEVA (provided with this repository)
- MASA (provided with this repository)
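A minimal sketch of the Depth Anything setup, assuming it is cloned next to this repository:
git clone https://github.com/LiheYoung/Depth-Anything
cd Depth-Anything && git checkout 1e1c8d373ae6383ef6490a5c2eb5ef29fd085993 && cd ..
cp scripts/preprocessing/depth_anything_EPIC.py Depth-Anything/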
NOTE: If you want to skip the preprocessing (below) and start with the evaluation instead, you can jump directly to the evaluation instructions further down.
After downloading the EPIC-FIELDS datasets, a few preprocessing steps are required before running the tracking pipeline.
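The preprocessing and tracking commands below use a few shell variables. As an example (the data root is a placeholder; the video ID matches the one used in the evaluation below):
VID=P01_104                        # EPIC-KITCHENS video ID
PID=$(echo $VID | cut -d'_' -f1)   # participant ID, e.g. P01
ROOT=/path/to/data                 # root directory of the downloaded data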
First, extract the 3D mesh from the sparse point cloud:
bash scripts_sh/reconstruct_mesh.sh <VID_1> <VID_2> <VID_3> ...
Generate depth maps using Depth Anything:
cd Depth-Anything
python depth_anything_EPIC.py --img-path <images dir> --outdir <output dir>
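For example, with the variables defined above (the output directory here is only an illustrative choice, not a path required by later steps):
python depth_anything_EPIC.py --img-path $ROOT/mesh/$VID/images --outdir $ROOT/$PID/$VID/depth_anything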
Extract and align depth maps:
# Extract mesh depth
python scripts/preprocessing/extract_mesh_depth.py --vid=$VID --root $ROOT
# Align depth maps
python scripts/preprocessing/extract_aligned_depth.py --vid=$VID --root $ROOT
Run DEVA for segmentation:
SFACTOR=5
PID=$(echo $VID | cut -d'_' -f1)
python scripts/deva_baseline.py \
--img_path $ROOT/mesh/$VID/images \
--output $ROOT/$PID/$VID/segmaps/deva_OWLv2_s$SFACTOR \
--amp --temporal_setting semionline --prompt "" \
--DINO_THRESHOLD 0.4 --detector_type owlv2 \
--subsample_factor=$SFACTOR --classes=$ROOT/visor/${VID}_classes.pt
Extract DINO features:
python scripts/extract_features_DEVA.py \
--deva_seg_dir $ROOT/$PID/$VID/segmaps/deva_OWLv2_s$SFACTOR \
--images_dir $ROOT/mesh/$VID/images \
--output_dir <output directory> \
--feature_type dinov2
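For example, with the variables defined above (the output directory name is only illustrative):
python scripts/extract_features_DEVA.py \
--deva_seg_dir $ROOT/$PID/$VID/segmaps/deva_OWLv2_s$SFACTOR \
--images_dir $ROOT/mesh/$VID/images \
--output_dir $ROOT/$PID/$VID/features_dinov2 \
--feature_type dinov2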
Extend VISOR annotations using DEVA:
python scripts/deva_groundtruth.py \
--img_path /datasets/EPIC-KITCHENS/$VID/ \
--output /datasets/EPIC-KITCHENS/$VID/visor_DEVA100_segmaps/ \
--amp --temporal_setting online \
--gt_dir $ROOT/$PID/$VID/visor_segmaps/ \
--max_missed_detection_count 100 \
--prompt "dummy1.dummy2"
python scripts/preprocessing/postprocess_deva_gt.py --vid $VID
Run the main tracking pipeline:
python extract_tracks.py \
--beta_l=${BETAL} --beta_c=${BETAC} \
--beta_v=${BETAV} --beta_s=${BETAS} \
--vid=${VID} \
--exp=tracked-final-bv${BETAV}-bs${BETAS}-bc${BETAC}-bl${BETAL}
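For example, with VID set as above and the hyperparameter values for which we provide reference predictions (see the evaluation below):
BETAV=2
BETAS=10
BETAC=10000
BETAL=10
python extract_tracks.py \
--beta_l=${BETAL} --beta_c=${BETAC} \
--beta_v=${BETAV} --beta_s=${BETAS} \
--vid=${VID} \
--exp=tracked-final-bv${BETAV}-bs${BETAS}-bc${BETAC}-bl${BETAL}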
To verify the reproducibility of the results, we provide the tracking predictions here. You can download the predictions and evaluate them using the following script.
Evaluate OUR results:
# we provide predictions for the following hyperparameters and video
BETAV=2
BETAS=10
BETAC=10000
BETAL=10
VID=P01_104
python scripts/eval_deva.py \
--segment_type=tracked-final-bv${BETAV}-bs${BETAS}-bc${BETAC}-bl${BETAL} \
--gt_type=visor_DEVA100_segmaps \
--vid=${VID}
To evaluate the DEVA baseline, replace the segment_type with deva_OWLv2_s5.
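For example, to evaluate the DEVA baseline on the same video:
python scripts/eval_deva.py \
--segment_type=deva_OWLv2_s5 \
--gt_type=visor_DEVA100_segmaps \
--vid=${VID}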
If you find this work useful, please cite:
@InProceedings{Bhalgat24b,
author = "Yash Bhalgat and Vadim Tschernezki and Iro Laina and Joao F. Henriques and Andrea Vedaldi and Andrew Zisserman",
title = "3D-Aware Instance Segmentation and Tracking in Egocentric Videos",
booktitle = "Asian Conference on Computer Vision",
year = "2024",
organization = "IEEE",
}
This work was funded by EPSRC AIMS CDT EP/S024050/1 and AWS (Y. Bhalgat), NAVER LABS Europe (V. Tschernezki), ERC-CoG UNION 101001212 (A. Vedaldi and I. Laina), EPSRC VisualAI EP/T028572/1 (I. Laina, A. Vedaldi and A. Zisserman), and Royal Academy of Engineering RF\201819\18\163 (J. Henriques).