Yash Bhalgat1* Vadim Tschernezki1,2* Iro Laina1 João F. Henriques1 Andrea Vedaldi1 Andrew Zisserman1
1 Visual Geometry Group, University of Oxford 2 NAVER LABS Europe
* Equal contribution
This repository contains the official implementation of 3D-Aware Instance Segmentation and Tracking in Egocentric Videos.
Our method leverages 3D awareness for robust instance segmentation and tracking in egocentric videos. It maintains consistent object identities through occlusions and out-of-view periods by integrating scene geometry with instance-level tracking. The figure above shows: (a) input egocentric video frames, (b) DEVA's 2D tracking, which loses object identity after occlusion, and (c) our method maintaining consistent tracking through these challenging scenarios.
Before running the code, you'll need to install several external dependencies.
We recommend creating a conda/mamba environment and then installing the remaining dependencies with the provided requirements.txt.
mamba create -n egoseg3d python=3.8
mamba activate egoseg3d
mamba install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
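As an optional sanity check (not part of the original setup), you can confirm that PyTorch was installed with CUDA support:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"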
Then continue with the installation of the following custom dependencies.
- Depth Anything: required for depth estimation. Clone https://github.com/LiheYoung/Depth-Anything, check out commit 1e1c8d373ae6383ef6490a5c2eb5ef29fd085993, and copy scripts/preprocessing/depth_anything_EPIC.py to the root of the cloned repository (see the sketch after this list).
- Tracking-Anything-with-DEVA (provided with this repository)
- MASA (provided with this repository)
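A minimal sketch of the Depth Anything setup, assuming it is cloned next to this repository:
git clone https://github.com/LiheYoung/Depth-Anything
cd Depth-Anything && git checkout 1e1c8d373ae6383ef6490a5c2eb5ef29fd085993 && cd ..
cp scripts/preprocessing/depth_anything_EPIC.py Depth-Anything/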
NOTE: If you want to skip the preprocessing (below) and start with the evaluation instead, you can jump directly to the evaluation instructions further down.
After downloading the EPIC-FIELDS datasets, a few preprocessing steps are required before running the tracking pipeline.
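The preprocessing and tracking commands below use a few shell variables. As an example (the data root is a placeholder; the video ID matches the one used in the evaluation below):
VID=P01_104                        # EPIC-KITCHENS video ID
PID=$(echo $VID | cut -d'_' -f1)   # participant ID, e.g. P01
ROOT=/path/to/data                 # root directory of the downloaded data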
First, extract the 3D mesh from the sparse point cloud:
bash scripts_sh/reconstruct_mesh.sh <VID_1> <VID_2> <VID_3> ...
Generate depth maps using Depth Anything:
cd Depth-Anything
python depth_anything_EPIC.py --img-path <images dir> --outdir <output dir>
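For example, with the variables defined above (the output directory here is only an illustrative choice, not a path required by later steps):
python depth_anything_EPIC.py --img-path $ROOT/mesh/$VID/images --outdir $ROOT/$PID/$VID/depth_anything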
Extract and align depth maps:
# Extract mesh depth
python scripts/preprocessing/extract_mesh_depth.py --vid=$VID --root $ROOT
# Align depth maps
python scripts/preprocessing/extract_aligned_depth.py --vid=$VID --root $ROOT
Run DEVA for segmentation:
SFACTOR=5
PID=$(echo $VID | cut -d'_' -f1)
python scripts/deva_baseline.py \
--img_path $ROOT/mesh/$VID/images \
--output $ROOT/$PID/$VID/segmaps/deva_OWLv2_s$SFACTOR \
--amp --temporal_setting semionline --prompt "" \
--DINO_THRESHOLD 0.4 --detector_type owlv2 \
--subsample_factor=$SFACTOR --classes=$ROOT/visor/${VID}_classes.pt
Extract DINO features:
python scripts/extract_features_DEVA.py \
--deva_seg_dir $ROOT/$PID/$VID/segmaps/deva_OWLv2_s$SFACTOR \
--images_dir $ROOT/mesh/$VID/images \
--output_dir <output directory> \
--feature_type dinov2
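For example, with the variables defined above (the output directory name is only illustrative):
python scripts/extract_features_DEVA.py \
--deva_seg_dir $ROOT/$PID/$VID/segmaps/deva_OWLv2_s$SFACTOR \
--images_dir $ROOT/mesh/$VID/images \
--output_dir $ROOT/$PID/$VID/features_dinov2 \
--feature_type dinov2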
Extend VISOR annotations using DEVA:
python scripts/deva_groundtruth.py \
--img_path /datasets/EPIC-KITCHENS/$VID/ \
--output /datasets/EPIC-KITCHENS/$VID/visor_DEVA100_segmaps/ \
--amp --temporal_setting online \
--gt_dir $ROOT/$PID/$VID/visor_segmaps/ \
--max_missed_detection_count 100 \
--prompt "dummy1.dummy2"
python scripts/preprocessing/postprocess_deva_gt.py --vid $VID
Run the main tracking pipeline:
python extract_tracks.py \
--beta_l=${BETAL} --beta_c=${BETAC} \
--beta_v=${BETAV} --beta_s=${BETAS} \
--vid=${VID} \
--exp=tracked-final-bv${BETAV}-bs${BETAS}-bc${BETAC}-bl${BETAL}
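For example, with VID set as above and the hyperparameter values for which we provide reference predictions (see the evaluation below):
BETAV=2
BETAS=10
BETAC=10000
BETAL=10
python extract_tracks.py \
--beta_l=${BETAL} --beta_c=${BETAC} \
--beta_v=${BETAV} --beta_s=${BETAS} \
--vid=${VID} \
--exp=tracked-final-bv${BETAV}-bs${BETAS}-bc${BETAC}-bl${BETAL}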
To verify the reproducibility of the results, we provide the tracking predictions here. You can download the predictions and evaluate them using the following script.
Evaluate OUR results:
# we provide predictions for the following hyperparameters and video
BETAV=2
BETAS=10
BETAC=10000
BETAL=10
VID=P01_104
python scripts/eval_deva.py \
--segment_type=tracked-final-bv${BETAV}-bs${BETAS}-bc${BETAC}-bl${BETAL} \
--gt_type=visor_DEVA100_segmaps \
--vid=${VID}
To evaluate the DEVA baseline, replace the segment_type with deva_OWLv2_s5.
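For example, to evaluate the DEVA baseline on the same video:
python scripts/eval_deva.py \
--segment_type=deva_OWLv2_s5 \
--gt_type=visor_DEVA100_segmaps \
--vid=${VID}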
If you find this work useful, please cite:
@InProceedings{Bhalgat24b,
author = "Yash Bhalgat and Vadim Tschernezki and Iro Laina and Joao F. Henriques and Andrea Vedaldi and Andrew Zisserman",
title = "3D-Aware Instance Segmentation and Tracking in Egocentric Videos",
booktitle = "Asian Conference on Computer Vision",
year = "2024",
organization = "IEEE",
}
This work was funded by EPSRC AIMS CDT EP/S024050/1 and AWS (Y. Bhalgat), NAVER LABS Europe (V. Tschernezki), ERC-CoG UNION 101001212 (A. Vedaldi and I. Laina), EPSRC VisualAI EP/T028572/1 (I. Laina, A. Vedaldi and A. Zisserman), and Royal Academy of Engineering RF\201819\18\163 (J. Henriques).