This repository provides the code for the technical report [Arxiv], and also serves as a standalone suite for probing and evaluating future methods in interactive segmentation (IS).
The iSegProbe repository includes:
- Pipelines for training and evaluating interactive segmentation models, specifically adapted for probing individual model components (`train.py`, `evaluate.py`)
- Implementations of vision backbones, such as ViT, MaskCLIP, and DINOv2, tailored for the interactive segmentation task (`core.model.featurizers`)
- Implementations of multiple feature upsamplers, including LiFT, FeatUp, and LoftUp (`core.model.upsamplers`)
- Support for major IS datasets: GrabCut, DAVIS, SBD, Berkeley, COCO+LVIS, and more (`core.data`)
- Visualization utilities for plotting predictions and features, as well as recreating plots from the report
Developed and tested on Python 3.9, PyTorch 2.4.1, CUDA 12.4, and Ubuntu 20.04. To install the required dependencies, run:
pip install -r requirements.txt
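To quickly verify that your environment roughly matches the tested setup, you can check the installed versions, e.g.:

```python
# Check that the installed versions roughly match the tested setup
# (Python 3.9, PyTorch 2.4.1, CUDA 12.4).
import sys

import torch

print("Python:", sys.version.split()[0])
print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
```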
Download the dataset(s) relevant to your use case and specify the corresponding paths in `configs/main_cfg.yaml`.
📌 Note: Our experiments were conducted on SBD (train) and GrabCut, DAVIS, Berkeley and SBD (test). However, other datasets are fully supported and can be used with minimal effort.
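For a quick check that the configured paths actually exist on disk, a small sketch like the one below can help. It assumes `configs/main_cfg.yaml` is plain, flat YAML (nested entries or Hydra interpolations would need a small adjustment), and the path-detection heuristic is purely illustrative.

```python
# Inspect which entries in configs/main_cfg.yaml look like paths and whether they exist.
# Assumes a flat YAML file; key names are whatever the config actually defines.
from pathlib import Path

import yaml

with open("configs/main_cfg.yaml") as f:
    cfg = yaml.safe_load(f)

for key, value in cfg.items():
    if isinstance(value, str) and ("/" in value or "\\" in value):  # crude "is a path" heuristic
        status = "OK" if Path(value).expanduser().exists() else "MISSING"
        print(f"{key}: {value} [{status}]")
```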
Dataset | Description | Download Link |
---|---|---|
ADE20k | 22k images with 434k instances (total) | official site |
OpenImages | 944k images with 2.6M instances (total) | official site |
MS COCO | 118k images with 1.2M instances (train) | official site |
LVIS v1.0 | 100k images with 1.2M instances (total) | official site |
COCO+LVIS* | 99k images with 1.5M instances (train) | original LVIS images + combined annotations |
SBD | 8498 images with 20172 instances (train), 2857 images with 6671 instances (test) | official site |
GrabCut | 50 images with one object each (test) | GrabCut.zip (11 MB) |
Berkeley | 96 images with 100 instances (test) | Berkeley.zip (7 MB) |
DAVIS | 345 images with one object each (test) | DAVIS.zip (43 MB) |
Pascal VOC | 1449 images with 3417 instances (validation) | official site |
COCO_MVal | 800 images with 800 instances (test) | COCO_MVal.zip (127 MB) |
(*) - To prepare COCO+LVIS, first download the original LVIS v1.0 dataset. Then, download and unpack the pre-processed annotations provided by the RITM team, which combine COCO and LVIS. Place the annotations in the same folder as LVIS v1.0.
For an extended list of supported datasets, refer to the SimpleClick dataset collection: [link]
Download the upsampler weights and specify the corresponding paths in `configs/main_cfg.yaml`:
- LoftUp (DINOv2 S/14): [Google Drive Link]
- LiFT (DINOv2 S/14): [Google Drive Link]
For additional trained upsamplers, refer to the LoftUp repository: [link]
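As a quick sanity check after downloading, the weight files can be opened with plain PyTorch. This is only a sketch: the file name below is a placeholder, and the exact checkpoint structure depends on the upsampler.

```python
# Verify that a downloaded upsampler checkpoint is readable.
# The path is a placeholder; point it to the file you downloaded.
import torch

ckpt = torch.load("/path/to/upsampler_weights.pth", map_location="cpu")
if isinstance(ckpt, dict):
    print(f"Loaded a dict checkpoint with {len(ckpt)} top-level entries, e.g. {list(ckpt)[:5]}")
else:
    print(f"Loaded an object of type {type(ckpt).__name__}")
```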
Evaluation of the vision foundation model (and feature upsampler) involves two separate stages: (1) training the interactive segmentation model, and (2) performing the actual evaluation.
General training configurations are specified in `configs/train_cfg.yaml`. For a detailed explanation of the parameters, please refer directly to that file. Each training experiment (comprising the IS model, datasets, and other components) should be defined in a separate Python file, which is then referenced from `train_cfg.yaml`. Examples of such files can be found in the `models/` directory.
To launch the training process, you can either modify `train_cfg.yaml` accordingly and run:
python train.py
Or override specific arguments directly from the CLI using Hydra syntax, for example:
python train.py +exp.name=my_name +exp.model_path=/path/to/my/model
General evaluation configurations are specified in `configs/eval_cfg.yaml`. For a detailed explanation of the parameters, please refer directly to that file.
To launch the evaluation process, you can either modify `eval_cfg.yaml` accordingly and run:
python evaluate.py
Or override specific arguments directly from the CLI using Hydra syntax, for example:
python evaluate.py +checkpoint=/path/to/checkpoints +datasets=GrabCut,Berkeley,SBD,DAVIS
- Training logs can be visualized with TensorBoard and Weights & Biases. To enable TensorBoard, locate the folders with experiment outputs (this can also be a root folder containing multiple runs) and run:

  tensorboard --logdir=PATH_TO_LOG_DIR --port=6006

  To enable logging to W&B, set `wandb.log_wandb=true` in `train_cfg.yaml`.
- Separate Weights & Biases evaluation logging is available by setting `wandb=true` in `eval_cfg.yaml`.
To launch the Tkinter-based interactive demo, run:
python demo.py --checkpoint /path/to/ckpts
Demo Controls:
Key | Description |
---|---|
Left Mouse Button | Place a positive click |
Right Mouse Button | Place a negative click |
Scroll Wheel | Zoom the image in and out |
Right Mouse Button + Move Mouse | Move the image |
Space | Finish the current object mask |
- Some test images can be found in the `assets/test_imgs` folder.
- For a more detailed description of the demo parameters and functionality, refer to the RITM codebase.
- When launching the demo from a remote machine, you may need to have X11 (or XQuartz) installed and running on your local machine with proper X11 forwarding.
- If the demo exits incorrectly, the process might not terminate properly, leading to the following error on the next launch:
free(): invalid pointer
To resolve this, kill the demo process by running:
pkill -9 -f demo.py
In the `eval_cfg.yaml` file, the `vis_preds` flag is responsible for visualizing the model's predictions, while the `save_feats` flag controls whether raw features before and after the upsampler are saved. These saved features can be further visualized using the script `core.plots.plot_features.py`. Additionally, the script `core.plots.plot_iou_vs_clicks.py` can be used to compare the mean Intersection over Union (mIoU) as a function of the number of clicks.
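Outside of these scripts, a saved feature map can also be inspected ad hoc; the sketch below projects it to three channels with PCA for a quick visual check. It assumes the features are stored as a `torch`-loadable tensor of shape (C, H, W) (possibly with a leading batch dimension); the file path is a placeholder, and the `core.plots` scripts remain the recommended way to reproduce the report's figures.

```python
# Quick PCA-based RGB visualization of a saved feature map (assumed C x H x W).
# The file path is a placeholder; adjust it to wherever save_feats wrote the features.
import matplotlib.pyplot as plt
import torch

feats = torch.load("/path/to/saved_feats.pt", map_location="cpu").squeeze().float()
c, h, w = feats.shape
flat = feats.reshape(c, -1).T                          # (H*W, C)
flat = flat - flat.mean(dim=0, keepdim=True)           # center before PCA

# Project onto the top-3 principal components and map them to RGB.
_, _, v = torch.pca_lowrank(flat, q=3, center=False)   # v: (C, 3)
rgb = (flat @ v).reshape(h, w, 3)
rgb = (rgb - rgb.min()) / (rgb.max() - rgb.min() + 1e-8)

plt.imshow(rgb.numpy())
plt.axis("off")
plt.title("PCA projection of saved features")
plt.show()
```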
If you find this repository useful, please cite our papers:
@misc{huang2025loftuplearningcoordinatebasedfeature,
title={LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models},
author={Haiwen Huang and Anpei Chen and Volodymyr Havrylov and Andreas Geiger and Dan Zhang},
year={2025},
eprint={2504.14032},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2504.14032},
}
@misc{havrylov2025benchmarking,
title={Benchmarking Feature Upsampling Methods for Vision Foundation Models using Interactive Segmentation},
author={Volodymyr Havrylov and Haiwen Huang and Dan Zhang and Andreas Geiger},
year={2025},
eprint={2505.02075},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.02075},
}
This repository is based on SimpleClick and RITM, with most of the featurizers code adapted from FeatUp. We thank the authors of these open-source projects for their valuable contributions.