This project was developed for the TU Delft DSAIT4050 Information Retrieval course (2025). It provides a comprehensive analysis of several Vision-Language Models (VLMs) on cross-modal document retrieval tasks using the ViDoRe benchmark. Our work extends the original benchmark with additional retrievers and substantial code modifications to enable a more thorough performance evaluation.
- Implementation of several custom retrievers to improve evaluation capabilities
- Modification of the benchmark source code to support image segmentation as a preprocessing step
- Significant performance improvements in retrieval tasks through image partitioning techniques
- Comprehensive evaluation and analysis of different retrieval approaches
This project is a fork of illuin-tech/vidore-benchmark. We thank the original authors for their valuable work.
In this fork, we have added custom retrievers for specific testing purposes. Their implementations can be found in the `src/vidore_benchmark/retrievers` directory. These new retrievers leverage advanced techniques to improve retrieval performance over the baseline models, especially when combined with our image segmentation preprocessing.
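As a rough illustration of the shape such a retriever can take, the sketch below wraps a CLIP-style bi-encoder that embeds queries and page images and scores them by cosine similarity. The class and method names here are our own and do not reflect the benchmark's actual retriever base class; the sketch assumes a Hugging Face CLIP-style model that exposes `get_text_features` / `get_image_features`.

```python
import torch


class CustomClipRetriever:
    """Illustrative bi-encoder retriever: embed queries and page images, score by cosine similarity.

    This only mirrors the general structure of a retriever; it is NOT the
    vidore-benchmark base-class interface.
    """

    def __init__(self, model, processor, device: str = "cuda"):
        self.model = model.to(device).eval()
        self.processor = processor
        self.device = device

    @torch.no_grad()
    def embed_queries(self, queries: list[str]) -> torch.Tensor:
        inputs = self.processor(text=queries, return_tensors="pt", padding=True).to(self.device)
        return torch.nn.functional.normalize(self.model.get_text_features(**inputs), dim=-1)

    @torch.no_grad()
    def embed_images(self, images) -> torch.Tensor:
        inputs = self.processor(images=images, return_tensors="pt").to(self.device)
        return torch.nn.functional.normalize(self.model.get_image_features(**inputs), dim=-1)

    def get_scores(self, query_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between every query and every page image (embeddings are normalized).
        return query_emb @ image_emb.T


# Hypothetical usage with a Hugging Face CLIP checkpoint:
# from transformers import CLIPModel, CLIPProcessor
# model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
# processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
# retriever = CustomClipRetriever(model, processor, device="cpu")
```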
This fork includes image segmentation functionality that improves retrieval performance by dividing each image into smaller segments for more granular matching. The following parameters control image segmentation during evaluation (a sketch of the underlying idea follows the list):
- `--image-segmentation`: Activates the image segmentation pipeline (boolean flag)
- `--grid-rows`: Number of rows for dividing the image (default: 2)
- `--grid-cols`: Number of columns for dividing the image (default: 2)
- `--overlap`: Overlap percentage between segments, ranging from 0.0 to 0.5 (default: 0.0)
- `--combine-method`: Method to combine segment scores (default: `max`)
  - `max`: Uses the maximum score across all segments of an image
  - `avg`: Uses the average score across all segments
  - `sum`: Sums up scores from all segments
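To make the segmentation step concrete, the sketch below shows one way an image can be split into an overlapping grid and the per-segment scores combined. It is a minimal illustration under our own assumptions (Pillow for cropping, a placeholder `score_fn`), not the fork's exact implementation.

```python
from PIL import Image


def segment_image(image: Image.Image, grid_rows: int = 2, grid_cols: int = 2,
                  overlap: float = 0.0) -> list[Image.Image]:
    """Split an image into a grid of (possibly overlapping) segments."""
    width, height = image.size
    seg_w, seg_h = width / grid_cols, height / grid_rows
    # Each segment is enlarged by `overlap` (a fraction of its size) on every side.
    pad_w, pad_h = seg_w * overlap, seg_h * overlap
    segments = []
    for row in range(grid_rows):
        for col in range(grid_cols):
            left = max(0.0, col * seg_w - pad_w)
            top = max(0.0, row * seg_h - pad_h)
            right = min(float(width), (col + 1) * seg_w + pad_w)
            bottom = min(float(height), (row + 1) * seg_h + pad_h)
            segments.append(image.crop((int(left), int(top), int(right), int(bottom))))
    return segments


def combine_scores(scores: list[float], method: str = "max") -> float:
    """Collapse per-segment scores into a single image-level score."""
    if method == "max":
        return max(scores)
    if method == "avg":
        return sum(scores) / len(scores)
    if method == "sum":
        return sum(scores)
    raise ValueError(f"Unknown combine method: {method}")


# Hypothetical usage with some query/segment scoring function `score_fn`:
# segments = segment_image(page_image, grid_rows=2, grid_cols=2, overlap=0.0)
# image_score = combine_scores([score_fn(query, seg) for seg in segments], method="max")
```

Note that scoring every segment multiplies the number of embeddings per page by `grid_rows × grid_cols`, so larger grids trade extra compute for finer-grained matching.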
Example usage:
```bash
python -m vidore_benchmark.cli.main evaluate-retriever \
    --model-class jina-clip-v1 \
    --model-name jinaai/jina-clip-v1 \
    --image-segmentation \
    --grid-rows 2 \
    --grid-cols 2 \
    --combine-method max \
    --dataset-name vidore/arxivqa_test_subsampled \
    --dataset-format qa \
    --split test \
    --output-dir segmentedOutput/
    # Optional: --overlap 0 (defaults to 0.0)
```
When image segmentation is enabled, the model ID in the evaluation results includes a suffix indicating the segmentation configuration (e.g., `YourModel_seg_3x3_overlap_20` for a 3×3 grid with 20% overlap).
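For reference, a suffix of this form could be constructed roughly as follows; the helper below is hypothetical and only illustrates how the grid size and overlap percentage map into the model ID.

```python
def segmented_model_id(model_id: str, grid_rows: int, grid_cols: int, overlap: float) -> str:
    """Append an illustrative segmentation suffix such as `_seg_3x3_overlap_20` to a model ID."""
    return f"{model_id}_seg_{grid_rows}x{grid_cols}_overlap_{int(overlap * 100)}"


# segmented_model_id("YourModel", 3, 3, 0.2) -> "YourModel_seg_3x3_overlap_20"
```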
You can run this benchmark in Google Colab using the following link: Run in Colab
See the `results` folder.