feat: implement indexes, tests, demos and utilities for IMI-GPU #4301

AlaoPrado · 2025-04-19T22:53:18Z

Hi Folks!

In the last few years, I have been working on a project with some guys from the Department of Computer Science at UFMG (DCC/UFMG) and associates to implement and validate the Inverted Multi-Index for GPU (IMI-GPU). We recently published our work, "IMI-GPU: Inverted multi-index for billion-scale approximate nearest neighbor search with GPUs" in the Journal of Parallel and Distributed Computing, and now I come with the implemented code to contribute back to you.

So, this PR contains all the changes I considered could be relevant after merging with the original code, solving the conflicts, aligning few stuff and making a squash commit. Below, I list the authors of the project, the links to the publication, the summary, architectural decisions, limitation, of the changes.

Please let me know what are your thoughts for merging this. I have tried to make things as modular as possible without being invasive while working in parallel on a different repository, but it has been a time since the initial fork and I assume there might something to be improved and new good practices to be applied. But it would be great if can have this deployed without major changes and open new issues for future big improvements/refactoring. I am looking forward to find the best way merge this back to the original code :)

Authors of the Project

Alan Araujo (Me)
Willian Barreiros Jr.
Jun Kong
Renato Ferreira
George Teodoro

Publication

Peer-Reviewed: https://doi.org/10.1016/j.jpdc.2025.105066
Preprint: https://ssrn.com/abstract=4839994

Summary
This code implements a new GpuIndexIMIPQ class, the main C++ API for initializing and accessing the IMI-GPU. Along with it, there is also GpuIndexIMI, its abstract parent class, and the GpuMultiIndex2, the IMI coarse quantizer with two codebooks. The GPU implementation of IMI-GPU and its coarse quantizer are repesctively IMIPQ and MultiIndex2. In addition to the indexes, the multi-sequence algorithm, a central part of this solution can be found in the faiss/gpu/utils/MultiSequence files. Finally, the search steps of the original IVFPQ were adapted and new variations were implemented in theirs respective files for IMIPQ. We have tests and demos for the main new indexes and algorithms in faiss/gpu/test and faiss/gpu/demos. These demos contain the main code to reproduce our main experiments and few extra features. This includes the capability to run both IMIPQ and IVFPQ in CPU and GPU with C++, with multiple index shards/replicas and processes with OpenMP and MPI. For our experiments, we used custom bash scripts with personal environment variables, but I extracted the general examples of use and put them into faiss/gpu/demos/scripts.

Architectural Decisions
As I commented, the idea was to not be invasive. In this way, we could be fair about the comparisons between IMI-GPU and IVFPQ. This resulted in some duplication for the adapted flow from IVFPQ, For example, a possible main point of concern is the code duplicated between GpuIndexIMI and GpuIndex. Since x is always a continuous list of d-dimensional vectors, and IMI-GPU must work on x split into two sets of (d/2)-dimension sub-vectors, we decided to put put the split step into the search function and replicated it. But this could be improved later as by something as calling a searchPreImpl_ that would call the current searchImpl_ function. Anyway, we could wait the review for more details. Furthermore, you can find the distance computation in the inverted list phase duplicated for IMIPQ. Because it wasn’t so trivial to abstract the Multi-Index as a coarse quantizer for IVFPQ since we expect different output from it, and a different way of managing the precomputed codes when iterating over the inverted lists.

Limitations

This implements only the C++ API, it still lacks the Python API.
IMIPQ is currently indexing up to 2^32 inverted list, because the it is assuming a short (16 bits) ids for each centroid in the codebooks of the coarse quantizer, or short2 (32 bits) pair id when considering the id of the virtual centroid. But the coarse quantizer already supports working with idx_t (orint after replicating existing code changed at some point to work only with idx_t) and int2 inputs and outputs. For finishing the overall support, the multi-sequence algorithm must be reviewed, and the distance computation functions for the inverted lists must be adapted.
Only the L2 metric and 32-bit floating point distances are supported for IMIPQ.

Thanks, and Kind Regards.

facebook-github-bot · 2025-04-19T22:53:25Z

Hi @AlaoPrado!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

facebook-github-bot · 2025-04-20T00:08:09Z

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

mdouze · 2025-04-22T07:09:54Z

Hi @AlaoPrado

Thanks for this remarkable effort!

Historically, we did not prioritize porting the IMI to GPU because a HNSW coarse quantizer offers better speed-accuracy tradeoffs (on CPU). Therefore, the options in the current Faiss are:

use a flat coarse quantizer
decrease the number of centroids, which slows down the scanning phase
use a CPU HNSW
use the Cagra graph-based implementation (with CuVS enabled)
so it could be that your GPU implementation reaches operating points that are not available before.

I skimmed through the paper that has a comparison between IVFPQ with flat coarse quantizer and your approach. However it lacks (1) an evaluation with several numbers of centroids (2) a comparison with options 2-4 above.

One question: on CPU the most efficient operating points for IMI are obtained by limiting the number of distances to compute (max_codes) rather than the number of centroids to visit (nprobe). This is because with IMI the inverted lists are severely imbalanced. Did you also observe this on GPU?

Regarding the code itself, at first glance it seems that changing/adding 100 files is too much. The demos and benchmarks would be much shorter if written in Python. There is code that seems generic across different index types (eg. IMIAppend), so would need a stronger integration in Faiss.

AlaoPrado · 2025-04-23T02:17:55Z

Hi @mdouze,

Thanks for the review. I could enter in more details later, but the following brief comments and images might help on what you said.

In our work, we focused on comparing the IVFPQ (or IVFADC) and IMIPQ (or Multi-D-ADC) on GPU. Personally, I would say that, if we consider IVFPQ GPU has shown its efficiency at billion-scale, a new index would be a valuable competitor if it shows better results for this scenario. We have compared IVFPQ and IMIPQ on GPU using three different configurations for the number of centroids, always setting equivalent or lower memory usage for IMIPQ's structure, see the images below for when compressing the vector into 8 codes (m = 8), 10,000 queries, and SIFT1B:

Note: QPS means Query Per Second.

There, the number right next to the index is the number of centroids per codebook, so note IMIPQ has the quadratic number of inverted lists.

About your question: we haven't explored implementing this condition for stopping earlier the iteration over the inverted list. What we noticed was that the maximum inverted list length was relatively near each other between configuration, as the following:

Multi-D-ADC 9742	Multi-D-ADC 11876	Multi-D-ADC 15560
181,881	99,520	100,514

IVFADC 65535	IVFADC 126491	IVFADC 252982
168,669	81,275	73,432

Regarding the code, ok. I think we can drastically reduce the number of files if we remove the bash scripts for testing. For the other part of the code, let me check if it is possible to turn them into a unified template.

Do you think this data helped contribute to the discussion?

Thanks, and Kind Regards.

… folder since they aren't strictly necessary

AlaoPrado · 2025-05-17T21:51:42Z

Hi,

Sorry for the delay. I had to pause for a while and focus on other personal priorities. But I am going to check if there is a refactor that could be done for decreasing the number of changes. In the meanwhile, I noticed the action was failing due to a missing MPI package in the runner and I expect to have it fixed now. Additionally, I removed the scripts folder from the faiss/gpu/demos for decreasing the number of files changed. I hope it helps.

The action was failing, because the runner was missing a MPI library required by a new demo for IVFPQ and IMIPQ on multiple GPUs. I just added a conditional, so the demo isn't compiled if the library isn't found.

Thanks, and Kind Regards,
Alan

…ss_imipq

feat: implement indexes, tests, demos and utilities for IMI-GPU

11dc452

facebook-github-bot added the CLA Signed label Apr 20, 2025

satymish added the Implementation label Apr 22, 2025

AlaoPrado added 2 commits May 17, 2025 18:40

Merge branch 'main' into faiss_imipq

4afb07d

fix(faiss/gpu/demos): add a conditional for MPI demos; remove scripts…

99e5ccb

… folder since they aren't strictly necessary

AlaoPrado added 4 commits May 25, 2025 21:40

refactor(faiss/gpu/impl): remove unused code from IMIAppend

2182ba2

fix(faiss): remove unused include files from IndexShards

4e7825c

Merge branch 'faiss_imipq' of github.com:Efficient-ANN/faiss into fai…

8102906

…ss_imipq

Merge branch 'main' into faiss_imipq

d2fe945

mnorris11 added the GPU label May 27, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: implement indexes, tests, demos and utilities for IMI-GPU #4301

feat: implement indexes, tests, demos and utilities for IMI-GPU #4301

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

feat: implement indexes, tests, demos and utilities for IMI-GPU #4301

Are you sure you want to change the base?

feat: implement indexes, tests, demos and utilities for IMI-GPU #4301

Uh oh!

Conversation

Uh oh!

Action Required

Process

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!