feat: implement indexes, tests, demos and utilities for IMI-GPU by AlaoPrado · Pull Request #4301 · facebookresearch/faiss · GitHub

feat: implement indexes, tests, demos and utilities for IMI-GPU #4301


Open · wants to merge 7 commits into main
Conversation

AlaoPrado

Hi Folks!

In the last few years, I have been working on a project with colleagues from the Department of Computer Science at UFMG (DCC/UFMG) and associates to implement and validate the Inverted Multi-Index for GPU (IMI-GPU). We recently published our work, "IMI-GPU: Inverted multi-index for billion-scale approximate nearest neighbor search with GPUs", in the Journal of Parallel and Distributed Computing, and I now come with the implemented code to contribute it back to you.

This PR contains all the changes I considered relevant after merging with the original code, resolving the conflicts, aligning a few details, and squashing into a single commit. Below, I list the authors of the project, the publication, a summary of the changes, the architectural decisions, and the limitations.

Please let me know your thoughts on merging this. I tried to keep things as modular and non-invasive as possible while working in parallel on a separate repository, but it has been a while since the initial fork, so I assume there are things to improve and newer good practices to apply. Still, it would be great to land this without major changes and open new issues for future large improvements/refactoring. I look forward to finding the best way to merge this back into the original code :)

Authors of the Project

  • Alan Araujo (Me)
  • Willian Barreiros Jr.
  • Jun Kong
  • Renato Ferreira
  • George Teodoro

Publication

Summary
This code implements a new GpuIndexIMIPQ class, the main C++ API for initializing and accessing the IMI-GPU. Along with it, there are also GpuIndexIMI, its abstract parent class, and GpuMultiIndex2, the IMI coarse quantizer with two codebooks. The GPU implementations of IMI-GPU and its coarse quantizer are, respectively, IMIPQ and MultiIndex2. In addition to the indexes, the multi-sequence algorithm, a central part of this solution, can be found in the faiss/gpu/utils/MultiSequence files. Finally, the search steps of the original IVFPQ were adapted, and new variations were implemented in their respective files for IMIPQ. We have tests and demos for the main new indexes and algorithms in faiss/gpu/test and faiss/gpu/demos. The demos contain the code to reproduce our main experiments, plus a few extra features, including the capability to run both IMIPQ and IVFPQ on CPU and GPU from C++, with multiple index shards/replicas and processes via OpenMP and MPI. For our experiments, we used custom bash scripts with personal environment variables, but I extracted general usage examples and put them into faiss/gpu/demos/scripts.

Architectural Decisions
As mentioned, the idea was to avoid being invasive, so that the comparisons between IMI-GPU and IVFPQ would be fair. This resulted in some duplication of the flow adapted from IVFPQ. For example, a possible main point of concern is the code duplicated between GpuIndexIMI and GpuIndex. Since x is always a contiguous list of d-dimensional vectors, and IMI-GPU must work on x split into two sets of (d/2)-dimensional sub-vectors, we decided to put the split step into the search function and replicate it. This could be improved later, for example by calling a searchPreImpl_ that would in turn call the current searchImpl_ function; we can go into more detail during review. Furthermore, you will find the distance computation in the inverted-list phase duplicated for IMIPQ, because it was not trivial to abstract the Multi-Index as a coarse quantizer for IVFPQ: we expect a different output from it, and a different way of managing the precomputed codes when iterating over the inverted lists.

Limitations

  • This implements only the C++ API; the Python API is still missing.
  • IMIPQ currently indexes up to 2^32 inverted lists, because it assumes a short (16-bit) id for each centroid in the codebooks of the coarse quantizer, or a short2 (32-bit) pair id when considering the id of the virtual centroid. The coarse quantizer, however, already supports idx_t (or int, after replicating existing code that was changed at some point to work only with idx_t) and int2 inputs and outputs. To complete the overall support, the multi-sequence algorithm must be reviewed, and the distance computation functions for the inverted lists must be adapted.
  • Only the L2 metric and 32-bit floating-point distances are supported for IMIPQ.

Thanks, and Kind Regards.

@facebook-github-bot
Contributor

Hi @AlaoPrado!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@facebook-github-bot
Contributor

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@mdouze
Contributor
mdouze commented Apr 22, 2025

Hi @AlaoPrado

Thanks for this remarkable effort!

Historically, we did not prioritize porting the IMI to GPU because a HNSW coarse quantizer offers better speed-accuracy tradeoffs (on CPU). Therefore, the options in the current Faiss are:

  1. use a flat coarse quantizer
  2. decrease the number of centroids, which slows down the scanning phase
  3. use a CPU HNSW
  4. use the Cagra graph-based implementation (with CuVS enabled)

So it could be that your GPU implementation reaches operating points that were not available before.

I skimmed through the paper, which includes a comparison between IVFPQ with a flat coarse quantizer and your approach. However, it lacks (1) an evaluation with several numbers of centroids and (2) a comparison with options 2-4 above.

One question: on CPU the most efficient operating points for IMI are obtained by limiting the number of distances to compute (max_codes) rather than the number of centroids to visit (nprobe). This is because with IMI the inverted lists are severely imbalanced. Did you also observe this on GPU?

Regarding the code itself, at first glance it seems that changing/adding 100 files is too much. The demos and benchmarks would be much shorter if written in Python. There is also code that seems generic across different index types (e.g. IMIAppend), so it would need stronger integration into Faiss.

@AlaoPrado
Author

Hi @mdouze,

Thanks for the review. I can go into more detail later, but the following brief comments and images may help address your points.

In our work, we focused on comparing IVFPQ (or IVFADC) and IMIPQ (or Multi-D-ADC) on GPU. Personally, I would say that, given that IVFPQ on GPU has shown its efficiency at billion scale, a new index is a valuable competitor if it shows better results for that scenario. We compared IVFPQ and IMIPQ on GPU using three different configurations for the number of centroids, always setting equivalent or lower memory usage for IMIPQ's structure; see the images below for vectors compressed into 8 codes (m = 8), 10,000 queries, and SIFT1B:
[Three charts comparing IVFPQ and IMIPQ for the three configurations]
Note: QPS means queries per second.

In the charts, the number next to the index name is the number of centroids per codebook, so note that IMIPQ has a quadratic number of inverted lists.

About your question: we haven't explored implementing this condition for stopping the iteration over the inverted lists earlier. What we did notice was that the maximum inverted-list lengths were relatively close to each other across configurations, as follows:

| Multi-D-ADC 9742 | Multi-D-ADC 11876 | Multi-D-ADC 15560 |
| --- | --- | --- |
| 181,881 | 99,520 | 100,514 |

| IVFADC 65535 | IVFADC 126491 | IVFADC 252982 |
| --- | --- | --- |
| 168,669 | 81,275 | 73,432 |

Regarding the code: OK. I think we can drastically reduce the number of files if we remove the bash scripts used for testing. For the rest of the code, let me check whether it is possible to turn it into a unified template.

Does this data help the discussion?

Thanks, and Kind Regards.

@AlaoPrado
Author

Hi,

Sorry for the delay. I had to pause for a while and focus on other personal priorities, but I am going to check whether there is a refactor that could decrease the number of changes. In the meantime, I noticed the CI action was failing due to a missing MPI package in the runner, and I expect to have it fixed now. Additionally, I removed the scripts folder from faiss/gpu/demos to decrease the number of files changed. I hope it helps.

The action was failing because the runner was missing an MPI library required by a new demo for IVFPQ and IMIPQ on multiple GPUs. I added a conditional so that the demo isn't compiled if the library isn't found.

Thanks, and Kind Regards,
Alan

@mnorris11 mnorris11 added the GPU label May 27, 2025
5 participants