-
Notifications
You must be signed in to change notification settings - Fork 3.9k
feat: implement indexes, tests, demos and utilities for IMI-GPU #4301
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Hi @AlaoPrado! Thank you for your pull request and welcome to our community. Action RequiredIn order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you. ProcessIn order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks! |
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks! |
Hi @AlaoPrado Thanks for this remarkable effort! Historically, we did not prioritize porting the IMI to GPU because a HNSW coarse quantizer offers better speed-accuracy tradeoffs (on CPU). Therefore, the options in the current Faiss are:
I skimmed through the paper that has a comparison between IVFPQ with flat coarse quantizer and your approach. However it lacks (1) an evaluation with several numbers of centroids (2) a comparison with options 2-4 above. One question: on CPU the most efficient operating points for IMI are obtained by limiting the number of distances to compute (max_codes) rather than the number of centroids to visit (nprobe). This is because with IMI the inverted lists are severely imbalanced. Did you also observe this on GPU? Regarding the code itself, at first glance it seems that changing/adding 100 files is too much. The demos and benchmarks would be much shorter if written in Python. There is code that seems generic across different index types (eg. IMIAppend), so would need a stronger integration in Faiss. |
Hi @mdouze, Thanks for the review. I could enter in more details later, but the following brief comments and images might help on what you said. In our work, we focused on comparing the IVFPQ (or IVFADC) and IMIPQ (or Multi-D-ADC) on GPU. Personally, I would say that, if we consider IVFPQ GPU has shown its efficiency at billion-scale, a new index would be a valuable competitor if it shows better results for this scenario. We have compared IVFPQ and IMIPQ on GPU using three different configurations for the number of centroids, always setting equivalent or lower memory usage for IMIPQ's structure, see the images below for when compressing the vector into 8 codes ( There, the number right next to the index is the number of centroids per codebook, so note IMIPQ has the quadratic number of inverted lists. About your question: we haven't explored implementing this condition for stopping earlier the iteration over the inverted list. What we noticed was that the maximum inverted list length was relatively near each other between configuration, as the following:
Regarding the code, ok. I think we can drastically reduce the number of files if we remove the bash scripts for testing. For the other part of the code, let me check if it is possible to turn them into a unified template. Do you think this data helped contribute to the discussion? Thanks, and Kind Regards. |
… folder since they aren't strictly necessary
Hi, Sorry for the delay. I had to pause for a while and focus on other personal priorities. But I am going to check if there is a refactor that could be done for decreasing the number of changes. In the meanwhile, I noticed the action was failing due to a missing MPI package in the runner and I expect to have it fixed now. Additionally, I removed the The action was failing, because the runner was missing a MPI library required by a new demo for IVFPQ and IMIPQ on multiple GPUs. I just added a conditional, so the demo isn't compiled if the library isn't found. Thanks, and Kind Regards, |
Hi Folks!
In the last few years, I have been working on a project with some guys from the Department of Computer Science at UFMG (DCC/UFMG) and associates to implement and validate the Inverted Multi-Index for GPU (IMI-GPU). We recently published our work, "IMI-GPU: Inverted multi-index for billion-scale approximate nearest neighbor search with GPUs" in the Journal of Parallel and Distributed Computing, and now I come with the implemented code to contribute back to you.
So, this PR contains all the changes I considered could be relevant after merging with the original code, solving the conflicts, aligning few stuff and making a squash commit. Below, I list the authors of the project, the links to the publication, the summary, architectural decisions, limitation, of the changes.
Please let me know what are your thoughts for merging this. I have tried to make things as modular as possible without being invasive while working in parallel on a different repository, but it has been a time since the initial fork and I assume there might something to be improved and new good practices to be applied. But it would be great if can have this deployed without major changes and open new issues for future big improvements/refactoring. I am looking forward to find the best way merge this back to the original code :)
Authors of the Project
Publication
Summary
This code implements a new
GpuIndexIMIPQ
class, the main C++ API for initializing and accessing the IMI-GPU. Along with it, there is alsoGpuIndexIMI
, its abstract parent class, and theGpuMultiIndex2
, the IMI coarse quantizer with two codebooks. The GPU implementation of IMI-GPU and its coarse quantizer are repesctivelyIMIPQ
andMultiIndex2
. In addition to the indexes, the multi-sequence algorithm, a central part of this solution can be found in thefaiss/gpu/utils/MultiSequence
files. Finally, the search steps of the originalIVFPQ
were adapted and new variations were implemented in theirs respective files forIMIPQ
. We have tests and demos for the main new indexes and algorithms infaiss/gpu/test
andfaiss/gpu/demos
. These demos contain the main code to reproduce our main experiments and few extra features. This includes the capability to run both IMIPQ and IVFPQ in CPU and GPU with C++, with multiple index shards/replicas and processes with OpenMP and MPI. For our experiments, we used custom bash scripts with personal environment variables, but I extracted the general examples of use and put them intofaiss/gpu/demos/scripts
.Architectural Decisions
As I commented, the idea was to not be invasive. In this way, we could be fair about the comparisons between IMI-GPU and IVFPQ. This resulted in some duplication for the adapted flow from IVFPQ, For example, a possible main point of concern is the code duplicated between
GpuIndexIMI
andGpuIndex
. Sincex
is always a continuous list of d-dimensional vectors, and IMI-GPU must work onx
split into two sets of (d/2)-dimension sub-vectors, we decided to put put the split step into thesearch
function and replicated it. But this could be improved later as by something as calling asearchPreImpl_
that would call the currentsearchImpl_
function. Anyway, we could wait the review for more details. Furthermore, you can find the distance computation in the inverted list phase duplicated for IMIPQ. Because it wasn’t so trivial to abstract the Multi-Index as a coarse quantizer for IVFPQ since we expect different output from it, and a different way of managing the precomputed codes when iterating over the inverted lists.Limitations
short
(16 bits) ids for each centroid in the codebooks of the coarse quantizer, orshort2
(32 bits) pair id when considering the id of the virtual centroid. But the coarse quantizer already supports working withidx_t
(orint
after replicating existing code changed at some point to work only withidx_t
) andint2
inputs and outputs. For finishing the overall support, the multi-sequence algorithm must be reviewed, and the distance computation functions for the inverted lists must be adapted.Thanks, and Kind Regards.