### Describe the bug
When testing the cuVS Python Cagra API for certain inner product datasets, I get a low recall value. I tested the following ANN datasets with k = 100:
Dataset name | Location | Space type | Dimensions | Documents | Normalized | Recall with cuVS API | Recall with FAISS API |
---|---|---|---|---|---|---|---|
coherev2-dbpedia | https://huggingface.co/datasets/navneet1v/datasets/resolve/main/coherev2-dbpedia.hdf5?download=true | inner product | 4096 | 450,000 | No | 98.6% | 75.5% |
FlickrImagesTextQueries | https://huggingface.co/datasets/navneet1v/datasets/resolve/main/FlickrImagesTextQueries.hdf5?download=true | inner product | 512 | 1,831,403 | Yes | 11.9% | 82.1% |
marco-tasb | https://huggingface.co/datasets/navneet1v/datasets/resolve/main/marco_tasb.hdf5?download=true | inner product | 768 | 1,000,000 | No | 51.4% | 93.1% |
cohere-768-ip | https://dbyiw3u3rf9yr.cloudfront.net/corpora/vectorsearch/cohere-wikipedia-22-12-en-embeddings/documents-1m.hdf5.bz2 | inner product | 768 | 1,000,000 | No | 12.7% | 82.6% |
I've also added the recall I get when I use the FAISS Cagra Python API. Both the cuVS and FAISS tests used an intermediate graph degree of 64 and a graph degree of 32.
Except for coherev2-dbpedia, all datasets gave a significantly lower recall value for the cuVS Python API compared with FAISS. I have not seen this issue with the L2 datasets I've tested with.
### Steps/Code to reproduce bug
These are the steps to reproduce the issue with the cuVS Python API:
- On a server with GPUs, clone https://github.com/navneet1v/VectorSearchForge
- The server must have `git` and `docker` installed
- The server must have the `nvidia` developer tools installed, such as `nvidia-smi` and `nvidia-container-toolkit`
- `cd` into the `cuvs_benchmarks` folder, and create a temp directory to store the logs:
  ```
  mkdir ./benchmarks_files
  chmod 777 ./benchmarks_files
  ```
- Build the docker image: `docker build -t <your_image_name> .`
- Run the image: `docker run -v ./benchmarks_files:/tmp/files --gpus all <your_image_name>`
The cuVS Cagra API is called in this function: https://github.com/navneet1v/VectorSearchForge/blob/main/cuvs_benchmarks/main.py#L303.
The relevant code snippet looks like this:
```python
import logging
import time

import cupy as cp
from cuvs.neighbors import cagra

# downloadDataSetForWorkload, prepare_indexing_dataset, prepare_search_dataset,
# and recall_at_r are helpers defined elsewhere in main.py
logging.info(f"Running for workload {workload['dataset_name']}")
file = downloadDataSetForWorkload(workload)
d, xb, ids = prepare_indexing_dataset(file, workload["normalize"])
index_params = cagra.IndexParams(
    intermediate_graph_degree=64,
    graph_degree=32,
    build_algo="ivf_pq",
    metric="inner_product",
)
index = cagra.build(index_params, xb)
d, xq, gt = prepare_search_dataset(file, workload["normalize"])
xq = cp.asarray(xq)
search_params = cagra.SearchParams(itopk_size=200)
distances, neighbors = cagra.search(search_params, index, xq, 100)
logging.info("Search is done")
neighbors = cp.asnumpy(neighbors)
logging.info(f"Recall at k=100 is : {recall_at_r(neighbors, gt, 100, 100, len(xq))}")
logging.info("Sleeping for 5 seconds")
time.sleep(5)
```
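For reference, `recall_at_r` is defined elsewhere in main.py; this is a minimal sketch of the usual recall@k computation (an illustration of what I'm measuring, not the repo's actual helper):

```python
import numpy as np

def recall_at_k(neighbors, ground_truth, k, n_queries):
    """Fraction of the true top-k neighbors recovered by the ANN search."""
    hits = 0
    for i in range(n_queries):
        # count overlap between returned ids and ground-truth ids per query
        hits += len(set(neighbors[i, :k]) & set(ground_truth[i, :k]))
    return hits / (n_queries * k)

# identical lists give perfect recall
nbrs = np.array([[0, 1, 2], [3, 4, 5]])
gt = np.array([[0, 1, 2], [3, 4, 5]])
print(recall_at_k(nbrs, gt, 3, 2))  # → 1.0
```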
### Expected behavior
The recall should be > 80% for all of the datasets.
### Environment details (please complete the following information)
- Environment location: AWS EC2 g5.2xlarge, with Deep Learning Base AMI.
- Type of GPU: 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
- Method of RAFT install: conda, Docker
- cuVS and rapids are installed in this line of the Dockerfile: https://github.com/navneet1v/VectorSearchForge/blob/main/cuvs_benchmarks/Dockerfile#L5
### Additional context
I've tried using a higher intermediate_graph_degree, graph_degree, and itopk_size, but this barely improves the recall. For example, when I set intermediate_graph_degree and graph_degree to 128 and 64 respectively, with itopk_size at 200, the recall for cohere-768-ip was 12.8%. Increasing itopk_size to 500 only raised it to 12.9%.
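As context for the `normalize` flag in the workloads above: I assume it L2-normalizes the base and query vectors, which makes inner product equivalent to cosine similarity. A minimal sketch of that kind of normalization (my assumption, not the repo's actual helper):

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so inner product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    return x / norms

xb = np.random.rand(4, 8).astype(np.float32)
print(np.allclose(np.linalg.norm(l2_normalize(xb), axis=1), 1.0))  # → True
```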