### Describe the bug
When testing the cuVS Python Cagra API for certain inner product datasets, I get a low recall value. I tested the following ANN datasets with k = 100:
Dataset name | Location | Space type | Dimensions | Documents | Normalized | Recall with cuVS API | Recall with FAISS API |
---|---|---|---|---|---|---|---|
coherev2-dbpedia | https://huggingface.co/datasets/navneet1v/datasets/resolve/main/coherev2-dbpedia.hdf5?download=true | inner product | 4096 | 450,000 | No | 98.6% | 75.5% |
FlickrImagesTextQueries | https://huggingface.co/datasets/navneet1v/datasets/resolve/main/FlickrImagesTextQueries.hdf5?download=true | inner product | 512 | 1,831,403 | Yes | 11.9% | 82.1% |
marco-tasb | https://huggingface.co/datasets/navneet1v/datasets/resolve/main/marco_tasb.hdf5?download=true | inner product | 768 | 1,000,000 | No | 51.4% | 93.1% |
cohere-768-ip | https://dbyiw3u3rf9yr.cloudfront.net/corpora/vectorsearch/cohere-wikipedia-22-12-en-embeddings/documents-1m.hdf5.bz2 | inner product | 768 | 1,000,000 | No | 12.7% | 82.6% |
I've also added the recall I get when I use the FAISS Cagra Python API. Both the cuVS and FAISS tests used an intermediate graph degree of 64 and a graph degree of 32.
Except for coherev2-dbpedia, all datasets gave a significantly lower recall value for the cuVS Python API compared with FAISS. I have not seen this issue with the L2 datasets I've tested with.
### Steps/Code to reproduce bug
These are the steps to reproduce the issue with the cuVS Python API:
- On a server with GPUs, clone https://github.com/navneet1v/VectorSearchForge
- The server must have `git` and `docker` installed
- The server must have the `nvidia` developer tools installed, such as `nvidia-smi` and `nvidia-container-toolkit`
- `cd` into the `cuvs_benchmarks` folder, and create a temp directory to store the logs:
  ```
  mkdir ./benchmarks_files
  chmod 777 ./benchmarks_files
  ```
- Build the docker image: `docker build -t <your_image_name> .`
- Run the image: `docker run -v ./benchmarks_files:/tmp/files --gpus all <your_image_name>`
The cuVS Cagra API is called in this function: https://github.com/navneet1v/VectorSearchForge/blob/main/cuvs_benchmarks/main.py#L303.
The relevant code snippet looks like this:
```python
import logging
import time

import cupy as cp
from cuvs.neighbors import cagra

# downloadDataSetForWorkload, prepare_indexing_dataset, prepare_search_dataset,
# and recall_at_r are helpers defined elsewhere in main.py
logging.info(f"Running for workload {workload['dataset_name']}")
file = downloadDataSetForWorkload(workload)
d, xb, ids = prepare_indexing_dataset(file, workload["normalize"])
index_params = cagra.IndexParams(
    intermediate_graph_degree=64,
    graph_degree=32,
    build_algo="ivf_pq",
    metric="inner_product",
)
index = cagra.build(index_params, xb)
d, xq, gt = prepare_search_dataset(file, workload["normalize"])
xq = cp.asarray(xq)
search_params = cagra.SearchParams(itopk_size=200)
distances, neighbors = cagra.search(search_params, index, xq, 100)
logging.info("Search is done")
neighbors = cp.asnumpy(neighbors)
logging.info(f"Recall at k=100 is : {recall_at_r(neighbors, gt, 100, 100, len(xq))}")
logging.info("Sleeping for 5 seconds")
time.sleep(5)
```
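For reference, `recall_at_r` is defined elsewhere in main.py; this is a minimal sketch of the usual recall@k computation (an illustration of what I'm measuring, not the repo's actual helper):

```python
import numpy as np

def recall_at_k(neighbors, ground_truth, k, n_queries):
    """Fraction of the true top-k neighbors recovered by the ANN search."""
    hits = 0
    for i in range(n_queries):
        # count overlap between returned ids and ground-truth ids per query
        hits += len(set(neighbors[i, :k]) & set(ground_truth[i, :k]))
    return hits / (n_queries * k)

# identical lists give perfect recall
nbrs = np.array([[0, 1, 2], [3, 4, 5]])
gt = np.array([[0, 1, 2], [3, 4, 5]])
print(recall_at_k(nbrs, gt, 3, 2))  # → 1.0
```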
### Expected behavior
The recall should be > 80% for all of the datasets.
### Environment details (please complete the following information)
- Environment location: AWS EC2 g5.2xlarge, with Deep Learning Base AMI.
- Type of GPU: 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
- Method of RAFT install: conda, Docker
- cuVS and rapids are installed in this line of the Dockerfile: https://github.com/navneet1v/VectorSearchForge/blob/main/cuvs_benchmarks/Dockerfile#L5
### Additional context
I've tried using a higher intermediate_graph_degree, graph_degree, and itopk_size, but this barely improves the recall. For example, when I set intermediate_graph_degree and graph_degree to 128 and 64 respectively, with itopk_size at 200, the recall for cohere-768-ip was 12.8%. Increasing itopk_size to 500 only raised it to 12.9%.
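As context for the `normalize` flag in the workloads above: I assume it L2-normalizes the base and query vectors, which makes inner product equivalent to cosine similarity. A minimal sketch of that kind of normalization (my assumption, not the repo's actual helper):

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so inner product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    return x / norms

xb = np.random.rand(4, 8).astype(np.float32)
print(np.allclose(np.linalg.norm(l2_normalize(xb), axis=1), 1.0))  # → True
```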