8000 [BUG] cuVS Cagra Python API has low recall for inner product datasets · Issue #841 · rapidsai/cuvs · GitHub
[BUG] cuVS Cagra Python API has low recall for inner product datasets #841
Closed
@rchitale7

Description


Describe the bug
When testing the cuVS Python Cagra API for certain inner product datasets, I get a low recall value. I tested the following ANN datasets with k = 100:

| Dataset name | Location | Space type | Dimensions | Documents | Normalized | Recall with cuVS API | Recall with FAISS API |
|---|---|---|---|---|---|---|---|
| coherev2-dbpedia | https://huggingface.co/datasets/navneet1v/datasets/resolve/main/coherev2-dbpedia.hdf5?download=true | inner product | 4096 | 450,000 | No | 98.6% | 75.5% |
| FlickrImagesTextQueries | https://huggingface.co/datasets/navneet1v/datasets/resolve/main/FlickrImagesTextQueries.hdf5?download=true | inner product | 512 | 1,831,403 | Yes | 11.9% | 82.1% |
| marco-tasb | https://huggingface.co/datasets/navneet1v/datasets/resolve/main/marco_tasb.hdf5?download=true | inner product | 768 | 1,000,000 | No | 51.4% | 93.1% |
| cohere-768-ip | https://dbyiw3u3rf9yr.cloudfront.net/corpora/vectorsearch/cohere-wikipedia-22-12-en-embeddings/documents-1m.hdf5.bz2 | inner product | 768 | 1,000,000 | No | 12.7% | 82.6% |

I've also added the recall I get when I use the FAISS Cagra Python API. Both the cuVS and FAISS tests used an intermediate graph degree of 64 and a graph degree of 32.

Except for coherev2-dbpedia, all datasets gave a significantly lower recall value with the cuVS Python API than with FAISS. I have not seen this issue with the L2 datasets I've tested.
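One thing that stands out in the table is that only one of these inner-product datasets is normalized. For L2-normalized vectors, inner-product ranking coincides with cosine ranking, so normalization is a variable worth isolating when comparing recall on inner-product datasets. A minimal NumPy illustration of that equivalence (purely to show the relationship; none of this code is from the benchmark repo):

```python
import numpy as np

rng = np.random.default_rng(0)
xb = rng.standard_normal((1000, 64))  # synthetic base vectors
q = rng.standard_normal(64)           # synthetic query

# L2-normalize the base vectors
xb_norm = xb / np.linalg.norm(xb, axis=1, keepdims=True)

# Top-10 by inner product on the normalized vectors...
ip_top = np.argsort(-(xb_norm @ q))[:10]

# ...is the same set as the top-10 by cosine similarity on the raw vectors,
# because the two scores differ only by a positive per-query constant.
cos = (xb @ q) / (np.linalg.norm(xb, axis=1) * np.linalg.norm(q))
cos_top = np.argsort(-cos)[:10]

assert set(ip_top) == set(cos_top)
```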

Steps/Code to reproduce bug
These are the steps to reproduce the issue with cuVS Python API

  1. On a server with GPUs, clone https://github.com/navneet1v/VectorSearchForge
    • The server must have git and docker installed
    • The server must have NVIDIA developer tools installed, such as nvidia-smi and nvidia-container-toolkit
  2. cd into the cuvs_benchmarks folder and create a temp directory to store the logs:
mkdir ./benchmarks_files
chmod 777 ./benchmarks_files
  3. Build the docker image:
docker build -t <your_image_name> .
  4. Run the image:
docker run -v ./benchmarks_files:/tmp/files --gpus all <your_image_name>

The cuVS Cagra API is called in this function: https://github.com/navneet1v/VectorSearchForge/blob/main/cuvs_benchmarks/main.py#L303.
The relevant code snippet looks like this:

import logging
import time

import cupy as cp
from cuvs.neighbors import cagra

# downloadDataSetForWorkload, prepare_indexing_dataset,
# prepare_search_dataset, and recall_at_r are helpers defined in main.py

logging.info(f"Running for workload {workload['dataset_name']}")
file = downloadDataSetForWorkload(workload)
d, xb, ids = prepare_indexing_dataset(file, workload["normalize"])
index_params = cagra.IndexParams(
    intermediate_graph_degree=64,
    graph_degree=32,
    build_algo="ivf_pq",
    metric="inner_product",
)

index = cagra.build(index_params, xb)

d, xq, gt = prepare_search_dataset(file, workload["normalize"])
xq = cp.asarray(xq)

search_params = cagra.SearchParams(itopk_size=200)
distances, neighbors = cagra.search(search_params, index, xq, 100)

logging.info("Search is done")
neighbors = cp.asnumpy(neighbors)

logging.info(f"Recall at k=100 is : {recall_at_r(neighbors, gt, 100, 100, len(xq))}")
logging.info("Sleeping for 5 seconds")
time.sleep(5)
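The recall_at_r helper lives in the linked repository; for reference, here is a plausible sketch of what a recall@r metric with that call signature computes (the function body below is my assumption for illustration, not the repo's actual code):

```python
import numpy as np

def recall_at_r(neighbors, gt, r, k, n_queries):
    """Fraction of each query's top-r ground-truth ids that appear among
    the k returned neighbor ids, averaged over all queries."""
    hits = 0
    for i in range(n_queries):
        hits += len(set(neighbors[i][:k]) & set(gt[i][:r]))
    return hits / (n_queries * r)

# Toy sanity check: returning exactly the ground truth gives recall 1.0
gt = np.arange(20).reshape(2, 10)
assert recall_at_r(gt, gt, 10, 10, 2) == 1.0
```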

Expected behavior
The recall should be > 80% for all of the datasets.

Environment details (please complete the following information):

Additional context
I've tried using a higher intermediate_graph_degree, graph_degree, and itopk_size, but this doesn't really improve the recall. For example, when I set the intermediate_graph_degree and graph_degree to 128 and 64 respectively, with itopk_size at 200, the recall for cohere-768-ip was 12.8%. If I increased itopk_size to 500, the recall was 12.9%.

Labels: bug (Something isn't working)
Status: Done