Fix duplicate indices in batch NN Descent by jinsolp · Pull Request #702 · rapidsai/cuvs · GitHub

Fix duplicate indices in batch NN Descent #702


Merged

merged 13 commits into rapidsai:branch-25.04 on Mar 11, 2025

Conversation

@jinsolp (Contributor) commented Feb 14, 2025

Purpose of this PR

Handles duplicate indices in the batch NN Descent graph.

Resolves the following issues:

  • rapidsai/raft#2450
  • rapidsai#559
  • rapidsai#753

Notes

  • Also fixed in RAFT here (rapidsai/raft#2586) for current use with cuML

@jinsolp jinsolp requested a review from a team as a code owner February 14, 2025 22:49
@github-actions github-actions bot added the cpp label Feb 14, 2025
@cjnolet cjnolet added the bug (Something isn't working) and non-breaking (Introduces a non-breaking change) labels Feb 18, 2025
@@ -241,7 +241,7 @@ class AnnNNDescentBatchTest : public ::testing::TestWithParam<AnnNNDescentBatchI
 index_params.metric = ps.metric;
 index_params.graph_degree = ps.graph_degree;
 index_params.intermediate_graph_degree = 2 * ps.graph_degree;
-index_params.max_iterations = 10;
+index_params.max_iterations = 100;
Member
Do you still get duplicates with 10 iterations?

@jinsolp (Contributor, Author) commented Feb 19, 2025
This is to be safe and make sure that the graph converges properly (I noticed that it sometimes takes more than 10 iterations to converge; around 15 ~ 20 for a dataset of this size).
Running enough iterations becomes important in batching, because otherwise randomly initialized indices (with float-max distances) stay in the graph, which may cause duplicate indices (one with a proper distance, and the other with a float-max distance).
Also, the non-batching test at the top of this file uses 100 as max iterations, so I wanted to be consistent!
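
A minimal sketch of the failure mode described above (hypothetical data and helper, not the cuVS test code): one row of a batched kNN graph where a leftover randomly-initialized entry keeps its float-max distance and duplicates an index that also appears with a proper distance.

```python
import numpy as np

FLOAT_MAX = np.finfo(np.float32).max

# Hypothetical single row of a kNN graph with graph_degree = 4.
# Index 17 appears twice: once with its real distance, and once as a
# randomly initialized entry whose distance was never updated.
indices = np.array([[17, 3, 17, 42]], dtype=np.int64)
distances = np.array([[0.12, 0.25, FLOAT_MAX, 0.31]], dtype=np.float32)

def rows_with_duplicates(idx):
    """Return the row ids whose neighbor lists contain repeated indices."""
    return [r for r, row in enumerate(idx) if np.unique(row).size != row.size]

print(rows_with_duplicates(indices))         # [0] -> row 0 has a duplicate
print(bool((distances == FLOAT_MAX).any()))  # True -> an un-updated placeholder survived
```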

@divyegala (Member) commented Feb 19, 2025

Is that the only problem? I thought the bug was that there were duplicates because of minor differences in floating point distances.

We observed a troubling behavior in UMAP with the batching algorithm. It seems that you had to set graph_degree=64 and had to trim it down to n_neighbors=10 in the default case? However, in theory, if intermediate_graph_degree is the same then we should be able to set graph_degree=n_neighbors and still get the same answer. Doing it with a higher graph_degree and then slicing it is causing us to use more memory than necessary.

cc @jcrist

@jinsolp (Contributor, Author)

> I thought the bug was that there were duplicates because of minor differences in floating point distances.

The duplicate indices issue is solved by initializing update_counter_. It turns out that while running clusters after the first one, update_counter_ sometimes stays at -1 from the first iteration, so none of the iterations within NN Descent run at all. As a result, the graph from the previous iteration is returned, which produces (seemingly) different distances between the two points.
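
For illustration only (a simplified Python sketch, not the actual cuVS internals): if a convergence counter is not re-initialized per cluster, a leftover "converged" sentinel makes the per-cluster NN Descent loop exit immediately and hand back the previous graph unchanged.

```python
def one_nn_descent_pass(graph):
    """Hypothetical stand-in for a single NN Descent update pass.
    Returns -1 once no neighbor entries changed (i.e. converged)."""
    return -1

def run_cluster(graph, max_iterations, update_counter):
    # If update_counter carries over as -1 from the previous cluster,
    # the loop body never runs and the old graph is returned as-is.
    for _ in range(max_iterations):
        if update_counter == -1:        # already marked "converged"
            break
        update_counter = one_nn_descent_pass(graph)
    return graph, update_counter

# The gist of the fix: re-initialize the counter before every cluster.
graph, counter = {}, 0
for cluster in range(3):
    counter = 0                          # reset per cluster
    graph, counter = run_cluster(graph, max_iterations=100, update_counter=counter)
```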

I believe the following issues are more on the cuML side of things:

> It seems that you had to set graph_degree=64 and had to trim it down to n_neighbors=10 in the default case?

The default n_neighbors=10 was not chosen by me (it was set to 10 before using NN Descent, so I just left it like that), and graph_degree=64 is there to match the RAFT-side NN Descent index initialization (which was also the default graph degree for NN Descent before linking it with UMAP).

> we should be able to set graph_degree=n_neighbors and still get the same answer. Doing it with a higher graph_degree and then slicing it is causing us to use more memory than necessary.

We do get the same answer, i.e. changing the tests in cuML like this so that nnd_graph_degree is equal to n_neighbors works fine:

cuml_model = cuUMAP(n_neighbors=10, build_algo=build_algo, build_kwds={"nnd_graph_degree": 10})

Should I change the default values in cuML's umap.pyx to match the default value of n_neighbors for memory efficiency?

Member

> We do get the same answer, i.e. changing the tests in cuML like this so that nnd_graph_degree is equal to n_neighbors works fine:

To clarify, are you saying that after this PR you get the same result with n_neighbors=10, nnd_graph_degree=10 and n_neighbors=10, nnd_graph_degree=64? Before, this wasn't the case (we thought this was due to the duplicates problem). If so, we should update cuML to change the defaults (and avoid the copy), but that will need to happen after the coming patch release. I also have a branch somewhere where I did this before we noticed the bug; happy to push that up once this is merged.
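
One rough way to check this on the cuML side (a sketch only; it assumes a cuML build that includes the corresponding RAFT fix, uses the build_algo/build_kwds parameters shown in the test snippet above, and treats trustworthiness as a proxy for "same answer"):

```python
import numpy as np
from cuml.manifold import UMAP
from cuml.metrics import trustworthiness

X = np.random.random_sample((50_000, 32)).astype(np.float32)

def embed(nnd_graph_degree):
    model = UMAP(
        n_neighbors=10,
        build_algo="nn_descent",
        build_kwds={"nnd_graph_degree": nnd_graph_degree},
        random_state=42,
    )
    return model.fit_transform(X)

emb_small = embed(nnd_graph_degree=10)   # graph_degree == n_neighbors, no slicing
emb_large = embed(nnd_graph_degree=64)   # larger graph built, then trimmed

# Expect comparable embedding quality if the two settings really agree.
print(trustworthiness(X, emb_small, n_neighbors=10))
print(trustworthiness(X, emb_large, n_neighbors=10))
```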

@jinsolp (Contributor, Author)

Ahhh I see. Yes, I checked by building cuML with the corresponding fixes in the RAFT branch (linked above in the PR).
I looked into it manually and also checked that the cuML tests run properly with the changed nnd_graph_degree.

Member

@jinsolp thanks for the explanations. In cuML, we do not need to change the default value of n_neighbors. We just need to set graph_degree = n_neighbors when running UMAP, so that we can remove the unnecessary matrix slice which is causing memory overconsumption. Can you replicate this PR in RAFT?

@jcrist can you quickly push up your branch and test if Jinsol's changes work once she has a PR up in RAFT?
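
For context, a rough back-of-the-envelope on the slicing cost (assumed dataset size and index/distance dtypes, not a measurement): building the intermediate graph at graph_degree=64 and then trimming to n_neighbors=10 keeps a much larger all-neighbors buffer alive than building at the target degree directly.

```python
import numpy as np

n_rows = 10_000_000                     # hypothetical dataset size
n_neighbors, graph_degree = 10, 64

# One graph entry = an int64 neighbor index + a float32 distance (assumed dtypes).
bytes_per_entry = np.dtype(np.int64).itemsize + np.dtype(np.float32).itemsize

sliced = n_rows * graph_degree * bytes_per_entry   # build at degree 64, slice to 10
direct = n_rows * n_neighbors * bytes_per_entry    # build at n_neighbors directly

print(f"degree-64 graph: {sliced / 1e9:.1f} GB")   # ~7.7 GB
print(f"degree-10 graph: {direct / 1e9:.1f} GB")   # ~1.2 GB
```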

@jinsolp (Contributor, Author)

The PR is already here : )

copy-pr-bot commented Mar 8, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

@divyegala (Member)

/ok to test

@jinsolp (Contributor, Author) commented Mar 11, 2025

@divyegala ready to merge!

@divyegala (Member)

/merge

@rapids-bot rapids-bot bot merged commit bd6d4a9 into rapidsai:branch-25.04 Mar 11, 2025
64 checks passed
@jinsolp jinsolp deleted the batch-nnd branch March 11, 2025 00:32
jiangyinzuo pushed a commit to jiangyinzuo/cuvs that referenced this pull request Mar 27, 2025
### Purpose of this PR
Handling duplicate indices in batch NN Descent graph.

Resolves the following issues
- rapidsai/raft#2450
- rapidsai#559
- rapidsai#753
### Notes
- Also fixed in RAFT [here](rapidsai/raft#2586) for current use with cuML

Authors:
  - Jinsol Park (https://github.com/jinsolp)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Divye Gala (https://github.com/divyegala)

URL: rapidsai#702
Labels: bug (Something isn't working), cpp, non-breaking (Introduces a non-breaking change)

4 participants