Optimize vindex by dcherian · Pull Request #11625 · dask/dask · GitHub

Optimize vindex #11625


Merged: 10 commits merged into dask:main on Jan 2, 2025

Conversation

@dcherian (Contributor) commented Dec 27, 2024
  1. Tokenize before broadcasting.
  2. Avoid copying indexer arrays as much as possible.
  3. Use cached_max and cached_cumsum for determining chunk size and chunk bounds.
  4. Broadcast as late as possible.
  5. Use an array-centric algorithm instead of toolz.groupby.
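Steps 3 and 5 can be sketched together: a minimal, hypothetical illustration of mapping indexer values to chunks with cumulative chunk bounds and a single vectorized np.searchsorted, instead of a per-element toolz.groupby. The chunk sizes and indices below are made up; cached_max and cached_cumsum are dask internals that memoize exactly these kinds of reductions so repeated slicing does not recompute them.

```python
import numpy as np

# Hypothetical chunk sizes along one axis of a dask array.
chunks = (100, 100, 100, 50)

# Chunk boundaries: the running sum of chunk sizes (what cached_cumsum
# memoizes inside dask).
bounds = np.cumsum(chunks)  # [100, 200, 300, 350]

# Indices requested by the user.
indices = np.array([5, 150, 299, 340])

# Array-centric replacement for a per-element groupby: one vectorized
# searchsorted call finds the block each index falls into.
block = np.searchsorted(bounds, indices, side="right")

# Offset of each index within its block.
starts = np.concatenate([[0], bounds[:-1]])
within = indices - starts[block]

print(block.tolist())   # [0, 1, 2, 3]
print(within.tolist())  # [5, 50, 99, 40]
```

Using side="right" ensures an index sitting exactly on a boundary (e.g. 100) lands at offset 0 of the next block rather than past the end of the previous one.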

github-actions bot (Contributor) commented Dec 27, 2024

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

     15 files ±0       15 suites ±0   3h 59m 27s ⏱️ +2m 50s
 12 611 tests ±0   11 448 ✅ ±0    1 163 💤 ±0   0 ❌ ±0
156 885 runs  ±0  140 231 ✅ ±0   16 654 💤 ±0   0 ❌ ±0

Results for commit e23eac5. ± Comparison against base commit fe465e0.

♻️ This comment has been updated with latest results.

slicer = slice(start, stop)
key = sorted_keys[start]
outblock, *input_blocks = np.unravel_index(key, ravel_shape)
inblock = [_[slicer] for _ in sorted_inblock_idxs]
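The excerpt above recovers block coordinates from a flat sorted key. A minimal sketch of that np.unravel_index step, with a made-up ravel_shape and key (the real code derives these from the array's chunk grid):

```python
import numpy as np

# Hypothetical combined grid: (output blocks, input blocks along each axis).
ravel_shape = (4, 3, 2)

# A flat key, standing in for `sorted_keys[start]` in the excerpt.
key = np.ravel_multi_index((2, 1, 0), ravel_shape)

# np.unravel_index inverts the raveling: the first component is the
# output block, the rest are the input-block coordinates.
outblock, *input_blocks = np.unravel_index(key, ravel_shape)

print(int(outblock), [int(b) for b in input_blocks])  # 2 [1, 0]
```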
@dcherian (author) commented Dec 27, 2024

Perhaps we should just scatter sorted_inblock_idxs out as dask arrays and embed a reference to a dask array sliced by these slicers in the task, instead of the full array? The graph is quite big for the example in #11018.

@dcherian (author) commented:
Yeah, this gives "UserWarning: Sending large graph of size 738.91 MiB." for:

import dask.array as da
import numpy as np

chunksize = 100
size = 10_000
n_points = 5000

X = da.random.poisson(15, (size, size), chunks=(chunksize, chunksize))

index_0 = np.random.randint(0, X.shape[0], n_points)
index_0.sort()
index_1 = np.random.randint(0, X.shape[1], n_points)
index_1.sort()

print("vindex timing:")
X.vindex[np.ix_(index_0, index_1)].compute()

@dcherian (author) commented Jan 2, 2025

If index_0 has shape (M, 1) and index_1 has shape (1, N), it should be possible to embed M + N values instead of the current M × N.

That said, I think this PR can be merged (barring any other comments). Such a graph takes so long to build today that no one is using it.
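A small illustration of the M + N vs M × N point, using np.ix_ as in the reproducer above (the shapes and values here are made up):

```python
import numpy as np

M, N = 3, 4
index_0 = np.array([0, 5, 9])     # hypothetical sorted row indices
index_1 = np.array([1, 2, 7, 8])  # hypothetical sorted column indices

# np.ix_ returns broadcastable arrays of shape (M, 1) and (1, N): only
# M + N values are stored, yet indexing with them selects an M x N result.
rows, cols = np.ix_(index_0, index_1)
assert rows.shape == (M, 1) and cols.shape == (1, N)

# Broadcasting materializes the full M x N grid, which is what embedding
# the indices directly in every task would cost.
full_rows, full_cols = np.broadcast_arrays(rows, cols)
assert full_rows.shape == (M, N)

X = np.arange(100).reshape(10, 10)
# Same selection either way; storing rows/cols costs M + N, not M * N.
assert (X[rows, cols] == X[full_rows, full_cols]).all()
```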

@phofl (Collaborator) left a comment:

Yeah, I agree, we can merge this now; it already reduced graph size by 50%.

@phofl phofl merged commit 331fb1a into dask:main Jan 2, 2025
26 checks passed
@phofl (Collaborator) commented Jan 2, 2025

Thanks!

Linked issue: vindex as outer indexer: memory and time performance