8000 Ensure default dask scheduler only compute what's needed by fjetter · Pull Request #11861 · dask/dask · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Ensure default dask scheduler only compute what's needed #11861

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Apr 4, 2025

Conversation

fjetter
Copy link
Member
@fjetter fjetter commented Apr 2, 2025

Looks like the default schedulers are not culling and therefore may compute unnecessary results if the provided dsk contains too many keys. That's not something that should actually happen in reality but it's also not great.

I noticed this in #11859 because the code was producing malformed graphs but only a single test in the entire test suite tripped. The test was concerning some bag optimization and was only caught due to the particular way the test was written. This should be more robust.

@fjetter fjetter requested a review from Copilot April 2, 2025 13:53
Copy link
@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enhances the dask scheduler by ensuring it computes only the necessary tasks. It adjusts waiting_data in tests to match expected behavior, adds a new test to verify the scheduler culling logic, and updates start_state_from_dask in dask/local.py to accept a new keys parameter for improved dependency resolution.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
dask/tests/test_local.py Updated waiting_data expectations and added a new test for culling
dask/local.py Updated start_state_from_dask signature and dependency resolution
Comments suppressed due to low confidence (2)

dask/local.py:144

  • The new 'keys' parameter is not documented in the function's docstring. Please update the docstring to explain its purpose and usage.
def start_state_from_dask(dsk, cache=None, sortkey=None, keys=None):

dask/tests/test_local.py:196

  • [nitpick] Consider expanding the test coverage by including scenarios with multiple tasks sharing a dependency to further validate that only the needed computations are executed.
def test_ensure_calculate_only_whats_needed():

Copy link
Contributor
github-actions bot commented Apr 2, 2025

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

      9 files  ±0        9 suites  ±0   3h 28m 56s ⏱️ +16s
 18 002 tests +1   16 788 ✅ +1   1 214 💤 ±0  0 ❌ ±0 
161 092 runs  +9  148 983 ✅ +7  12 109 💤 +2  0 ❌ ±0 

Results for commit 8118969. ± Comparison against base commit d9e07f5.

♻️ This comment has been updated with latest results.

@fjetter fjetter requested a review from Copilot April 2, 2025 14:50
Copy link
@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR addresses an issue in the Dask scheduler by ensuring that only the necessary tasks are computed, thereby preventing unnecessary computations when the provided dask graph contains excessive keys.

  • Updated test cases in test_local.py to reflect the optimized behavior.
  • Modified start_state_from_dask in local.py to accept a keys parameter and cull unneeded computations.
  • Adjusted dependency tracking and task readiness logic accordingly.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
dask/tests/test_local.py Updated expected waiting_data values and added a new test to verify only required computations are performed.
dask/local.py Refactored start_state_from_dask to filter tasks based on provided keys and reworked dependency tracking.
Comments suppressed due to low confidence (1)

dask/local.py:144

  • [nitpick] Consider renaming the parameter 'keys' to 'target_keys' to improve clarity on its purpose.
def start_state_from_dask(dsk, cache=None, sortkey=None, keys=None):

Copy link
Member
@hendrikmakait hendrikmakait left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test and code look good to me. I havn't dug too deep into the logic of start_state_from_dask but I assume existing tests would blow up if something broke.

@fjetter
Copy link
Member Author
fjetter commented Apr 4, 2025

but I assume existing tests would blow up if something broke.

Yes. Pretty much everything is running through this.

@fjetter fjetter merged commit d9a9dd7 into dask:main Apr 4, 2025
24 checks passed
@fjetter fjetter deleted the default_dask_scheduler_culling branch April 4, 2025 09:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0