Ensure default dask scheduler only compute what's needed #11861
Conversation
Pull Request Overview
This PR enhances the dask scheduler by ensuring it computes only the necessary tasks. It adjusts waiting_data in tests to match expected behavior, adds a new test to verify the scheduler culling logic, and updates start_state_from_dask in dask/local.py to accept a new keys parameter for improved dependency resolution.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
| --- | --- |
| dask/tests/test_local.py | Updated waiting_data expectations and added a new test for culling |
| dask/local.py | Updated start_state_from_dask signature and dependency resolution |
Comments suppressed due to low confidence (2)
dask/local.py:144
- The new 'keys' parameter is not documented in the function's docstring. Please update the docstring to explain its purpose and usage.
def start_state_from_dask(dsk, cache=None, sortkey=None, keys=None):
dask/tests/test_local.py:196
- [nitpick] Consider expanding the test coverage by including scenarios with multiple tasks sharing a dependency to further validate that only the needed computations are executed.
def test_ensure_calculate_only_whats_needed():
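As an illustration of the coverage suggested above, a test along these lines could check that a shared dependency runs exactly once and an unrequested task never runs. This is a hypothetical sketch (the test name, the `run` helper, and the key names are made up, not part of this PR), assuming the culling behavior this PR adds to the synchronous scheduler:

```python
from dask.local import get_sync


def test_cull_shared_dependency():
    # Hypothetical sketch: two requested keys share one dependency, while an
    # unrelated key in the same graph must never be executed.
    executed = []

    def run(label, *args):
        executed.append(label)
        return len(args)

    dsk = {
        "shared": (run, "ran-shared"),
        "x": (run, "ran-x", "shared"),
        "y": (run, "ran-y", "shared"),
        "unneeded": (run, "ran-unneeded"),
    }

    assert get_sync(dsk, ["x", "y"]) == (1, 1)
    assert "ran-unneeded" not in executed  # culled, never computed
    assert executed.count("ran-shared") == 1  # shared dependency runs once
```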
Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

9 files ±0   9 suites ±0   3h 28m 56s ⏱️ +16s

Results for commit 8118969. ± Comparison against base commit d9e07f5.

♻️ This comment has been updated with latest results.
Pull Request Overview
This PR addresses an issue in the Dask scheduler by ensuring that only the necessary tasks are computed, thereby preventing unnecessary computation when the provided dask graph contains more keys than are needed for the requested outputs.
- Updated test cases in test_local.py to reflect the optimized behavior.
- Modified start_state_from_dask in local.py to accept a keys parameter and cull unneeded computations.
- Adjusted dependency tracking and task readiness logic accordingly.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
| --- | --- |
| dask/tests/test_local.py | Updated expected waiting_data values and added a new test to verify only required computations are performed. |
| dask/local.py | Refactored start_state_from_dask to filter tasks based on provided keys and reworked dependency tracking. |
Comments suppressed due to low confidence (1)
dask/local.py:144
- [nitpick] Consider renaming the parameter 'keys' to 'target_keys' to improve clarity on its purpose.
def start_state_from_dask(dsk, cache=None, sortkey=None, keys=None):
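For context on what the keys parameter enables: a key-based cull can be expressed with dask's existing graph utilities. The sketch below only illustrates the idea and is not the PR's actual change to start_state_from_dask; the helper name `cull_to_keys` is made up. `dask.optimization.cull` implements essentially this traversal at the graph-optimization level.

```python
from dask.core import get_dependencies


def cull_to_keys(dsk, keys):
    """Keep only the tasks that the requested keys transitively depend on.

    Illustrative sketch only; not the PR's implementation.
    """
    if keys is None:
        return dict(dsk)
    needed = set()
    stack = list(keys)
    while stack:
        key = stack.pop()
        if key in needed:
            continue
        needed.add(key)
        # Walk backwards through the graph from the requested keys.
        stack.extend(get_dependencies(dsk, key))
    return {k: dsk[k] for k in needed}
```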
The test and code look good to me. I haven't dug too deep into the logic of start_state_from_dask, but I assume existing tests would blow up if something broke.

Yes. Pretty much everything is running through this.
Looks like the default schedulers are not culling and may therefore compute unnecessary results if the provided dsk contains more keys than needed. That's not something that should actually happen in reality, but it's also not great.

I noticed this in #11859 because the code was producing malformed graphs, yet only a single test in the entire test suite tripped. That test concerned a bag optimization, and the problem was only caught because of the particular way the test was written. This should be more robust.
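For reference, this trimming is what graph-level culling already does elsewhere in dask: `dask.optimization.cull` reduces a graph to the tasks reachable from the requested keys, so a scheduler working from the culled graph cannot touch anything else. A small illustration (the key names are made up):

```python
from dask.optimization import cull

# A graph that carries more keys than the requested output needs.
dsk = {
    "a": 1,
    "b": (lambda x: x + 1, "a"),
    "orphan": (lambda: 1 / 0,),  # unrelated to "b"; must never run
}

# Trim the graph down to what "b" actually requires.
culled, dependencies = cull(dsk, ["b"])
assert set(culled) == {"a", "b"}
assert "orphan" not in culled
```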