Ensure default dask scheduler only compute what's needed #11861
Conversation
Pull Request Overview
This PR enhances the dask scheduler by ensuring it computes only the necessary tasks. It adjusts waiting_data in tests to match expected behavior, adds a new test to verify the scheduler culling logic, and updates start_state_from_dask in dask/local.py to accept a new keys parameter for improved dependency resolution.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
| --- | --- |
| dask/tests/test_local.py | Updated waiting_data expectations and added a new test for culling |
| dask/local.py | Updated start_state_from_dask signature and dependency resolution |
Comments suppressed due to low confidence (2)
dask/local.py:144
- The new 'keys' parameter is not documented in the function's docstring. Please update the docstring to explain its purpose and usage.
def start_state_from_dask(dsk, cache=None, sortkey=None, keys=None):
dask/tests/test_local.py:196
- [nitpick] Consider expanding the test coverage by including scenarios with multiple tasks sharing a dependency to further validate that only the needed computations are executed.
def test_ensure_calculate_only_whats_needed():
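As an illustration of the coverage suggested above, a test along these lines could check that a shared dependency runs exactly once and an unrequested task never runs. This is a hypothetical sketch (the test name, the `run` helper, and the key names are made up, not part of this PR), assuming the culling behavior this PR adds to the synchronous scheduler:

```python
from dask.local import get_sync


def test_cull_shared_dependency():
    # Hypothetical sketch: two requested keys share one dependency, while an
    # unrelated key in the same graph must never be executed.
    executed = []

    def run(label, *args):
        executed.append(label)
        return len(args)

    dsk = {
        "shared": (run, "ran-shared"),
        "x": (run, "ran-x", "shared"),
        "y": (run, "ran-y", "shared"),
        "unneeded": (run, "ran-unneeded"),
    }

    assert get_sync(dsk, ["x", "y"]) == (1, 1)
    assert "ran-unneeded" not in executed  # culled, never computed
    assert executed.count("ran-shared") == 1  # shared dependency runs once
```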
Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

9 files ±0   9 suites ±0   3h 28m 56s ⏱️ +16s

Results for commit 8118969. ± Comparison against base commit d9e07f5.

♻️ This comment has been updated with latest results.
Pull Request Overview
This PR addresses an issue in the Dask scheduler by ensuring that only the necessary tasks are computed, thereby preventing unnecessary computation when the provided dask graph contains more keys than are needed for the requested outputs.
- Updated test cases in test_local.py to reflect the optimized behavior.
- Modified start_state_from_dask in local.py to accept a keys parameter and cull unneeded computations.
- Adjusted dependency tracking and task readiness logic accordingly.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
| --- | --- |
| dask/tests/test_local.py | Updated expected waiting_data values and added a new test to verify only required computations are performed. |
| dask/local.py | Refactored start_state_from_dask to filter tasks based on provided keys and reworked dependency tracking. |
Comments suppressed due to low confidence (1)
dask/local.py:144
- [nitpick] Consider renaming the parameter 'keys' to 'target_keys' to improve clarity on its purpose.
def start_state_from_dask(dsk, cache=None, sortkey=None, keys=None):
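For context on what the keys parameter enables: a key-based cull can be expressed with dask's existing graph utilities. The sketch below only illustrates the idea and is not the PR's actual change to start_state_from_dask; the helper name `cull_to_keys` is made up. `dask.optimization.cull` implements essentially this traversal at the graph-optimization level.

```python
from dask.core import get_dependencies


def cull_to_keys(dsk, keys):
    """Keep only the tasks that the requested keys transitively depend on.

    Illustrative sketch only; not the PR's implementation.
    """
    if keys is None:
        return dict(dsk)
    needed = set()
    stack = list(keys)
    while stack:
        key = stack.pop()
        if key in needed:
            continue
        needed.add(key)
        # Walk backwards through the graph from the requested keys.
        stack.extend(get_dependencies(dsk, key))
    return {k: dsk[k] for k in needed}
```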
The test and code look good to me. I haven't dug too deep into the logic of start_state_from_dask, but I assume existing tests would blow up if something broke.

Yes. Pretty much everything is running through this.
Looks like the default schedulers are not culling and may therefore compute unnecessary results if the provided dsk contains more keys than needed. That's not something that should actually happen in reality, but it's also not great.

I noticed this in #11859 because the code was producing malformed graphs, yet only a single test in the entire test suite tripped. That test concerned a bag optimization, and the problem was only caught because of the particular way the test was written. This should be more robust.
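For reference, this trimming is what graph-level culling already does elsewhere in dask: `dask.optimization.cull` reduces a graph to the tasks reachable from the requested keys, so a scheduler working from the culled graph cannot touch anything else. A small illustration (the key names are made up):

```python
from dask.optimization import cull

# A graph that carries more keys than the requested output needs.
dsk = {
    "a": 1,
    "b": (lambda x: x + 1, "a"),
    "orphan": (lambda: 1 / 0,),  # unrelated to "b"; must never run
}

# Trim the graph down to what "b" actually requires.
culled, dependencies = cull(dsk, ["b"])
assert set(culled) == {"a", "b"}
assert "orphan" not in culled
```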