-
-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Better documentation for expression system #11915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Unit Test ResultsSee test report for an extended history of previous test failures. This is useful for diagnosing flaky tests. 9 files ±0 9 suites ±0 3h 20m 31s ⏱️ + 2m 48s Results for commit 1cd7023. ± Comparison against base commit 01a6a37. ♻️ This comment has been updated with latest results. |
see #11918 for CI failures |
The docs build is failing with
This feels unrelated. I suspect this is related to the setuptools 80 release |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this @fjetter . Some high level comments
|
||
.. currentmodule:: dask.dataframe | ||
|
||
The expression system was originally developed for Dask DataFrames, as implemented in the `dask-expr <https://github.com/dask/dask-expr>`_ project. Early prototypes experimented with `matchpy <https://github.com/HPAC/matchpy>`_, but it was soon replaced with a custom-built system offering simpler and more transparent optimization steps. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend dropping matchpy (not sure that the history is relevant)
Maybe also point out the larger objective beyond just dataframes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe also good to mention motivations here (high level optimization, more compact representation to move around, etc..)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't intend to motivate why this is here. If anything, this document is supposed to explain a couple of gotchas around the system. If one is looking for motivation or high level introduction, we have https://docs.dask.org/en/stable/dataframe-optimizer.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that motivating the expression system as a whole is still a good idea. The linked doc mostly says "we do optimizations and here's what they are". I think something higher level would be useful. Not critical though.
Construction | ||
^^^^^^^^^^^^ | ||
|
||
The expression system centers around the Expr class, which represents a computation on a Dask DataFrame. This class is designed for subclassing; each subclass corresponds to a specific computation type (e.g., arithmetic, logical operations, filtering, joins). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks duplicative of the paragraph above.
Instead, expression classes use a dataclass-like interface defined by two attributes: | ||
|
||
* ``_parameters``: List of parameter names | ||
|
||
* ``_defaults``: Dictionary of default values for optional parameters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This stuff seems good, and probably more important than the note about custom initializers. I recommend leading with it, and probably with other structural parts of the expression system (children, expected methods, etc.) before we get into caveats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree about the importance. I don't intend to document methods in this document
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me ask a broader question then, if someone wanted to write an Expression powered version of some other API (maybe Bag or something) what would they read in order to learn how to do that? Is there a doc already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know exactly how that works so it is difficult to write down documentation for that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't know how the expression system works, or how bag works?
I think you know how the expression system works. My request is that you write down the information you would want someone to know if they were to start trying to build a new API on top of it. Why might they want to do that? What are the key parts of the interface they should be aware of? etc..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the old examples about collections and added one using Expressions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which also includes an overdue update regarding the task spec
Optimization Procedure | ||
---------------------- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section seems more important to me than tokenization and caching. I recommend moving it up.
Fuse | ||
^^^^ | ||
|
||
Linear chains of blockwise tasks are combined into a single task, minimizing scheduler overhead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this could use a brief example. It's not immediately clear to me.
Also, I thought that there was a global pass around deduplication. Is that no longer happening?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what you are referring to. Nothing is explicitly deduplicated since everything is implicitly deduplicated by the tokens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, not actual deduplication at the node level, but things like reading the same parquet file twice, after we push down columns onto a read_parquet expression. My understanding is that there was a non-traversal optimization pass. Something that looked at things globally (my knowledge is out of date though).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is no global optimization step. this is all down with simplify_up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR improves documentation for the expression system internals and updates related symbol exports.
- Updated documentation in dask/_expr.py with new docstrings for dask_keys, _layer, and dask_graph.
- Added an "Alias" symbol export in dask/task_spec.py.
- Updated dask/init.py to import new expression-related symbols.
Reviewed Changes
Copilot reviewed 3 out of 8 changed files in this pull request and generated no comments.
File | Description |
---|---|
dask/task_spec.py | Added import and export for the Alias symbol. |
dask/_expr.py | Introduced dask_keys and updated docstrings in _layer and dask_graph. |
dask/init.py | Imported new expression-related classes from _expr. |
Files not reviewed (5)
- docs/source/custom-collections.rst: Language not supported
- docs/source/expr-system-internals.rst: Language not supported
- docs/source/graphs.rst: Language not supported
- docs/source/internals.rst: Language not supported
- docs/source/spec.rst: Language not supported
Comments suppressed due to low confidence (1)
dask/_expr.py:277
- [nitpick] Consider renaming the variable 'name' to a more descriptive identifier like 'key' to improve readability in the task creation comprehension within the _layer method.
for i, name in enumerate(self.__dask_keys__())
Hi folks, hope you don't mind I drop a question here - I was just reading the spec docs and noticed they were modified by this PR 6 hours ago 😄 Would love to know more about the changes here. Are these the result of the work that started with https://blog.dask.org/2023/08/25/dask-expr-introduction ? Is there any writeup on why the legacy "plain" dictionary representation of the graph wasn't enough? Does the new thing fundamentally change how indirect dependencies are expressed? |
Asked this in a more appropriate place https://dask.discourse.group/t/background-on-new-graph-specification/3949?u=astrojuanlu |
This is documenting the high level pieces of the expression system internals. I believe everything that is more detailed that this should be in-code documentation.