8000 Better documentation for expression system by fjetter · Pull Request #11915 · dask/dask · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Better documentation for expression system #11915

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
May 5, 2025
Merged

Conversation

fjetter
Copy link
Member
@fjetter fjetter commented Apr 28, 2025

This is documenting the high level pieces of the expression system internals. I believe everything that is more detailed that this should be in-code documentation.

Copy link
Contributor
github-actions bot commented Apr 28, 2025

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

      9 files  ±0        9 suites  ±0   3h 20m 31s ⏱️ + 2m 48s
 18 016 tests ±0   16 802 ✅ ±0   1 214 💤 ±0  0 ❌ ±0 
161 218 runs  ±0  149 112 ✅  - 2  12 106 💤 +2  0 ❌ ±0 

Results for commit 1cd7023. ± Comparison against base commit 01a6a37.

♻️ This comment has been updated with latest results.

@fjetter
Copy link
Member Author
fjetter commented Apr 28, 2025

see #11918 for CI failures

@fjetter
Copy link
Member Author
fjetter commented Apr 28, 2025

The docs build is failing with


Preparing metadata (setup.py): finished with status 'error'
--
254 | error: subprocess-exited-with-error
255 |  
256 | × python setup.py egg_info did not run successfully.
257 | │ exit code: 1
258 | ╰─> [1 lines of output]
259 | error in numpydoc setup command: "values of 'package_data' dict" must be of type <tuple[str, ...] \| list[str]> (got 'tests')
260 | [end of output]
261 |  
262 | note: This error originates from a subprocess, and is likely not a problem with pip.
263 | error: metadata-generation-failed
264 |  
265 | × Encountered error while generating package metadata.
266 | ╰─> See above for output.


This feels unrelated. I suspect this is related to the setuptools 80 release

Copy link
Member
@mrocklin mrocklin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this @fjetter . Some high level comments


.. currentmodule:: dask.dataframe

The expression system was originally developed for Dask DataFrames, as implemented in the `dask-expr <https://github.com/dask/dask-expr>`_ project. Early prototypes experimented with `matchpy <https://github.com/HPAC/matchpy>`_, but it was soon replaced with a custom-built system offering simpler and more transparent optimization steps.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend dropping matchpy (not sure that the history is relevant)

Maybe also point out the larger objective beyond just dataframes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also good to mention motivations here (high level optimization, more compact representation to move around, etc..)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't intend to motivate why this is here. If anything, this document is supposed to explain a couple of gotchas around the system. If one is looking for motivation or high level introduction, we have https://docs.dask.org/en/stable/dataframe-optimizer.html

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that motivating the expression system as a whole is still a good idea. The linked doc mostly says "we do optimizations and here's what they are". I think something higher level would be useful. Not critical though.

Construction
^^^^^^^^^^^^

The expression system centers around the Expr class, which represents a computation on a Dask DataFrame. This class is designed for subclassing; each subclass corresponds to a specific computation type (e.g., arithmetic, logical operations, filtering, joins).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks duplicative of the paragraph above.

Comment on lines +26 to +31
Instead, expression classes use a dataclass-like interface defined by two attributes:

* ``_parameters``: List of parameter names

* ``_defaults``: Dictionary of default values for optional parameters
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This stuff seems good, and probably more important than the note about custom initializers. I recommend leading with it, and probably with other structural parts of the expression system (children, expected methods, etc.) before we get into caveats.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree about the importance. I don't intend to document methods in this document

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me ask a broader question then, if someone wanted to write an Expression powered version of some other API (maybe Bag or something) what would they read in order to learn how to do that? Is there a doc already?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know exactly how that works so it is difficult to write down documentation for that

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't know how the expression system works, or how bag works?

I think you know how the expression system works. My request is that you write down the information you would want someone to know if they were to start trying to build a new API on top of it. Why might they want to do that? What are the key parts of the interface they should be aware of? etc..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the old examples about collections and added one using Expressions

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which also includes an overdue update regarding the task spec

Comment on lines +91 to +88
Optimization Procedure
----------------------
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section seems more important to me than tokenization and caching. I recommend moving it up.

Fuse
^^^^

Linear chains of blockwise tasks are combined into a single task, minimizing scheduler overhead.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this could use a brief example. It's not immediately clear to me.

Also, I thought that there was a global pass around deduplication. Is that no longer happening?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what you are referring to. Nothing is explicitly deduplicated since everything is implicitly deduplicated by the tokens.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, not actual deduplication at the node level, but things like reading the same parquet file twice, after we push down columns onto a read_parquet expression. My understanding is that there was a non-traversal optimization pass. Something that looked at things globally (my knowledge is out of date though).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no global optimization step. this is all down with simplify_up

@fjetter fjetter changed the title Internals documentation for expression system Btter documentation for expression system Apr 29, 2025
@fjetter fjetter changed the title Btter documentation for expression system Better documentation for expression system Apr 29, 2025
@fjetter fjetter requested a review from Copilot April 29, 2025 10:25
Copy link
@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR improves documentation for the expression system internals and updates related symbol exports.

  • Updated documentation in dask/_expr.py with new docstrings for dask_keys, _layer, and dask_graph.
  • Added an "Alias" symbol export in dask/task_spec.py.
  • Updated dask/init.py to import new expression-related symbols.

Reviewed Changes

Copilot reviewed 3 out of 8 changed files in this pull request and generated no comments.

File Description
dask/task_spec.py Added import and export for the Alias symbol.
dask/_expr.py Introduced dask_keys and updated docstrings in _layer and dask_graph.
dask/init.py Imported new expression-related classes from _expr.
Files not reviewed (5)
  • docs/source/custom-collections.rst: Language not supported
  • docs/source/expr-system-internals.rst: Language not supported
  • docs/source/graphs.rst: Language not supported
  • docs/source/internals.rst: Language not supported
  • docs/source/spec.rst: Language not supported
Comments suppressed due to low confidence (1)

dask/_expr.py:277

  • [nitpick] Consider renaming the variable 'name' to a more descriptive identifier like 'key' to improve readability in the task creation comprehension within the _layer method.
for i, name in enumerate(self.__dask_keys__())

@fjetter fjetter merged commit 4914f2f into dask:main May 5, 2025
24 checks passed
@astrojuanlu
Copy link

Hi folks, hope you don't mind I drop a question here - I was just reading the spec docs and noticed they were modified by this PR 6 hours ago 😄

Would love to know more about the changes here. Are these the result of the work that started with https://blog.dask.org/2023/08/25/dask-expr-introduction ? Is there any writeup on why the legacy "plain" dictionary representation of the graph wasn't enough? Does the new thing fundamentally change how indirect dependencies are expressed?

@astrojuanlu
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3FAD
Development

Successfully merging this pull request may close these issues.

3 participants
0