Better documentation for expression system #11915

fjetter · 2025-04-28T09:57:03Z

This is documenting the high level pieces of the expression system internals. I believe everything that is more detailed that this should be in-code documentation.

github-actions · 2025-04-28T10:27:51Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

9 files ±0 9 suites ±0 3h 20m 31s ⏱️ + 2m 48s
18 016 tests ±0 16 802 ✅ ±0 1 214 💤 ±0 0 ❌ ±0
161 218 runs ±0 149 112 ✅ - 2 12 106 💤 +2 0 ❌ ±0

Results for commit 1cd7023. ± Comparison against base commit 01a6a37.

♻️ This comment has been updated with latest results.

fjetter · 2025-04-28T12:14:14Z

see #11918 for CI failures

fjetter · 2025-04-28T12:15:44Z

The docs build is failing with


Preparing metadata (setup.py): finished with status 'error'
--
254 | error: subprocess-exited-with-error
255 |  
256 | × python setup.py egg_info did not run successfully.
257 | │ exit code: 1
258 | ╰─> [1 lines of output]
259 | error in numpydoc setup command: "values of 'package_data' dict" must be of type <tuple[str, ...] \| list[str]> (got 'tests')
260 | [end of output]
261 |  
262 | note: This error originates from a subprocess, and is likely not a problem with pip.
263 | error: metadata-generation-failed
264 |  
265 | × Encountered error while generating package metadata.
266 | ╰─> See above for output.

This feels unrelated. I suspect this is related to the setuptools 80 release

mrocklin

Thanks for doing this @fjetter . Some high level comments

mrocklin · 2025-04-28T12:48:16Z

docs/source/expr-system-internals.rst

+
+.. currentmodule:: dask.dataframe
+
+The expression system was originally developed for Dask DataFrames, as implemented in the `dask-expr <https://github.com/dask/dask-expr>`_ project. Early prototypes experimented with `matchpy <https://github.com/HPAC/matchpy>`_, but it was soon replaced with a custom-built system offering simpler and more transparent optimization steps.


I recommend dropping matchpy (not sure that the history is relevant)

Maybe also point out the larger objective beyond just dataframes?

Maybe also good to mention motivations here (high level optimization, more compact representation to move around, etc..)

I don't intend to motivate why this is here. If anything, this document is supposed to explain a couple of gotchas around the system. If one is looking for motivation or high level introduction, we have https://docs.dask.org/en/stable/dataframe-optimizer.html

I think that motivating the expression system as a whole is still a good idea. The linked doc mostly says "we do optimizations and here's what they are". I think something higher level would be useful. Not critical though.

mrocklin · 2025-04-28T12:48:43Z

docs/source/expr-system-internals.rst

+Construction
+^^^^^^^^^^^^
+
+The expression system centers around the Expr class, which represents a computation on a Dask DataFrame. This class is designed for subclassing; each subclass corresponds to a specific computation type (e.g., arithmetic, logical operations, filtering, joins).


This looks duplicative of the paragraph above.

mrocklin · 2025-04-28T12:49:52Z

docs/source/expr-system-internals.rst

+Instead, expression classes use a dataclass-like interface defined by two attributes:
+
+* ``_parameters``: List of parameter names
+
+* ``_defaults``: Dictionary of default values for optional parameters


This stuff seems good, and probably more important than the note about custom initializers. I recommend leading with it, and probably with other structural parts of the expression system (children, expected methods, etc.) before we get into caveats.

I disagree about the importance. I don't intend to document methods in this document

Let me ask a broader question then, if someone wanted to write an Expression powered version of some other API (maybe Bag or something) what would they read in order to learn how to do that? Is there a doc already?

I don't know exactly how that works so it is difficult to write down documentation for that

You don't know how the expression system works, or how bag works?

I think you know how the expression system works. My request is that you write down the information you would want someone to know if they were to start trying to build a new API on top of it. Why might they want to do that? What are the key parts of the interface they should be aware of? etc..

I updated the old examples about collections and added one using Expressions

which also includes an overdue update regarding the task spec

docs/source/expr-system-internals.rst

mrocklin · 2025-04-28T12:52:20Z

docs/source/expr-system-internals.rst

+Optimization Procedure
+----------------------


This section seems more important to me than tokenization and caching. I recommend moving it up.

mrocklin · 2025-04-28T12:54:01Z

docs/source/expr-system-internals.rst

+Fuse
+^^^^
+
+Linear chains of blockwise tasks are combined into a single task, minimizing scheduler overhead.


I think that this could use a brief example. It's not immediately clear to me.

Also, I thought that there was a global pass around deduplication. Is that no longer happening?

I don't understand what you are referring to. Nothing is explicitly deduplicated since everything is implicitly deduplicated by the tokens.

Yeah, not actual deduplication at the node level, but things like reading the same parquet file twice, after we push down columns onto a read_parquet expression. My understanding is that there was a non-traversal optimization pass. Something that looked at things globally (my knowledge is out of date though).

there is no global optimization step. this is all down with simplify_up

Copilot

Pull Request Overview

This PR improves documentation for the expression system internals and updates related symbol exports.

Updated documentation in dask/_expr.py with new docstrings for dask_keys, _layer, and dask_graph.
Added an "Alias" symbol export in dask/task_spec.py.
Updated dask/init.py to import new expression-related symbols.

Reviewed Changes

Copilot reviewed 3 out of 8 changed files in this pull request and generated no comments.

File	Description
dask/task_spec.py	Added import and export for the Alias symbol.
dask/_expr.py	Introduced dask_keys and updated docstrings in _layer and dask_graph.
dask/init.py	Imported new expression-related classes from _expr.

Files not reviewed (5)

docs/source/custom-collections.rst: Language not supported
docs/source/expr-system-internals.rst: Language not supported
docs/source/graphs.rst: Language not supported
docs/source/internals.rst: Language not supported
docs/source/spec.rst: Language not supported

Comments suppressed due to low confidence (1)

dask/_expr.py:277

[nitpick] Consider renaming the variable 'name' to a more descriptive identifier like 'key' to improve readability in the task creation comprehension within the _layer method.

for i, name in enumerate(self.__dask_keys__())

astrojuanlu · 2025-05-05T14:06:03Z

Hi folks, hope you don't mind I drop a question here - I was just reading the spec docs and noticed they were modified by this PR 6 hours ago 😄

Would love to know more about the changes here. Are these the result of the work that started with https://blog.dask.org/2023/08/25/dask-expr-introduction ? Is there any writeup on why the legacy "plain" dictionary representation of the graph wasn't enough? Does the new thing fundamentally change how indirect dependencies are expressed?

astrojuanlu · 2025-05-20T05:44:30Z

Asked this in a more appropriate place https://dask.discourse.group/t/background-on-new-graph-specification/3949?u=astrojuanlu

mrocklin reviewed Apr 28, 2025

View reviewed changes

fjetter force-pushed the document-expr branch from 9c32d2e to 2490ecf Compare April 29, 2025 10:24

fjetter changed the title ~~Internals documentation for expression system~~ Btter documentation for expression system Apr 29, 2025

fjetter changed the title ~~Btter documentation for expression system~~ Better documentation for expression system Apr 29, 2025

fjetter requested a review from Copilot April 29, 2025 10:25

Copilot AI reviewed Apr 29, 2025

View reviewed changes

fjetter added 5 commits April 29, 2025 15:43

wrap up before polish

6f3e784

polish

3277bbd

minor formatting

637b9c7

remove dup

975bee3

expand documentation

7a0924b

fjetter force-pushed the document-expr branch from 2490ecf to 7a0924b Compare April 29, 2025 13:43

Merge branch 'main' into document-expr

1cd7023

fjetter merged commit 4914f2f into dask:main May 5, 2025
24 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Better documentation for expression system #11915

Better documentation for expression system #11915

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!


		.. currentmodule:: dask.dataframe

		The expression system was originally developed for Dask DataFrames, as implemented in the `dask-expr <https://github.com/dask/dask-expr>`_ project. Early prototypes experimented with `matchpy <https://github.com/HPAC/matchpy>`_, but it was soon replaced with a custom-built system offering simpler and more transparent optimization steps.

		Optimization Procedure
		----------------------

Uh oh!

Better documentation for expression system #11915

Better documentation for expression system #11915

Uh oh!

Conversation

Uh oh!

Uh oh!

Unit Test Results

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!