Remove legacy DataFrame implementation by phofl · Pull Request #11606 · dask/dask · GitHub

Remove legacy DataFrame implementation #11606


Merged
merged 36 commits into from
Dec 20, 2024

Conversation

phofl
Collaborator
@phofl phofl commented Dec 17, 2024
  • Closes #xxxx
  • Tests added / passed
  • Passes pre-commit run --all-files

TODOs:

  • test cleanup (followup PR)

Contributor
github-actions bot commented Dec 17, 2024

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0   15 suites ±0   3h 59m 0s ⏱️ +1m 31s
12 909 tests −370   11 445 ✅ −772   1 464 💤 +402   0 ❌ ±0
160 475 runs −4 296   140 143 ✅ −1 698   20 332 💤 −2 598   0 ❌ ±0

Results for commit 4bd27f3. ± Comparison against base commit efd159b.

This pull request removes 377 and adds 7 tests. Note that renamed tests count towards both.
dask.dataframe.io.tests.test_csv ‑ test_getitem_optimization_after_filter
dask.dataframe.io.tests.test_demo ‑ test_make_timeseries_blockwise
dask.dataframe.io.tests.test_hdf ‑ test_to_hdf_link_optimizations
dask.dataframe.io.tests.test_io ‑ test_from_delayed_optimize_fusion
dask.dataframe.io.tests.test_parquet ‑ test_append[fastparquet-False]
dask.dataframe.io.tests.test_parquet ‑ test_append[fastparquet-True]
dask.dataframe.io.tests.test_parquet ‑ test_append_cat_fp[fastparquet]
dask.dataframe.io.tests.test_parquet ‑ test_append_create[fastparquet]
dask.dataframe.io.tests.test_parquet ‑ test_append_different_columns[fastparquet-False]
dask.dataframe.io.tests.test_parquet ‑ test_append_different_columns[fastparquet-True]
…
dask.tests.test_distributed ‑ test_blockwise_dataframe_io[False-False-parquet]
dask.tests.test_distributed ‑ test_blockwise_dataframe_io[False-None-parquet]
dask.tests.test_distributed ‑ test_blockwise_dataframe_io[False-True-parquet]
dask.tests.test_distributed ‑ test_blockwise_dataframe_io[True-False-parquet]
dask.tests.test_distributed ‑ test_blockwise_dataframe_io[True-None-parquet]
dask.tests.test_distributed ‑ test_blockwise_dataframe_io[True-True-parquet]
dask.tests.test_distributed ‑ test_fused_blockwise_dataframe_merge
This pull request skips 405 tests.
dask.array.tests.test_array_core ‑ test_map_blocks_series
dask.bytes.tests.test_http ‑ test_parquet[fastparquet]
dask.bytes.tests.test_s3 ‑ test_parquet[fastparquet-False]
dask.bytes.tests.test_s3 ‑ test_parquet[fastparquet-True]
dask.bytes.tests.test_s3 ‑ test_parquet_append[fastparquet]
dask.bytes.tests.test_s3 ‑ test_parquet_wstoragepars[fastparquet]
dask.dataframe.io.tests.test_csv ‑ test_text_blocks_to_pandas_simple[read_csv-files0]
dask.dataframe.io.tests.test_csv ‑ test_text_blocks_to_pandas_simple[read_fwf-files2]
dask.dataframe.io.tests.test_csv ‑ test_text_blocks_to_pandas_simple[read_table-files1]
dask.dataframe.io.tests.test_io ‑ test_from_delayed_preserves_hlgs
…

♻️ This comment has been updated with latest results.

phofl and others added 2 commits December 20, 2024 11:33
…mentation

# Conflicts:
#	dask/dataframe/io/io.py
#	dask/dataframe/tests/test_dataframe.py
Member
@hendrikmakait hendrikmakait left a comment


Thanks, @phofl. LGTM as long as CI is green. One nit; the rest are reminders to revert changes to the linked GitHub repos.

@@ -11,6 +11,9 @@ test_import () {
# dask[distributed] depends on the latest version of distributed
python -m pip install git+https://github.com/dask/distributed
fi
if [[ $1 =~ "pandas dask-expr" ]]; then
Member


TODO: Revert these changes

Collaborator Author


That one needs to stay, otherwise we get the latest release which will fail

use_dask_expr = dask.config.get("dataframe.query-planning")
if use_dask_expr is False:
raise NotImplementedError("The legacy implementation is no longer supported")
Member


nit: ValueError might be more applicable here.
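A minimal sketch of what the suggested change might look like (a plain dict stands in for dask.config here so the snippet is self-contained; the config key is the one shown in the diff above):

```python
def check_query_planning_config(config: dict) -> None:
    """Raise if the legacy (non-query-planning) mode is requested.

    Sketch of the review suggestion: ValueError signals an invalid
    configuration value rather than missing functionality. `config`
    stands in for dask.config so the example runs standalone.
    """
    if config.get("dataframe.query-planning") is False:
        raise ValueError("The legacy implementation is no longer supported")

# None (auto) and True pass through; only an explicit False raises.
check_query_planning_config({"dataframe.query-planning": None})
try:
    check_query_planning_config({"dataframe.query-planning": False})
except ValueError as exc:
    print(f"raised: {exc}")
```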

@phofl phofl merged commit 3eff29c into dask:main Dec 20, 2024
23 of 24 checks passed
@phofl phofl deleted the remove-legacy-implementation branch December 20, 2024 12:16
@rjzamora
Member
rjzamora commented Dec 20, 2024

Wow, this is a very big PR! Hopefully it won't take too long for us to find/fix all the broken edges in RAPIDS :)

Thanks for working on this @phofl - I know it's challenging to do a migration like this.

rapids-bot bot pushed a commit to rapidsai/rapids-dask-dependency that referenced this pull request Dec 20, 2024
Pin to Dask 2024.12.1, since dask/dask#11606 broke RAPIDS.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Richard (Rick) Zamora (https://github.com/rjzamora)
  - https://github.com/jakirkham

URL: #77
@phofl
Collaborator Author
phofl commented Jan 2, 2025

@rjzamora any chance you can provide a list of things you import from dask.dataframe.foo so that we can consider whether providing a proper downstream-packages API makes sense? I suspect that you are also importing stuff from dask_expr, which will go away soon and thus require another update on your end. Putting all of this in some folder that we know we shouldn't move in the future would make things easier

@rjzamora
Member
rjzamora commented Jan 5, 2025

any chance you can provide a list of things you import from dask.dataframe.foo so that we can consider whether providing a proper downstream-packages API makes sense?

Yes, I can definitely put this together. Thanks for the suggestion!

I suspect that you are also importing stuff from dask_expr, which will go away soon and thus require another update on your end. Putting all of this in some folder that we know we shouldn't move in the future would make things easier

Yes, we definitely import (and patch) a lot of stuff from dask_expr. I was expecting RAPIDS would need to pin and revise after dask-expr was "moved" into dask/dask, but it definitely makes sense for us to share the specific APIs we want to treat as "public".

@rjzamora
Member
rjzamora commented Jan 7, 2025

Just following up... I compiled a rough list of relevant imports within Dask cuDF and Dask-CUDA:

Imports from Dask DataFrame
dd._dask_expr_enabled  # Was using dd.DASK_EXPR_ENABLED before 11606
dd.DataFrame
dd.Series
dd.concat
dd.from_delayed
dd.from_map
dd.from_pandas
dd.read_csv
dd.read_json
dd.read_orc
dd.read_parquet
dd.to_parquet

dd.backends.DataFrameBackendEntrypoint
dd.backends.PandasBackendEntrypoint

dd.core.is_dataframe_like
dd.core._concat

dd.dispatch.categorical_dtype_dispatch
dd.dispatch.concat
dd.dispatch.concat_dispatch
dd.backends.concat_pandas
dd.dispatch.from_pyarrow_table_dispatch
dd.dispatch.get_parallel_type
dd.dispatch.group_split_dispatch
dd.dispatch.grouper_dispatch
dd.dispatch.hash_object_dispatch
dd.dispatch.is_categorical_dtype
dd.dispatch.is_categorical_dtype_dispatch
dd.dispatch.make_meta
dd.dispatch.make_meta_dispatch
dd.dispatch.meta_nonempty
dd.dispatch.partd_encode_dispatch
dd.dispatch.pyarrow_schema_dispatch
dd.dispatch.to_pyarrow_table_dispatch
dd.dispatch.tolist_dispatch
dd.dispatch.union_categoricals_dispatch

dd.groupby.Aggregation

dd.io.csv.make_reader
dd.io.parquet.arrow.ArrowDatasetEngine
dd.io.parquet.arrow._filters_to_expression
dd.io.parquet.core.ParquetFunctionWrapper
dd.io.parquet.create_metadata_file
dd.io.utils._get_pyarrow_dtypes
dd.io.utils._is_local_fs

dd.shuffle.partitioning_index
dd.shuffle.shuffle_group

dd.utils.UNKNOWN_CATEGORIES
dd.utils._nonempty_scalar
dd.utils._scalar_from_dtype
dd.utils.assert_eq
dd.utils.is_dataframe_like
dd.utils.is_series_like
dd.utils.is_index_like
dd.utils.make_meta_obj
dd.utils.make_scalar
dd.utils.pyarrow_strings_enabled

I don't think any of these dask.dataframe imports are a real problem (though I could be missing something).

One problem caused by this PR is that we have been using dd.DASK_EXPR_ENABLED to check if query-planning is enabled in dask.dataframe. Now that DASK_EXPR_ENABLED is gone, we are stuck using _dask_expr_enabled until we decide to pin to the latest Dask version.
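Until a stable public flag exists, downstream code can feature-detect instead of hard-coding either name. A minimal sketch (the function takes the module as an argument and the demo uses stand-in namespaces so it runs without dask installed; real code would pass `dask.dataframe`):

```python
from types import SimpleNamespace


def query_planning_enabled(dd) -> bool:
    """Return True if the given dask.dataframe-like module uses the
    expression-based (query-planning) backend.

    Attribute names follow the discussion above: dd.DASK_EXPR_ENABLED
    existed before this PR, dd._dask_expr_enabled (private) after it.
    """
    # Older versions exposed a public constant...
    flag = getattr(dd, "DASK_EXPR_ENABLED", None)
    if flag is not None:
        return bool(flag)
    # ...newer versions only have the private helper.
    checker = getattr(dd, "_dask_expr_enabled", None)
    if callable(checker):
        return bool(checker())
    # After this PR the legacy implementation is gone entirely.
    return True


# Stand-ins for old and new dask.dataframe modules:
old_dd = SimpleNamespace(DASK_EXPR_ENABLED=True)
new_dd = SimpleNamespace(_dask_expr_enabled=lambda: True)
print(query_planning_enabled(old_dd), query_planning_enabled(new_dd))  # True True
```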

Imports from Dask Expressions
dx.from_dict
dx.get_collection_type
dx.new_collection
dx.DataFrame
dx.FrameBase
dx.Index
dx.Series

dx.io.csv.ReadCSV
dx.io.io.FusedIO
dx.io.io.FusedParquetIO
dx.io.parquet.FragmentWrapper
dx.io.parquet.ReadParquetFSSpec
dx.io.parquet.ReadParquetPyarrowFS

dx._cumulative.CumulativeBlockwise

dx._expr.Elemwise
dx._expr.Expr
dx._expr.RenameAxis
dx._expr.VarColumns

dx._groupby.DecomposableGroupbyAggregation
dx._groupby.GroupBy
dx._groupby.GroupbyAggregation
dx._groupby.SeriesGroupBy
dx._groupby.SingleAggregation

dx._reductions.Reduction
dx._reductions.Var

dx._shuffle._get_divisions
dx._shuffle.Shuffle
dx._shuffle.TaskShuffle

dx._util._convert_to_list
dx._util._raise_if_object_series
dx._util.is_scalar

The dask_expr imports are certainly more of a mess, because most of dask_expr is private, but Dask cuDF needs to use a lot of that private API to deal with differences between pandas and cuDF.
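While those private modules are in flux, a defensive-import helper can keep downstream code working across relocations. A sketch of the pattern (the helper and its `"module:attr"` spec format are hypothetical; the demo uses stdlib names so it runs anywhere, but real code would list e.g. the current and expected post-merge dask_expr paths):

```python
import importlib


def import_first(*candidates):
    """Return the first importable attribute among "module:attr" candidates.

    Sketch of the defensive-import pattern downstream packages use while
    a private API is being relocated, as with the dask_expr paths above.
    """
    errors = []
    for spec in candidates:
        module_name, attr = spec.split(":")
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr)
        except (ImportError, AttributeError) as exc:
            errors.append(exc)
    raise ImportError(f"none of {candidates!r} could be imported: {errors}")


# Demo with stdlib names; the first spec fails, the second succeeds.
OrderedDict = import_first("nonexistent_mod:OrderedDict", "collections:OrderedDict")
print(OrderedDict.__name__)  # OrderedDict
```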

@phofl
Collaborator Author
phofl commented Jan 10, 2025

So dispatch and backends are exposed anyway, so those should be fine.

For everything else, I'd rather not guarantee that the path doesn't change. I added an api.py file in dask.dataframe that we can use to expose the stuff that other libraries need (underscore functions should probably be renamed to something without underscores and get docstrings where missing).

Can you move the stuff over and put up a PR when you adapt things in dask-cudf?

@martindurant
Member

A deprecation cycle or advance warning would have been nice. dask-awkward and dask-histogram are currently broken.
dask-contrib/dask-awkward#566

@phofl
Collaborator Author
phofl commented Jan 22, 2025

The legacy implementation was deprecated in 2024.11.0
