Remove legacy DataFrame implementation by phofl · Pull Request #11606 · dask/dask · GitHub

Remove legacy DataFrame implementation #11606


Merged
merged 36 commits into from
Dec 20, 2024

Conversation

phofl
Collaborator
@phofl phofl commented Dec 17, 2024
  • Closes #xxxx
  • Tests added / passed
  • Passes pre-commit run --all-files

TODOs:

  • test cleanup (followup PR)

Contributor
github-actions bot commented Dec 17, 2024

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0   15 suites ±0   3h 59m 0s ⏱️ +1m 31s
12 909 tests −370   11 445 ✅ −772   1 464 💤 +402   0 ❌ ±0
160 475 runs −4 296   140 143 ✅ −1 698   20 332 💤 −2 598   0 ❌ ±0

Results for commit 4bd27f3. ± Comparison against base commit efd159b.

This pull request removes 377 and adds 7 tests. Note that renamed tests count towards both.
dask.dataframe.io.tests.test_csv ‑ test_getitem_optimization_after_filter
dask.dataframe.io.tests.test_demo ‑ test_make_timeseries_blockwise
dask.dataframe.io.tests.test_hdf ‑ test_to_hdf_link_optimizations
dask.dataframe.io.tests.test_io ‑ test_from_delayed_optimize_fusion
dask.dataframe.io.tests.test_parquet ‑ test_append[fastparquet-False]
dask.dataframe.io.tests.test_parquet ‑ test_append[fastparquet-True]
dask.dataframe.io.tests.test_parquet ‑ test_append_cat_fp[fastparquet]
dask.dataframe.io.tests.test_parquet ‑ test_append_create[fastparquet]
dask.dataframe.io.tests.test_parquet ‑ test_append_different_columns[fastparquet-False]
dask.dataframe.io.tests.test_parquet ‑ test_append_different_columns[fastparquet-True]
…
dask.tests.test_distributed ‑ test_blockwise_dataframe_io[False-False-parquet]
dask.tests.test_distributed ‑ test_blockwise_dataframe_io[False-None-parquet]
dask.tests.test_distributed ‑ test_blockwise_dataframe_io[False-True-parquet]
dask.tests.test_distributed ‑ test_blockwise_dataframe_io[True-False-parquet]
dask.tests.test_distributed ‑ test_blockwise_dataframe_io[True-None-parquet]
dask.tests.test_distributed ‑ test_blockwise_dataframe_io[True-True-parquet]
dask.tests.test_distributed ‑ test_fused_blockwise_dataframe_merge
This pull request skips 405 tests.
dask.array.tests.test_array_core ‑ test_map_blocks_series
dask.bytes.tests.test_http ‑ test_parquet[fastparquet]
dask.bytes.tests.test_s3 ‑ test_parquet[fastparquet-False]
dask.bytes.tests.test_s3 ‑ test_parquet[fastparquet-True]
dask.bytes.tests.test_s3 ‑ test_parquet_append[fastparquet]
dask.bytes.tests.test_s3 ‑ test_parquet_wstoragepars[fastparquet]
dask.dataframe.io.tests.test_csv ‑ test_text_blocks_to_pandas_simple[read_csv-files0]
dask.dataframe.io.tests.test_csv ‑ test_text_blocks_to_pandas_simple[read_fwf-files2]
dask.dataframe.io.tests.test_csv ‑ test_text_blocks_to_pandas_simple[read_table-files1]
dask.dataframe.io.tests.test_io ‑ test_from_delayed_preserves_hlgs
…

♻️ This comment has been updated with latest results.

phofl and others added 2 commits December 20, 2024 11:33
…mentation

# Conflicts:
#	dask/dataframe/io/io.py
#	dask/dataframe/tests/test_dataframe.py
Member
@hendrikmakait hendrikmakait left a comment


Thanks, @phofl. LGTM as long as CI is green. One nit; the rest are reminders to revert changes to the linked GitHub repos.

@@ -11,6 +11,9 @@ test_import () {
# dask[distributed] depends on the latest version of distributed
python -m pip install git+https://github.com/dask/distributed
fi
if [[ $1 =~ "pandas dask-expr" ]]; then
Member


TODO: Revert these changes

Collaborator Author


That one needs to stay, otherwise we get the latest release which will fail

use_dask_expr = dask.config.get("dataframe.query-planning")
if use_dask_expr is False:
raise NotImplementedError("The legacy implementation is no longer supported")
Member


nit: ValueError might be more applicable here.
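A minimal sketch of what the suggested change might look like (a plain dict stands in for dask.config here so the snippet is self-contained; the config key is the one shown in the diff above):

```python
def check_query_planning_config(config: dict) -> None:
    """Raise if the legacy (non-query-planning) mode is requested.

    Sketch of the review suggestion: ValueError signals an invalid
    configuration value rather than missing functionality. `config`
    stands in for dask.config so the example runs standalone.
    """
    if config.get("dataframe.query-planning") is False:
        raise ValueError("The legacy implementation is no longer supported")

# None (auto) and True pass through; only an explicit False raises.
check_query_planning_config({"dataframe.query-planning": None})
try:
    check_query_planning_config({"dataframe.query-planning": False})
except ValueError as exc:
    print(f"raised: {exc}")
```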

@phofl phofl merged commit 3eff29c into dask:main Dec 20, 2024
23 of 24 checks passed
@phofl phofl deleted the remove-legacy-implementation branch December 20, 2024 12:16
@rjzamora
Member
rjzamora commented Dec 20, 2024

Wow, this is a very big PR! Hopefully it won't take too long for us to find/fix all the broken edges in RAPIDS :)

Thanks for working on this @phofl - I know it's challenging to do a migration like this.

rapids-bot bot pushed a commit to rapidsai/rapids-dask-dependency that referenced this pull request Dec 20, 2024
Pin to Dask 2024.12.1, since dask/dask#11606 broke RAPIDS.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Richard (Rick) Zamora (https://github.com/rjzamora)
  - https://github.com/jakirkham

URL: #77
@phofl
Collaborator Author
phofl commented Jan 2, 2025

@rjzamora any chance you can provide a list of things you import from dask.dataframe.foo so that we can consider whether providing a proper downstream-packages API makes sense? I suspect that you are also importing stuff from dask_expr, which will go away soon and thus require another update on your end. Putting all of this in some folder that we know we shouldn't move in the future would make things easier

@rjzamora
Member
rjzamora commented Jan 5, 2025

any chance you can provide a list of things you import from dask.dataframe.foo so that we can consider whether providing a proper downstream-packages API makes sense?

Yes, I can definitely put this together. Thanks for the suggestion!

I suspect that you are also importing stuff from dask_expr, which will go away soon and thus require another update on your end. Putting all of this in some folder that we know we shouldn't move in the future would make things easier

Yes, we definitely import (and patch) a lot of stuff from dask_expr. I was expecting RAPIDS would need to pin and revise after dask-expr was "moved" into dask/dask, but it definitely makes sense for us to share the specific APIs we want to treat as "public".

@rjzamora
Member
rjzamora commented Jan 7, 2025

Just following up... I compiled a rough list of relevant imports within Dask cuDF and Dask-CUDA:

Imports from Dask DataFrame
dd._dask_expr_enabled  # Was using dd.DASK_EXPR_ENABLED before 11606
dd.DataFrame
dd.Series
dd.concat
dd.from_delayed
dd.from_map
dd.from_pandas
dd.read_csv
dd.read_json
dd.read_orc
dd.read_parquet
dd.to_parquet

dd.backends.DataFrameBackendEntrypoint
dd.backends.PandasBackendEntrypoint

dd.core.is_dataframe_like
dd.core._concat

dd.dispatch.categorical_dtype_dispatch
dd.dispatch.concat
dd.dispatch.concat_dispatch
dd.backends.concat_pandas
dd.dispatch.from_pyarrow_table_dispatch
dd.dispatch.get_parallel_type
dd.dispatch.group_split_dispatch
dd.dispatch.grouper_dispatch
dd.dispatch.hash_object_dispatch
dd.dispatch.is_categorical_dtype
dd.dispatch.is_categorical_dtype_dispatch
dd.dispatch.make_meta
dd.dispatch.make_meta_dispatch
dd.dispatch.meta_nonempty
dd.dispatch.partd_encode_dispatch
dd.dispatch.pyarrow_schema_dispatch
dd.dispatch.to_pyarrow_table_dispatch
dd.dispatch.tolist_dispatch
dd.dispatch.union_categoricals_dispatch

dd.groupby.Aggregation

dd.io.csv.make_reader
dd.io.parquet.arrow.ArrowDatasetEngine
dd.io.parquet.arrow._filters_to_expression
dd.io.parquet.core.ParquetFunctionWrapper
dd.io.parquet.create_metadata_file
dd.io.utils._get_pyarrow_dtypes
dd.io.utils._is_local_fs

dd.shuffle.partitioning_index
dd.shuffle.shuffle_group

dd.utils.UNKNOWN_CATEGORIES
dd.utils._nonempty_scalar
dd.utils._scalar_from_dtype
dd.utils.assert_eq
dd.utils.is_dataframe_like
dd.utils.is_series_like
dd.utils.is_index_like
dd.utils.make_meta_obj
dd.utils.make_scalar
dd.utils.pyarrow_strings_enabled

I don't think any of these dask.dataframe imports are a real problem (though I could be missing something).

One problem caused by this PR is that we have been using dd.DASK_EXPR_ENABLED to check if query-planning is enabled in dask.dataframe. Now that DASK_EXPR_ENABLED is gone, we are stuck using _dask_expr_enabled until we decide to pin to the latest Dask version.
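Until a stable public flag exists, downstream code can feature-detect instead of hard-coding either name. A minimal sketch (the function takes the module as an argument and the demo uses stand-in namespaces so it runs without dask installed; real code would pass `dask.dataframe`):

```python
from types import SimpleNamespace


def query_planning_enabled(dd) -> bool:
    """Return True if the given dask.dataframe-like module uses the
    expression-based (query-planning) backend.

    Attribute names follow the discussion above: dd.DASK_EXPR_ENABLED
    existed before this PR, dd._dask_expr_enabled (private) after it.
    """
    # Older versions exposed a public constant...
    flag = getattr(dd, "DASK_EXPR_ENABLED", None)
    if flag is not None:
        return bool(flag)
    # ...newer versions only have the private helper.
    checker = getattr(dd, "_dask_expr_enabled", None)
    if callable(checker):
        return bool(checker())
    # After this PR the legacy implementation is gone entirely.
    return True


# Stand-ins for old and new dask.dataframe modules:
old_dd = SimpleNamespace(DASK_EXPR_ENABLED=True)
new_dd = SimpleNamespace(_dask_expr_enabled=lambda: True)
print(query_planning_enabled(old_dd), query_planning_enabled(new_dd))  # True True
```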

Imports from Dask Expressions
dx.from_dict
dx.get_collection_type
dx.new_collection
dx.DataFrame
dx.FrameBase
dx.Index
dx.Series

dx.io.csv.ReadCSV
dx.io.io.FusedIO
dx.io.io.FusedParquetIO
dx.io.parquet.FragmentWrapper
dx.io.parquet.ReadParquetFSSpec
dx.io.parquet.ReadParquetPyarrowFS

dx._cumulative.CumulativeBlockwise

dx._expr.Elemwise
dx._expr.Expr
dx._expr.RenameAxis
dx._expr.VarColumns

dx._groupby.DecomposableGroupbyAggregation
dx._groupby.GroupBy
dx._groupby.GroupbyAggregation
dx._groupby.SeriesGroupBy
dx._groupby.SingleAggregation

dx._reductions.Reduction
dx._reductions.Var

dx._shuffle._get_divisions
dx._shuffle.Shuffle
dx._shuffle.TaskShuffle

dx._util._convert_to_list
dx._util._raise_if_object_series
dx._util.is_scalar

The dask_expr imports are certainly more of a mess, because most of dask_expr is private, but Dask cuDF needs to use a lot of that private API to deal with differences between pandas and cuDF.
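While those private modules are in flux, a defensive-import helper can keep downstream code working across relocations. A sketch of the pattern (the helper and its `"module:attr"` spec format are hypothetical; the demo uses stdlib names so it runs anywhere, but real code would list e.g. the current and expected post-merge dask_expr paths):

```python
import importlib


def import_first(*candidates):
    """Return the first importable attribute among "module:attr" candidates.

    Sketch of the defensive-import pattern downstream packages use while
    a private API is being relocated, as with the dask_expr paths above.
    """
    errors = []
    for spec in candidates:
        module_name, attr = spec.split(":")
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr)
        except (ImportError, AttributeError) as exc:
            errors.append(exc)
    raise ImportError(f"none of {candidates!r} could be imported: {errors}")


# Demo with stdlib names; the first spec fails, the second succeeds.
OrderedDict = import_first("nonexistent_mod:OrderedDict", "collections:OrderedDict")
print(OrderedDict.__name__)  # OrderedDict
```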

@phofl
Collaborator Author
phofl commented Jan 10, 2025

So dispatch and backends are exposed anyway, so those should be fine.

For everything else, I'd rather not guarantee that the path doesn't change. I added an api.py file in dask.dataframe that we can use to expose the stuff that other libraries need (underscore functions should probably be renamed to something without underscores and get docstrings where missing).

Can you move the stuff over and put up a PR when you adapt things in dask-cudf?

@martindurant
Member

A deprecation cycle or advance warning would have been nice. dask-awkward and dask-histogram are currently broken.
dask-contrib/dask-awkward#566

@phofl
Collaborator Author
phofl commented Jan 22, 2025

The legacy implementation was deprecated in 2024.11.0
