[Datasets] Last-mile preprocessing docs. by clarkzinzow · Pull Request #20712 · ray-project/ray · GitHub

[Datasets] Last-mile preprocessing docs. #20712


Merged

Conversation

clarkzinzow
Contributor

Datasets docs for last-mile preprocessing, particularly geared towards ML ingest. This gives groupby, aggregations, and random shuffling examples in the overview page (not present previously), adds some concreteness to our last-mile preprocessing positioning, and provides some preprocessing recipes for a few common transformations.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

.. code-block:: python

    # Impute missing values with the column mean.
    b_mean = ds.mean("B")
Contributor Author


Note that this will currently fail if nulls exist in column "B", since our aggregations aren't resilient to null values. I'm going to submit a sibling PR that makes our statistical aggregation resilient to nulls by default.
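To make the null-handling concern concrete, here is a minimal pure-Python sketch of a null-resilient mean followed by mean imputation. It is not the Ray Datasets implementation and not the sibling PR's code; the helper names `mean_ignoring_nulls` and `impute_with_mean` are hypothetical and exist only for illustration.

```python
import math
from typing import Iterable, List, Optional


def _is_null(value: Optional[float]) -> bool:
    """Treat both None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def mean_ignoring_nulls(values: Iterable[Optional[float]]) -> Optional[float]:
    """Return the mean of the non-null entries, or None if every entry is null."""
    total = 0.0
    count = 0
    for value in values:
        if _is_null(value):
            continue  # Skip missing entries instead of failing on them.
        total += value
        count += 1
    return total / count if count else None


def impute_with_mean(values: Iterable[Optional[float]]) -> List[Optional[float]]:
    """Replace null entries with the mean of the non-null entries."""
    materialized = list(values)
    mean = mean_ignoring_nulls(materialized)
    if mean is None:
        # Nothing to impute from; return the column unchanged.
        return materialized
    return [mean if _is_null(value) else value for value in materialized]
```

The key design choice in this sketch is that nulls are excluded from both the sum and the count, so the aggregation no longer fails (or silently skews) when column "B" contains missing values.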

# -> [6, 2, ..., 4]

# Scales to terabytes of data with the same simple API.
ds = ray.data.read_parquet("s3://ursa-labs-taxi-data") # open, tabular, NYC taxi dataset
Contributor Author


I decided that a real, open dataset would be good in case users want to quickly try it out.

@clarkzinzow clarkzinzow force-pushed the datasets/docs/last-mile-preprocessing branch from 88f2ae8 to dd1ff43 Compare November 25, 2021 03:20
@@ -327,6 +327,135 @@ By default, transformations are executed using Ray tasks. For transformations th
# Save the results.
ds.repartition(1).write_json("s3://bucket/inference-results")

Last-mile preprocessing
Contributor


How about we move this into a "Dataset ML preprocessing" page before/after the tensor support one? (the overview page is getting too large, and also should be split up later)

Contributor Author


@ericl Can we do that in a future PR? We can do that as a docs polish PR during the GA push. A big motivation for this PR is to get groupby + aggregations representation in the 1.9 docs, particularly in our overview/getting started/user guide page, and I think reorganizing these sections into separate pages is out of scope for that.

@ericl ericl self-assigned this Nov 29, 2021
@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 29, 2021
@ericl
Contributor
ericl commented Nov 30, 2021 via email

@clarkzinzow
Contributor Author
clarkzinzow commented Nov 30, 2021

@ericl Getting groupby + aggregations into our existing overview docs page before the 1.9 release is a P0 verging on a release blocker, while reorganizing the overview docs page is a P1 in my mind and I don't want that to block this from getting merged. Do you disagree with that prioritization?

I can break the overview docs page into multiple pages tomorrow in a follow-up PR, just don't want to block the release on that.

@clarkzinzow clarkzinzow removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 30, 2021
@clarkzinzow clarkzinzow assigned ericl, scv119 and worldveil and unassigned ericl Nov 30, 2021
@clarkzinzow clarkzinzow force-pushed the datasets/docs/last-mile-preprocessing branch from ceba109 to 2f230e2 Compare November 30, 2021 05:25
@ericl
Contributor
ericl commented Nov 30, 2021

@clarkzinzow, I'm suggesting moving the new section to a new page, not reorganizing the docs. I think this will be a strict improvement over the current PR, which makes the overview too long and detailed.

We can revisit a more holistic refactoring separately.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 30, 2021
@clarkzinzow
Contributor Author

@ericl Done, please review.

@clarkzinzow clarkzinzow force-pushed the datasets/docs/last-mile-preprocessing branch from 51ae29c to 654f202 Compare November 30, 2021 06:05
Contributor
@ericl ericl left a comment


Looks good, but could we call it ML preprocessing instead of last mile? I think that will be the more common search term.

@clarkzinzow
Contributor Author

> Looks good, but could we call it ML preprocessing instead of last mile? I think that will be the more common search term.

Sure, no concerns that removing last-mile will confuse our positioning?

@fishbone
Contributor

Merge it since the doc has been built.

@fishbone fishbone merged commit b872fda into ray-project:master Nov 30, 2021
fishbone pushed a commit that referenced this pull request Nov 30, 2021
Datasets docs for last-mile preprocessing, particularly geared towards ML ingest. This gives groupby, aggregations, and random shuffling examples in the overview page (not present previously), adds some concreteness to our last-mile preprocessing positioning, and provides some preprocessing recipes for a few common transformations.