[Datasets] Last-mile preprocessing docs. #20712

clarkzinzow · 2021-11-25T03:07:28Z

Datasets docs for last-mile preprocessing, particularly geared towards ML ingest. This adds groupby, aggregation, and random shuffling examples to the overview page (not present previously), makes our last-mile preprocessing positioning more concrete, and provides preprocessing recipes for a few common transformations.

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

clarkzinzow · 2021-11-25T03:08:57Z

doc/source/data/dataset.rst

+.. code-block:: python
+
+    # Impute missing values with the column mean.
+    b_mean = ds.mean("B")


Note that this will currently fail if column "B" contains nulls, since our aggregations aren't resilient to null values. I'm going to submit a sibling PR that makes our statistical aggregations resilient to nulls by default.
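
For illustration only, here is a plain-Python sketch of what a null-resilient mean and mean imputation could look like. It is not the Ray Datasets implementation and does not use the Datasets API; `null_safe_mean` and `impute_with_mean` are hypothetical helper names, and treating both `None` and NaN as "null" is an assumption.

```python
import math
from typing import List, Optional, Sequence


def _is_null(value: Optional[float]) -> bool:
    # Assumption: both None and NaN count as missing values.
    return value is None or math.isnan(value)


def null_safe_mean(values: Sequence[Optional[float]]) -> Optional[float]:
    """Return the mean of the non-null entries, or None if all are null."""
    present = [v for v in values if not _is_null(v)]
    if not present:
        return None
    return sum(present) / len(present)


def impute_with_mean(values: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Replace each null entry with the mean of the non-null entries."""
    mean = null_safe_mean(values)
    if mean is None:
        # Nothing to impute from; return the column unchanged.
        return list(values)
    return [mean if _is_null(v) else v for v in values]


# Hypothetical column "B" containing nulls.
column_b = [1.0, None, 3.0, float("nan"), 5.0]
b_mean = null_safe_mean(column_b)
imputed = impute_with_mean(column_b)
```

Skipping nulls, rather than raising or propagating them, is the behavior the comment above describes as "resilient to nulls by default"; whether an all-null column should yield `None` is a design choice made here only for the sketch.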


clarkzinzow · 2021-11-25T03:11:43Z

doc/source/data/dataset.rst

+    # -> [6, 2, ..., 4]
+
+    # Scales to terabytes of data with the same simple API.
+    ds = ray.data.read_parquet("s3://ursa-labs-taxi-data")  # open, tabular, NYC taxi dataset


Decided that a real, open dataset would be good in case users want to try it out quickly.


ericl · 2021-11-29T19:45:39Z

doc/source/data/dataset.rst

@@ -327,6 +327,135 @@ By default, transformations are executed using Ray tasks. For transformations th
    # Save the results.
    ds.repartition(1).write_json("s3://bucket/inference-results")

+Last-mile preprocessing


How about we move this into a "Dataset ML preprocessing" page before/after the tensor support one? (the overview page is getting too large, and also should be split up later)

@ericl Can we do that in a future PR? We can do that as a docs polish PR during the GA push. A big motivation for this PR is to get groupby + aggregations represented in the 1.9 docs, particularly in our overview/getting started/user guide page, and I think reorganizing these sections into separate pages is out of scope for that.

ericl · 2021-11-30T04:10:59Z

I'd like to start putting these advanced ops in separate pages. You can link to it from the overview.


clarkzinzow · 2021-11-30T04:36:08Z

@ericl Getting groupby + aggregations into our existing overview docs page before the 1.9 release is a P0 verging on a release blocker, while reorganizing the overview docs page is a P1 in my mind and I don't want that to block this from getting merged. Do you disagree with that prioritization?

I can break the overview docs page into multiple pages tomorrow in a follow-up PR, just don't want to block the release on that.

…puts.

ericl · 2021-11-30T05:34:11Z

@clarkzinzow, I'm suggesting moving the new section to a new page, not re-organizing the docs. I think this will be a strict improvement over the current PR, which makes the overview too long / detailed.

We can revisit a more holistic refactoring separately.

clarkzinzow · 2021-11-30T06:03:57Z

@ericl Done, please review.

ericl

Looks good, but could we call it ML preprocessing instead of last mile? I think that will be the more common search term.

clarkzinzow · 2021-11-30T06:16:39Z

> Looks good, but could we call it ML preprocessing instead of last mile? I think that will be the more common search term.

Sure, no concerns that removing last-mile will confuse our positioning?

fishbone · 2021-11-30T07:22:42Z

Merge it since the doc has been built.

clarkzinzow requested review from ericl and scv119 as code owners November 25, 2021 03:07
