[Datasets] Last-mile preprocessing docs. #20712
doc/source/data/dataset.rst
Outdated
.. code-block:: python

# Impute missing values with the column mean.
b_mean = ds.mean("B")
Note that this will currently fail if nulls exist in column "B", since our aggregations aren't resilient to null values. I'm going to submit a sibling PR that makes our statistical aggregations resilient to nulls by default.
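For illustration, the null-resilient behavior described above can be sketched in plain Python. This is not the Ray Datasets API or the sibling PR's implementation; the helper names `mean_ignoring_nulls` and `impute_with_mean` are hypothetical, and treating both `None` and NaN as null is an assumption.

```python
import math
from typing import List, Optional, Sequence


def _is_null(value: Optional[float]) -> bool:
    # Assumption: both None and NaN count as "null" for this sketch.
    return value is None or math.isnan(value)


def mean_ignoring_nulls(values: Sequence[Optional[float]]) -> Optional[float]:
    """Return the mean of the non-null entries, or None if all are null."""
    present = [v for v in values if not _is_null(v)]
    if not present:
        return None
    return sum(present) / len(present)


def impute_with_mean(values: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Replace null entries with the mean of the non-null entries."""
    mean = mean_ignoring_nulls(values)
    if mean is None:
        # Nothing to impute from; return the column unchanged.
        return list(values)
    return [mean if _is_null(v) else v for v in values]


column_b = [1.0, None, 3.0, float("nan"), 5.0]
imputed = impute_with_mean(column_b)
```

A null-skipping aggregation like this lets the imputation recipe work on the very columns that need it, which a mean that fails on nulls cannot do.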
doc/source/data/dataset.rst
Outdated
# -> [6, 2, ..., 4]

# Scales to terabytes of data with the same simple API.
ds = ray.data.read_parquet("s3://ursa-labs-taxi-data")  # open, tabular, NYC taxi dataset
Decided that a real, open dataset would be good in case users want to try it out quickly.
88f2ae8 to dd1ff43
doc/source/data/dataset.rst
Outdated
@@ -327,6 +327,135 @@ By default, transformations are executed using Ray tasks. For transformations th
# Save the results.
ds.repartition(1).write_json("s3://bucket/inference-results")

Last-mile preprocessing
How about we move this into a "Dataset ML preprocessing" page before/after the tensor support one? (the overview page is getting too large, and also should be split up later)
@ericl Can we do that in a future PR? We can do that as a docs polish PR during the GA push. A big motivation for this PR is to get groupby + aggregations representation in the 1.9 docs, particularly in our overview/getting started/user guide page, and I think reorganizing these sections into separate pages is out of scope for that.
I'd like to start putting these advanced ops in separate pages. You can link to it from the overview.
@ericl Getting groupby + aggregations into our existing overview docs page before the 1.9 release is a P0 verging on a release blocker, while reorganizing the overview docs page is a P1 in my mind, and I don't want that to block this from getting merged. Do you disagree with that prioritization? I can break the overview docs page into multiple pages tomorrow in a follow-up PR; I just don't want to block the release on that.
ceba109 to 2f230e2
@clarkzinzow, I'm suggesting moving the new section to a new page, not reorganizing the docs. I think this will be a strict improvement over the current PR, which makes the overview too long and detailed. We can revisit a more holistic refactoring separately.
@ericl Done, please review.
51ae29c to 654f202
Looks good, but could we call it ML preprocessing instead of last mile? I think that will be the more common search term.
Sure, no concerns that removing last-mile will confuse our positioning?
Merge it since the doc has been built.
Datasets docs for last-mile preprocessing, particularly geared towards ML ingest. This gives groupby, aggregations, and random shuffling examples in the overview page (not present previously), adds some concreteness to our last-mile preprocessing positioning, and provides some preprocessing recipes for a few common transformations.
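As a rough illustration of what the groupby, aggregation, and random shuffling examples compute, here is a plain-Python sketch of the semantics. It does not use the Ray Datasets API, and the helper names `groupby_mean` and `random_shuffle` as well as the sample rows are hypothetical.

```python
import random
from collections import defaultdict
from typing import Any, Dict, Hashable, List, Optional

Row = Dict[str, Any]

# Hypothetical sample rows standing in for a small tabular dataset.
rows: List[Row] = [
    {"A": "x", "B": 1},
    {"A": "y", "B": 2},
    {"A": "x", "B": 3},
    {"A": "y", "B": 6},
]


def groupby_mean(data: List[Row], key: str, value: str) -> Dict[Hashable, float]:
    """Group rows by `key` and average the `value` column within each group."""
    groups: Dict[Hashable, List[float]] = defaultdict(list)
    for row in data:
        groups[row[key]].append(row[value])
    return {k: sum(vs) / len(vs) for k, vs in groups.items()}


def random_shuffle(data: List[Row], seed: Optional[int] = None) -> List[Row]:
    """Return a shuffled copy of the rows, leaving the input untouched."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    return shuffled


means = groupby_mean(rows, "A", "B")
shuffled = random_shuffle(rows, seed=0)
```

The shuffle returns a new list rather than mutating its input, mirroring the way dataset transformations produce a new dataset.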
Checks
scripts/format.sh to lint the changes in this PR.