-
Notifications
You must be signed in to change notification settings - Fork 589
Update sft_datasets.md #1349
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update sft_datasets.md #1349
Conversation
Updated the documentation for sft_datasets with more precise instructions for how to add a new dataset and how to override dataset_name to utilize existing datasets.
@@ -72,6 +72,7 @@ To add a new SFT dataset: | |||
|
|||
1. Subclass {py:class}`~oumi.core.datasets.BaseSftDataset` | |||
2. Implement the {py:meth}`~oumi.core.datasets.BaseSftDataset.transform_conversation` method to define the dataset-specific transformation logic. | |||
3. Register your new dataset to the dataset class by adding it to {py:class}`~oumi.core.datasets.__init__.py` and {py:class}`~oumi.datasets.sft.__init__.py`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does {py:class}
work properly when referencing an __init__.py
file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, good catch. Refactored to {py:mod}
@@ -110,6 +111,18 @@ class CustomSftDataset(BaseSftDataset): | |||
For more advanced SFT dataset implementations, explore the `oumi.datasets` module, which contains implementations of several [open source datasets](https://github.com/oumi-ai/oumi/tree/main/src/oumi/datasets). | |||
``` | |||
|
|||
### Using an Unregistered Dataset Whose Format is Identical to a Registered Dataset | |||
|
|||
Many datasets on HuggingFace share the same format as Oumi registered datasets. It is not necessary to register each dataset explicitly to use it. Instead, you can override the `dataset_name` parameter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we say ... dataset_name_override:
kwargs parameter or somesuch for extra clarity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @xrdaukar, I added a bit more text here to make it clear that users should refer to the example for further clarification.
Corrected Sphinx roles
Clarified description
Description
Updated the documentation for sft_datasets with more precise instructions for how to add a new dataset and how to override dataset_name to utilize existing datasets.
Related issues
Fixes # (issue)
Before submitting
Reviewers
At least one review from a member of
oumi-ai/oumi-staff
is required.