8000 Update sft_datasets.md by penfever · Pull Request #1349 · oumi-ai/oumi · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Update sft_datasets.md #1349

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Feb 4, 2025
Merged

Conversation

penfever
Copy link
Collaborator
@penfever penfever commented Feb 3, 2025

Description

Updated the documentation for sft_datasets with more precise instructions for how to add a new dataset and how to override dataset_name to utilize existing datasets.

Related issues

Fixes # (issue)

Before submitting

  • This PR only changes documentation. (You can ignore the following checks in that case)
  • Did you read the contributor guideline Pull Request guidelines?
  • Did you link the issue(s) related to this PR in the section above?
  • Did you add / update tests where needed?

Reviewers

At least one review from a member of oumi-ai/oumi-staff is required.

Updated the documentation for sft_datasets with more precise instructions for how to add a new dataset and how to override dataset_name to utilize existing datasets.
@taenin taenin self-requested a review February 3, 2025 18:06
@@ -72,6 +72,7 @@ To add a new SFT dataset:

1. Subclass {py:class}`~oumi.core.datasets.BaseSftDataset`
2. Implement the {py:meth}`~oumi.core.datasets.BaseSftDataset.transform_conversation` method to define the dataset-specific transformation logic.
3. Register your new dataset to the dataset class by adding it to {py:class}`~oumi.core.datasets.__init__.py` and {py:class}`~oumi.datasets.sft.__init__.py`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does {py:class} work properly when referencing an __init__.py file?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, good catch. Refactored to {py:mod}

@@ -110,6 +111,18 @@ class CustomSftDataset(BaseSftDataset):
For more advanced SFT dataset implementations, explore the `oumi.datasets` module, which contains implementations of several [open source datasets](https://github.com/oumi-ai/oumi/tree/main/src/oumi/datasets).
```

### Using an Unregistered Dataset Whose Format is Identical to a Registered Dataset

Many datasets on HuggingFace share the same format as Oumi registered datasets. It is not necessary to register each dataset explicitly to use it. Instead, you can override the `dataset_name` parameter.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we say ... dataset_name_override: kwargs parameter or somesuch for extra clarity?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @xrdaukar, I added a bit more text here to make it clear that users should refer to the example for further clarification.

Corrected Sphinx roles
Clarified description
@oelachqar oelachqar merged commit b121efd into oumi-ai:main Feb 4, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0