fix(manifests): Update manifests to enable LLM fine-tuning workflow with CTR and TrainJob yaml files #2669
base: master
Conversation
Signed-off-by: Electronic-Waste <2690692950@qq.com>
Signed-off-by: Electronic-Waste <2690692950@qq.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Pull Request Test Coverage Report for Build 15682869133 (Details)
💛 - Coveralls
/assign @kubeflow/wg-training-leads @astefanutti
@@ -72,12 +72,14 @@ spec:
command:
- tune
- run
- --rdzv_endpoint=localhost:29500
Made this change because distributed training needs to run in distributed mode, as we discussed in #2587 (comment).
trainJob.Name, constants.Node, trainJob.Name, constants.ContainerTrainerPort,
Don't we automatically inject the rdzv endpoint in the torch plugin?
Currently, mutation is only enabled when the TrainJob is applied with the SDK.
If users apply a TrainJob yaml file themselves, they generally expect to have full control over the overridden items.
Do we also want to enable it when applying with a yaml file? @andreyvelich
Signed-off-by: Electronic-Waste <2690692950@qq.com>
Signed-off-by: Electronic-Waste <2690692950@qq.com>
Thank you @Electronic-Waste!
/lgtm
/assign @tenzen-y @astefanutti
@@ -72,21 +72,21 @@ spec:
command:
- tune
- run
- --rdzv_endpoint=localhost:29500
Should the logic in the torch plugin that always adds the --rdzv_endpoint for torchtune be amended to only add it if it's not present in either the runtime or the overrides?
In my current implementation, there are two ways to apply a TrainJob:
- Using a yaml file: we do not mutate the command/args. Users specify all commands and args themselves (but the cmd/args set in the Runtime are used if users do not specify any).
- Using the SDK: we mutate the command/args for users.
I think that, for applying with the SDK, we need to always mutate --rdzv_endpoint, since users might enable multi-node training. In that case we use the headless service, whose URL is generated from the TrainJob name, and we can't know the TrainJob name in the CTR (ClusterTrainingRuntime) before a TrainJob instance is created.
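For illustration, a rough sketch of what the mutated trainer command could look like for a two-node TrainJob named llm-fine-tune; the headless service naming scheme, the port, and the torchtune recipe/config names are assumptions, not the plugin's literal output:

# Hypothetical result of the SDK-side mutation for a 2-node TrainJob.
# The rendezvous endpoint targets the TrainJob's headless service instead of
# localhost; the exact service name format and port are assumptions.
command:
  - tune
  - run
  - --rdzv_endpoint=llm-fine-tune-node-0-0.llm-fine-tune:29500
  - --nnodes=2
  - --nproc_per_node=1
  - full_finetune_distributed          # illustrative torchtune recipe
  - --config=llama3_2/1B_full          # illustrative torchtune config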
But I think we should unify these two ways of applying in the future, because:
- Robustness: if users forget to add args like --rdzv_endpoint when overriding some of the args in a TrainJob yaml file, we can generate a proper value for them, providing some degree of fault tolerance (see the sketch after this list).
- Modularized implementation: it will make the code easier to understand and maintain.
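To make the robustness point concrete, here is a hedged sketch of that case; the API group/version and field names follow the v2 alpha TrainJob API as I understand it, and the job name, runtime name, and args are purely illustrative:

# Hypothetical TrainJob applied from a yaml file. The trainer args are
# overridden but --rdzv_endpoint is omitted; under the proposed unified
# behavior, the torch plugin would backfill a proper endpoint.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: llm-fine-tune
spec:
  runtimeRef:
    name: torchtune-llama3.2-1b              # illustrative runtime name
  trainer:
    numNodes: 2
    args:
      - dataset.data_dir=/workspace/dataset  # illustrative torchtune override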
@@ -25,7 +25,7 @@ spec:
- name: STORAGE_URI
value: hf://tatsu-lab/alpaca
volumeMounts:
- mountPath: /workspace/dataset
- mountPath: /workspace
However, if we change the mount path to /workspace, it will conflict with the initializer:
trainer/cmd/initializers/dataset/Dockerfile
Lines 1 to 12 in b71a690
FROM python:3.11-alpine
WORKDIR /workspace
# Copy the required Python modules.
COPY cmd/initializers/dataset/requirements.txt .
COPY pkg/initializers pkg/initializers
# Install the needed packages.
RUN pip install -r requirements.txt
ENTRYPOINT ["python", "-m", "pkg.initializers.dataset"]
because the mount will shadow the /workspace/pkg dir and the initializer will report the error:
/usr/local/bin/python: Error while finding module specification for 'pkg.initializers.dataset' (ModuleNotFoundError: No module named 'pkg')
I'm thinking of changing the workdir to / and mounting the PVC to /workspace to avoid the conflict.
WDYT? @andreyvelich @astefanutti
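For reference, a minimal sketch of how that could look on the manifest side, assuming the Dockerfile's WORKDIR becomes / (so the copied pkg/ modules end up under /pkg and are no longer shadowed by the mount); the container, volume, and claim names are illustrative:

# Sketch of the proposed layout. The PVC is still mounted at /workspace,
# but with WORKDIR "/" the initializer's Python sources live under /pkg,
# so the mount no longer hides them.
containers:
  - name: dataset-initializer       # illustrative container name
    volumeMounts:
      - name: workspace             # illustrative volume name
        mountPath: /workspace
volumes:
  - name: workspace
    persistentVolumeClaim:
      claimName: workspace          # illustrative claim name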
What this PR does / why we need it:
This PR fixes some errors that occurred in the LLM fine-tuning workflow when it is run by applying TrainJob yaml files:
Results
/cc @kubeflow/wg-training-leads @astefanutti
/milestone v2.0
Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #
Checklist: