-
Notifications
You must be signed in to change notification settings - Fork 786
feat(sdk): Support MPI-based TrainJobs #2545
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(sdk): Support MPI-based TrainJobs #2545
Conversation
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
b6ad8fa
to
796645a
Compare
# The dict where key is the container image and value its representation. | ||
# Each Trainer representation defines trainer parameters (e.g. type, framework, entrypoint). | ||
# TODO (andreyvelich): We should allow user to overrides the default image names. | ||
ALL_TRAINERS: Dict[str, types.Trainer] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@astefanutti @kubeflow/wg-training-leads @saileshd1402 @Electronic-Waste Please let me know what do you think about this model to detect Runtime Trainers from the image name?
@dataclass | ||
class Runtime: | ||
name: str | ||
trainer: Trainer | ||
pretrained_model: Optional[str] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
During today's call I was talking about this new structure of Runtime object in the SDK.
We need it due to various dependencies, for example entrypoint is different:
trainer/sdk/kubeflow/trainer/utils/utils.py
Line 298 in d1f8d49
entrypoint=runtime.trainer.entrypoint, |
@kubeflow/wg-training-leads @Electronic-Waste @astefanutti @shravan-achar @akshaychitneni @saileshd1402 @deepanker13 Does it look good to you ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK, there are no corresponding fields in TrainingRuntime
for trainer
, right? Why shall we separate it from runtime
? For the mapping?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main goal is to make it clear for ML Users that these parameters are related to the Trainer:
trainer_type: TrainerType
framework: Framework
entrypoint: str
accelerator: str = constants.UNKNOWN
accelerator_count: Union[str, float, int] = constants.UNKNOWN
They don't need to live in the TrainingRuntime
API, since this API is designed for Platform Engineers.
In the future, we can also separate pretrained_model to the initializer
field in the Runtime
class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM. Thanks for the explanation:)
The E2Es are working 🎉 |
raise Exception(f"Runtime doesn't have trainer container {replicated_jobs}") | ||
|
||
# Extract image name from the container image to get appropriate Trainer. | ||
image_name = trainer_container.image.split(":")[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it fail for custom images? Should we get trainer object based on framework.type instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we get trainer object based on framework.type instead?
We can't get the framework type from the TrainingRuntime. As we discussed in the Slack channel, we decided not to add labels to the TrainingRuntime.
However, adding a map (container image -> [trainer_type, entrypoint, framework]
) here is probably not a best practice, since users may want to specify their custom image. We should consider changing the TrainingRuntime APIs and add fields providing infos like trainer_type
, entrypoint
, framework
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it fail for custom images?
I guess, that might fail for images that have port name: https://kubernetes.io/docs/concepts/containers/images/#image-names
However, adding a map (container image -> [trainer_type, entrypoint, framework]) here is probably not a best practice, since users may want to specify their custom image.
I agree with you that that should be on the API level, but we might need to spend time to discuss what is the ideal API in the TrainingRuntime.
I don't want to block torchtune + MPI progress due to this.
@astefanutti @tenzen-y @Electronic-Waste @akshaychitneni @saileshd1402 Do we want to discuss it now or migrate in the future ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to migrate in the future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, adding a map (container image -> [trainer_type, entrypoint, framework]) here is probably not a best practice, since users may want to specify their custom image. We should consider changing the TrainingRuntime APIs and add fields providing infos like trainer_type, entrypoint, framework.
Note that API is not a toolbox. Ideally, we want to obtain information from existing fields as much as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we can use https://github.com/docker/docker-py/. But not sure for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, let's talk about it later on how we should design Runtime APIs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to migrate in the future
I agree, since KubeCon is approaching.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreyvelich Thanks for this. I have a few questions for you:)
@dataclass | ||
class Runtime: | ||
name: str | ||
trainer: Trainer | ||
pretrained_model: Optional[str] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK, there are no corresponding fields in TrainingRuntime
for trainer
, right? Why shall we separate it from runtime
? For the mapping?
# Representation for the TrainJob steps. | ||
@dataclass | ||
class Step: | ||
name: str | ||
status: Optional[str] | ||
pod_name: str | ||
device: str = constants.UNKNOWN | ||
device_count: Union[str, int] = constants.UNKNOWN | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this for config override for pod with trainer.kubeflow.org/trainer-ancestor-step
label?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not always, for example for MPI use-case (e.g. Launcher + Node), Node ReplicatedJob doesn't have this label, but we still need to show users number of nodes on TrainJob.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Node ReplicatedJob doesn't have this label
May I ask why Node ReplicatedJob doesn't have this label? Doesn't it need config override?
we still add pull the data for Steps
What do you mean by "add pull the data for Steps"? Could you please elaborate a bit so that I may understand you better:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I meant that MPI job creates 2 ReplicatedJob: Launcher + Node.
However, when users run: get_job().steps
API they want to see the following TrainJob steps (in case user sets num_nodes=3):
trainer-node-0
trainer-node-1
trainer-node-2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, if I understand correctly, Node
ReplicatedJob does not need config mutation but we need to show it in the steps
?
trainer/manifests/base/runtimes/mpi_distributed.yaml
Lines 40 to 59 in 4b0c294
- name: node | |
template: | |
spec: | |
template: | |
spec: | |
containers: | |
- name: node | |
image: mpioperator/mpi-pi:openmpi | |
securityContext: | |
runAsUser: 1000 | |
command: | |
- /usr/sbin/sshd | |
args: | |
- -De | |
- -f | |
- /home/mpiuser/.sshd_config | |
readinessProbe: | |
tcpSocket: | |
port: 2222 | |
initialDelaySeconds: 5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, node does not require config mutation. All things are done only by launcher.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, this SDK interface tries to provide comprehensive view across all supported frameworks like Torch and Deepspeed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM. Thanks for the info:)
class Trainer: | ||
trainer_type: TrainerType | ||
framework: Framework | ||
entrypoint: str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need entrypoint
? Can't we just specify it in the TrainingRuntime?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't do this always:
command: |
As you can see that allows users to kick-off simple TrainJob to inspect what is "inside the runtime":
train(
runtime=Runtime(name=torch-distributed)
)
It might be useful if users want to see the installed packages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM.
raise Exception(f"Runtime doesn't have trainer container {replicated_jobs}") | ||
|
||
# Extract image name from the container image to get appropriate Trainer. | ||
image_name = trainer_container.image.split(":")[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we get trainer object based on framework.type instead?
We can't get the framework type from the TrainingRuntime. As we discussed in the Slack channel, we decided not to add labels to the TrainingRuntime.
However, adding a map (container image -> [trainer_type, entrypoint, framework]
) here is probably not a best practice, since users may want to specify their custom image. We should consider changing the TrainingRuntime APIs and add fields providing infos like trainer_type
, entrypoint
, framework
.
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
22d8936
to
8e9a642
Compare
Pull Request Test Coverage Report for Build 13975854568Details
💛 - Coveralls |
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
/hold cancel |
@Electronic-Waste @tenzen-y @astefanutti @akshaychitneni If you are happy with the changes, we make merge it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreyvelich LGTM! Thanks for this. Let's move forward.
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@@ -175,7 +175,7 @@ func (m *MPI) EnforceMLPolicy(info *runtime.Info, trainJob *trainer.TrainJob) er | |||
WithMountPath(*info.RuntimePolicy.MLPolicySource.MPI.SSHAuthMountPath), | |||
}..., | |||
) | |||
if ps.Name == constants.JobLauncher && container.Name == constants.ContainerLauncher { | |||
if ps.Name == constants.JobLauncher && container.Name == constants.Node { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if ps.Name == constants.JobLauncher && container.Name == constants.Node { | |
if ps.Name == constants.JobLauncher && (container.Name == constants.Node || container.Name == constants.ContainerLauncher){ |
This should be considered for runLauncherAsNode: false
Ideally, we want to consider whether or not runLauncherAsNode here, but for now, we can just consider Launcher container
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you
Feel free to merge this one
/lgtm
/approve
/hold
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Electronic-Waste, tenzen-y The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Thanks everyone for the review! |
Signed-off-by: Electronic-Waste <2690692950@qq.com>
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* feat(doc): add Runtime API design in KEP-2401. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(doc): fix typo error. Signed-off-by: Electronic-Waste <2690692950@qq.com> * chore(doc): update the implementation history. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(doc): rename model to pretrained_model. Signed-off-by: Electronic-Waste <2690692950@qq.com> * chore(doc): update runtime class according to the review. Signed-off-by: Electronic-Waste <2690692950@qq.com> * chore(doc): update the runtimes design according to PR #2545 Signed-off-by: Electronic-Waste <2690692950@qq.com> * chore(doc): update train() API according to PR #2545 Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(doc): update runtime_ref field. Signed-off-by: Electronic-Waste <2690692950@qq.com> --------- Signed-off-by: Electronic-Waste <2690692950@qq.com>
* feat(sdk): Support MPI-based TrainJobs Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Refactor list_runtimes Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix example Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add Runtime Trainer object Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update for new Runtime object Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Implement get_runtime API Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix Torch example Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Remove un-unsed consts Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update func args Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update SDK constants Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Change to 16Gi Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix container name for MPI Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Keep launcher container for MPI Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* feat(sdk): Support MPI-based TrainJobs Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Refactor list_runtimes Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix example Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add Runtime Trainer object Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update for new Runtime object Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Implement get_runtime API Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix Torch example Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Remove un-unsed consts Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update func args Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update SDK constants Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Change to 16Gi Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix container name for MPI Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Keep launcher container for MPI Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* feat(sdk): Support MPI-based TrainJobs Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Refactor list_runtimes Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix example Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add Runtime Trainer object Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update for new Runtime object Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Implement get_runtime API Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix Torch example Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Remove un-unsed consts Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update func args Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update SDK constants Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Change to 16Gi Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix container name for MPI Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Keep launcher container for MPI Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* feat(sdk): Support MPI-based TrainJobs Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Refactor list_runtimes Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix example Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add Runtime Trainer object Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update for new Runtime object Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Implement get_runtime API Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix Torch example Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Remove un-unsed consts Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update func args Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update SDK constants Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Change to 16Gi Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix container name for MPI Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Keep launcher container for MPI Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
I've made the required changes to support MPI-based TrainJob in Kubeflow SDK.
This is blocked by using the
node
as ReplicatedJob and container name for the trainer nodes.For the
launcher
ReplicatedJob, we will still usenode
as container name since we are going to run launcher as node by default in MPI.I also update the following:
TODO:
mpirun
as entrypointcc @astefanutti @kubeflow/wg-training-leads @Electronic-Waste