8000 feat(sdk): Support MPI-based TrainJobs by andreyvelich · Pull Request #2545 · kubeflow/trainer · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

feat(sdk): Support MPI-based TrainJobs #2545

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Mar 20, 2025

Conversation

andreyvelich
Copy link
Member
@andreyvelich andreyvelich commented Mar 19, 2025

I've made the required changes to support MPI-based TrainJob in Kubeflow SDK.
This is blocked by using the node as ReplicatedJob and container name for the trainer nodes.
For the launcher ReplicatedJob, we will still use node as container name since we are going to run launcher as node by default in MPI.

I also update the following:

  • Remove the phase from the runtimes
  • Rename components to steps
  • Updated the Runtime object structure.

TODO:

  • Support mpirun as entrypoint

cc @astefanutti @kubeflow/wg-training-leads @Electronic-Waste

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@coveralls
Copy link
coveralls commented Mar 19, 2025

Coverage Status

coverage: 62.757%. remained the same
when pulling 2c3e4df on andreyvelich:sdk-ancestor-updates
into 5c89faa on kubeflow:master.

@google-oss-prow google-oss-prow bot added size/XL and removed size/L labels Mar 19, 2025
@andreyvelich andreyvelich force-pushed the sdk-ancestor-updates branch from b6ad8fa to 796645a Compare March 19, 2025 12:10
# The dict where key is the container image and value its representation.
# Each Trainer representation defines trainer parameters (e.g. type, framework, entrypoint).
# TODO (andreyvelich): We should allow user to overrides the default image names.
ALL_TRAINERS: Dict[str, types.Trainer] = {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@astefanutti @kubeflow/wg-training-leads @saileshd1402 @Electronic-Waste Please let me know what do you think about this model to detect Runtime Trainers from the image name?

Comment on lines +77 to +81
@dataclass
class Runtime:
name: str
trainer: Trainer
pretrained_model: Optional[str] = None
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During today's call I was talking about this new structure of Runtime object in the SDK.
We need it due to various dependencies, for example entrypoint is different:

entrypoint=runtime.trainer.entrypoint,

@kubeflow/wg-training-leads @Electronic-Waste @astefanutti @shravan-achar @akshaychitneni @saileshd1402 @deepanker13 Does it look good to you ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, there are no corresponding fields in TrainingRuntime for trainer, right? Why shall we separate it from runtime? For the mapping?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main goal is to make it clear for ML Users that these parameters are related to the Trainer:

    trainer_type: TrainerType
    framework: Framework
    entrypoint: str
    accelerator: str = constants.UNKNOWN
    accelerator_count: Union[str, float, int] = constants.UNKNOWN

They don't need to live in the TrainingRuntime API, since this API is designed for Platform Engineers.

In the future, we can also separate pretrained_model to the initializer field in the Runtime class.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM. Thanks for the explanation:)

@andreyvelich
Copy link
Member Author

The E2Es are working 🎉
Please take a look when you can.
/hold blocked by: #2548

@andreyvelich andreyvelich changed the title [WIP] feat(sdk): Support MPI-based TrainJobs feat(sdk): Support MPI-based TrainJobs Mar 19, 2025
raise Exception(f"Runtime doesn't have trainer container {replicated_jobs}")

# Extract image name from the container image to get appropriate Trainer.
image_name = trainer_container.image.split(":")[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it fail for custom images? Should we get trainer object based on framework.type instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we get trainer object based on framework.type instead?

We can't get the framework type from the TrainingRuntime. As we discussed in the Slack channel, we decided not to add labels to the TrainingRuntime.

However, adding a map (container image -> [trainer_type, entrypoint, framework]) here is probably not a best practice, since users may want to specify their custom image. We should consider changing the TrainingRuntime APIs and add fields providing infos like trainer_type, entrypoint, framework.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it fail for custom images?

I guess, that might fail for images that have port name: https://kubernetes.io/docs/concepts/containers/images/#image-names

However, adding a map (container image -> [trainer_type, entrypoint, framework]) here is probably not a best practice, since users may want to specify their custom image.

I agree with you that that should be on the API level, but we might need to spend time to discuss what is the ideal API in the TrainingRuntime.
I don't want to block torchtune + MPI progress due to this.

@astefanutti @tenzen-y @Electronic-Waste @akshaychitneni @saileshd1402 Do we want to discuss it now or migrate in the future ?

Copy link
Member
@tenzen-y tenzen-y Mar 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to migrate in the future

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, adding a map (container image -> [trainer_type, entrypoint, framework]) here is probably not a best practice, since users may want to specify their custom image. We should consider changing the TrainingRuntime APIs and add fields providing infos like trainer_type, entrypoint, framework.

Note that API is not a toolbox. Ideally, we want to obtain information from existing fields as much as possible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we can use https://github.com/docker/docker-py/. But not sure for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let's talk about it later on how we should design Runtime APIs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to migrate in the future

I agree, since KubeCon is approaching.

Copy link
Member
@Electronic-Waste Electronic-Waste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich Thanks for this. I have a few questions for you:)

Comment on lines +77 to +81
@dataclass
class Runtime:
name: str
trainer: Trainer
pretrained_model: Optional[str] = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, there are no corresponding fields in TrainingRuntime for trainer, right? Why shall we separate it from runtime? For the mapping?

Comment on lines +84 to +92
# Representation for the TrainJob steps.
@dataclass
class Step:
name: str
status: Optional[str]
pod_name: str
device: str = constants.UNKNOWN
device_count: Union[str, int] = constants.UNKNOWN

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this for config override for pod with trainer.kubeflow.org/trainer-ancestor-step label?

Copy link
Member Author
@andreyvelich andreyvelich Mar 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not always, for example for MPI use-case (e.g. Launcher + Node), Node ReplicatedJob doesn't have this label, but we still need to show users number of nodes on TrainJob.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Node ReplicatedJob doesn't have this label

May I ask why Node ReplicatedJob doesn't have this label? Doesn't it need config override?

we still add pull the data for Steps

What do you mean by "add pull the data for Steps"? Could you please elaborate a bit so that I may understand you better:)

Copy link
Member Author
@andreyvelich andreyvelich Mar 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I meant that MPI job creates 2 ReplicatedJob: Launcher + Node.
However, when users run: get_job().steps API they want to see the following TrainJob steps (in case user sets num_nodes=3):

trainer-node-0
trainer-node-1
trainer-node-2

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, if I understand correctly, Node ReplicatedJob does not need config mutation but we need to show it in the steps?

- name: node
template:
spec:
template:
spec:
containers:
- name: node
image: mpioperator/mpi-pi:openmpi
securityContext:
runAsUser: 1000
command:
- /usr/sbin/sshd
args:
- -De
- -f
- /home/mpiuser/.sshd_config
readinessProbe:
tcpSocket:
port: 2222
initialDelaySeconds: 5

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, node does not require config mutation. All things are done only by launcher.

Copy link
Member
@tenzen-y tenzen-y Mar 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, this SDK interface tries to provide comprehensive view across all supported frameworks like Torch and Deepspeed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM. Thanks for the info:)

class Trainer:
trainer_type: TrainerType
framework: Framework
entrypoint: str
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need entrypoint? Can't we just specify it in the TrainingRuntime?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't do this always:


As you can see that allows users to kick-off simple TrainJob to inspect what is "inside the runtime":

train(
  runtime=Runtime(name=torch-distributed)
)

It might be useful if users want to see the installed packages.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM.

raise Exception(f"Runtime doesn't have trainer container {replicated_jobs}")

# Extract image name from the container image to get appropriate Trainer.
image_name = trainer_container.image.split(":")[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we get trainer object based on framework.type instead?

We can't get the framework type from the TrainingRuntime. As we discussed in the Slack channel, we decided not to add labels to the TrainingRuntime.

However, adding a map (container image -> [trainer_type, entrypoint, framework]) here is probably not a best practice, since users may want to specify their custom image. We should consider changing the TrainingRuntime APIs and add fields providing infos like trainer_type, entrypoint, framework.

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@andreyvelich andreyvelich force-pushed the sdk-ancestor-updates branch from 22d8936 to 8e9a642 Compare March 20, 2025 14:04
@coveralls
Copy link
coveralls commented Mar 20, 2025

Pull Request Test Coverage Report for Build 13975854568

Details

  • 9 of 10 (90.0%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 64.313%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/runtime/framework/plugins/jobset/builder.go 0 1 0.0%
Totals Coverage Status
Change from base Build 13971194955: 0.0%
Covered Lines: 1676
Relevant Lines: 2606

💛 - Coveralls

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@andreyvelich
Copy link
Member Author

/hold cancel

@andreyvelich
Copy link
Member Author

@Electronic-Waste @tenzen-y @astefanutti @akshaychitneni If you are happy with the changes, we make merge it.
So we can address the followup changes in the next PRs.

Copy link
Member
@Electronic-Waste Electronic-Waste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich LGTM! Thanks for this. Let's move forward.

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@@ -175,7 +175,7 @@ func (m *MPI) EnforceMLPolicy(info *runtime.Info, trainJob *trainer.TrainJob) er
WithMountPath(*info.RuntimePolicy.MLPolicySource.MPI.SSHAuthMountPath),
}...,
)
if ps.Name == constants.JobLauncher && container.Name == constants.ContainerLauncher {
if ps.Name == constants.JobLauncher && container.Name == constants.Node {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if ps.Name == constants.JobLauncher && container.Name == constants.Node {
if ps.Name == constants.JobLauncher && (container.Name == constants.Node || container.Name == constants.ContainerLauncher){

This should be considered for runLauncherAsNode: false
Ideally, we want to consider whether or not runLauncherAsNode here, but for now, we can just consider Launcher container

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Copy link
Member
@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you
Feel free to merge this one

/lgtm
/approve
/hold

F438
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Electronic-Waste, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@andreyvelich
Copy link
Member Author

Thanks everyone for the review!
/hold cancel

@google-oss-prow google-oss-prow bot merged commit 4b0c294 into kubeflow:master Mar 20, 2025
16 checks passed
@andreyvelich andreyvelich deleted the sdk-ancestor-updates branch March 20, 2025 18:19
Electronic-Waste added a commit to Electronic-Waste/training-operator that referenced this pull request Mar 24, 2025
Signed-off-by: Electronic-Waste <2690692950@qq.com>
Electronic-Waste added a commit to Electronic-Waste/training-operator that referenced this pull request Mar 24, 2025
Signed-off-by: Electronic-Waste <2690692950@qq.com>
google-oss-prow bot pushed a commit that referenced this pull request Mar 25, 2025
* feat(doc): add Runtime API design in KEP-2401.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): fix typo error.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(doc): update the implementation history.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): rename model to pretrained_model.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(doc): update runtime class according to the review.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(doc): update the runtimes design according to PR #2545

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(doc): update train() API according to PR #2545

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): update runtime_ref field.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

---------

Signed-off-by: Electronic-Waste <2690692950@qq.com>
szaher pushed a commit to szaher/sdk that referenced this pull request Jun 4, 2025
* feat(sdk): Support MPI-based TrainJobs

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Refactor list_runtimes

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix example

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add Runtime Trainer object

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update for new Runtime object

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Implement get_runtime API

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix Torch example

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Remove un-unsed consts

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update func args

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update SDK constants

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Change to 16Gi

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix container name for MPI

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Keep launcher container for MPI

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
szaher pushed a commit to szaher/sdk that referenced this pull request Jun 4, 2025
* feat(sdk): Support MPI-based TrainJobs

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Refactor list_runtimes

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix example

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add Runtime Trainer object

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update for new Runtime object

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Implement get_runtime API

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix Torch example

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Remove un-unsed consts

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update func args

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update SDK constants

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Change to 16Gi

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix container name for MPI

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Keep launcher container for MPI

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
szaher pushed a commit to szaher/sdk that referenced this pull request Jun 4, 2025
* feat(sdk): Support MPI-based TrainJobs

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Refactor list_runtimes

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix example

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add Runtime Trainer object

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update for new Runtime object

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Implement get_runtime API

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix Torch example

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Remove un-unsed consts

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update func args

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update SDK constants

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Change to 16Gi

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix container name for MPI

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Keep launcher container for MPI

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
szaher pushed a commit to szaher/sdk that referenced this pull request Jun 5, 2025
* feat(sdk): Support MPI-based TrainJobs

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Refactor list_runtimes

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix example

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add Runtime Trainer object

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update for new Runtime object

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Implement get_runtime API

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix Torch example

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Remove un-unsed consts

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update func args

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update SDK constants

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Change to 16Gi

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix container name for MPI

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Keep launcher container for MPI

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0