feat(sdk): Support MPI-based TrainJobs #2545

andreyvelich · 2025-03-19T03:59:26Z

I've made the required changes to support MPI-based TrainJobs in the Kubeflow SDK.
This is blocked by using node as the ReplicatedJob and container name for the trainer nodes.
For the launcher ReplicatedJob, we will still use node as the container name, since we are going to run the launcher as a node by default in MPI.

I also updated the following:

- Removed the phase from the runtimes
- Renamed components to steps
- Updated the Runtime object structure

TODO:

Support mpirun as entrypoint

cc @astefanutti @kubeflow/wg-training-leads @Electronic-Waste

coveralls · 2025-03-19T04:03:37Z

coverage: 62.757%. remained the same
when pulling 2c3e4df on andreyvelich:sdk-ancestor-updates
into 5c89faa on kubeflow:master.

andreyvelich · 2025-03-19T20:17:57Z

sdk/kubeflow/trainer/constants/constants.py

+# The dict where key is the container image and value its representation.
+# Each Trainer representation defines trainer parameters (e.g. type, framework, entrypoint).
+# TODO (andreyvelich): We should allow user to overrides the default image names.
+ALL_TRAINERS: Dict[str, types.Trainer] = {


@astefanutti @kubeflow/wg-training-leads @saileshd1402 @Electronic-Waste Please let me know what you think about this model for detecting Runtime Trainers from the image name.
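For readers following the thread, here is a minimal sketch of the kind of lookup being proposed. The `Trainer` class is a simplified stand-in for `types.Trainer`, the image names are made up, and the fallback to a default trainer is an assumption for illustration; the real `ALL_TRAINERS` entries live in the SDK and may differ.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Trainer:
    # Simplified stand-in for types.Trainer; values below are illustrative.
    trainer_type: str
    framework: str
    entrypoint: str


# Hypothetical image names, not the SDK's actual defaults.
ALL_TRAINERS: Dict[str, Trainer] = {
    "example.io/torch-trainer": Trainer("CustomTrainer", "torch", "torchrun"),
    "example.io/mpi-trainer": Trainer("CustomTrainer", "deepspeed", "mpirun"),
}

# Assumed fallback for images that are not in the map.
DEFAULT_TRAINER = Trainer("CustomTrainer", "torch", "torchrun")


def detect_trainer(image: str) -> Trainer:
    """Return the Trainer registered for an image, ignoring its tag."""
    image_name = image.split(":")[0]  # Same tag stripping as in this PR.
    return ALL_TRAINERS.get(image_name, DEFAULT_TRAINER)
```

With this sketch, `detect_trainer("example.io/mpi-trainer:v1")` resolves to the entry whose entrypoint is `mpirun`, while an unknown custom image silently falls back to the default.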

andreyvelich · 2025-03-19T20:20:05Z

sdk/kubeflow/trainer/types/types.py

+@dataclass
+class Runtime:
+    name: str
+    trainer: Trainer
+    pretrained_model: Optional[str] = None


During today's call I talked about this new structure of the Runtime object in the SDK.
We need it because of various dependencies; for example, the entrypoint is different:

trainer/sdk/kubeflow/trainer/utils/utils.py

Line 298 in d1f8d49

entrypoint=runtime.trainer.entrypoint,

@kubeflow/wg-training-leads @Electronic-Waste @astefanutti @shravan-achar @akshaychitneni @saileshd1402 @deepanker13 Does it look good to you?

AFAIK, there are no corresponding fields in TrainingRuntime for trainer, right? Why shall we separate it from runtime? For the mapping?

The main goal is to make it clear for ML Users that these parameters are related to the Trainer:

    trainer_type: TrainerType
    framework: Framework
    entrypoint: str
    accelerator: str = constants.UNKNOWN
    accelerator_count: Union[str, float, int] = constants.UNKNOWN

They don't need to live in the TrainingRuntime API, since this API is designed for Platform Engineers.

In the future, we can also separate pretrained_model to the initializer field in the Runtime class.
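To make the nesting concrete, here is a small self-contained sketch of the structure discussed in this thread. The field names come from the diff and the comment above; the enum members, the `"Unknown"` placeholder for `constants.UNKNOWN`, and the `torchrun` entrypoint value are assumptions for illustration only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Union

UNKNOWN = "Unknown"  # Assumed placeholder for constants.UNKNOWN.


class TrainerType(Enum):
    CUSTOM_TRAINER = "CustomTrainer"  # Hypothetical member.


class Framework(Enum):
    TORCH = "torch"  # Hypothetical member.


@dataclass
class Trainer:
    # Parameters that matter to ML users rather than platform engineers.
    trainer_type: TrainerType
    framework: Framework
    entrypoint: str
    accelerator: str = UNKNOWN
    accelerator_count: Union[str, float, int] = UNKNOWN


@dataclass
class Runtime:
    name: str
    trainer: Trainer
    pretrained_model: Optional[str] = None


runtime = Runtime(
    name="torch-distributed",
    trainer=Trainer(TrainerType.CUSTOM_TRAINER, Framework.TORCH, entrypoint="torchrun"),
)
```

Callers can then read trainer-specific settings through one attribute, for example `runtime.trainer.entrypoint`, without those fields having to exist in the TrainingRuntime API.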

SGTM. Thanks for the explanation:)

andreyvelich · 2025-03-19T22:30:40Z

The E2Es are working 🎉
Please take a look when you can.
/hold blocked by: #2548

akshaychitneni · 2025-03-20T02:20:43Z

sdk/kubeflow/trainer/utils/utils.py

+        raise Exception(f"Runtime doesn't have trainer container {replicated_jobs}")
+
+    # Extract image name from the container image to get appropriate Trainer.
+    image_name = trainer_container.image.split(":")[0]


Would it fail for custom images? Should we get trainer object based on framework.type instead?

> Should we get trainer object based on framework.type instead?

We can't get the framework type from the TrainingRuntime. As we discussed in the Slack channel, we decided not to add labels to the TrainingRuntime.

However, adding a map (container image -> [trainer_type, entrypoint, framework]) here is probably not a best practice, since users may want to specify their custom image. We should consider changing the TrainingRuntime APIs and add fields providing infos like trainer_type, entrypoint, framework.

> Would it fail for custom images?

I guess that might fail for images whose registry host includes a port: https://kubernetes.io/docs/concepts/containers/images/#image-names
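The port problem can be reproduced in a few lines. The sketch below compares the `split(":")[0]` approach from the diff with one possible alternative that only treats a colon after the last slash as the tag separator. The helper names and the example registry are hypothetical, and this is not what the PR implements.

```python
def image_name_naive(image: str) -> str:
    """Tag stripping as done in the diff: cut at the first colon."""
    return image.split(":")[0]


def image_name(image: str) -> str:
    """Strip an optional digest and tag while keeping a registry port."""
    name = image.split("@", 1)[0]  # Drop "@sha256:..." if present.
    last_slash = name.rfind("/")
    last_colon = name.rfind(":")
    # A colon after the last slash separates the tag; an earlier one is a port.
    if last_colon > last_slash:
        name = name[:last_colon]
    return name


image = "registry.example.com:5000/team/trainer:v1"
print(image_name_naive(image))  # Cuts at the port, losing the repository path.
print(image_name(image))        # Keeps host, port, and repository path.
```

For images without a port, such as `mpioperator/mpi-pi:openmpi`, both helpers return the same name.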

> However, adding a map (container image -> [trainer_type, entrypoint, framework]) here is probably not a best practice, since users may want to specify their custom image.

I agree with you that this should be at the API level, but we might need to spend time discussing what the ideal API in the TrainingRuntime is.
I don't want to block torchtune + MPI progress due to this.

@astefanutti @tenzen-y @Electronic-Waste @akshaychitneni @saileshd1402 Do we want to discuss it now or migrate in the future?

I would like to migrate in the future

> However, adding a map (container image -> [trainer_type, entrypoint, framework]) here is probably not a best practice, since users may want to specify their custom image. We should consider changing the TrainingRuntime APIs and add fields providing infos like trainer_type, entrypoint, framework.

Note that API is not a toolbox. Ideally, we want to obtain information from existing fields as much as possible.

I guess we can use https://github.com/docker/docker-py/. But not sure for now.

Yes, let's talk later about how we should design the Runtime APIs.

> I would like to migrate in the future

I agree, since KubeCon is approaching.

Electronic-Waste

@andreyvelich Thanks for this. I have a few questions for you:)

Electronic-Waste · 2025-03-20T12:23:53Z

sdk/kubeflow/trainer/types/types.py

+@dataclass
+class Runtime:
+    name: str
+    trainer: Trainer
+    pretrained_model: Optional[str] = None


AFAIK, there are no corresponding fields in TrainingRuntime for trainer, right? Why shall we separate it from runtime? For the mapping?

Electronic-Waste · 2025-03-20T12:25:20Z

sdk/kubeflow/trainer/types/types.py

+# Representation for the TrainJob steps.
+@dataclass
+class Step:
+    name: str
+    status: Optional[str]
+    pod_name: str
+    device: str = constants.UNKNOWN
+    device_count: Union[str, int] = constants.UNKNOWN
+


Is this for the config override for pods with the trainer.kubeflow.org/trainer-ancestor-step label?

Not always. For example, in the MPI use case (Launcher + Node), the Node ReplicatedJob doesn't have this label, but we still need to show users the number of nodes in the TrainJob.

> Node ReplicatedJob doesn't have this label

May I ask why Node ReplicatedJob doesn't have this label? Doesn't it need config override?

> we still add pull the data for Steps

What do you mean by "add pull the data for Steps"? Could you please elaborate a bit so that I may understand you better:)

Sorry, I meant that an MPI job creates 2 ReplicatedJobs: Launcher + Node.
However, when users call the get_job().steps API, they want to see the following TrainJob steps (if the user sets num_nodes=3):

trainer-node-0
trainer-node-1
trainer-node-2
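The naming pattern above can be sketched as a one-line helper. `expected_step_names` and its `prefix` parameter are hypothetical and only illustrate the expected output; the SDK derives steps from the TrainJob's pods.

```python
from typing import List


def expected_step_names(num_nodes: int, prefix: str = "trainer-node") -> List[str]:
    """Hypothetical helper: one step name per trainer node, indexed from 0."""
    return [f"{prefix}-{index}" for index in range(num_nodes)]


print(expected_step_names(3))
```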

So, if I understand correctly, Node ReplicatedJob does not need config mutation but we need to show it in the steps?

trainer/manifests/base/runtimes/mpi_distributed.yaml

Lines 40 to 59 in 4b0c294

- name: node
  template:
    spec:
      template:
        spec:
          containers:
            - name: node
              image: mpioperator/mpi-pi:openmpi
              securityContext:
                runAsUser: 1000
              command:
                - /usr/sbin/sshd
              args:
                - -De
                - -f
                - /home/mpiuser/.sshd_config
              readinessProbe:
                tcpSocket:
                  port: 2222
                initialDelaySeconds: 5

Yes, node does not require config mutation. Everything is done by the launcher.

However, this SDK interface tries to provide a comprehensive view across all supported frameworks, such as Torch and DeepSpeed.

SGTM. Thanks for the info:)

Electronic-Waste · 2025-03-20T12:26:00Z

sdk/kubeflow/trainer/types/types.py

+class Trainer:
+    trainer_type: TrainerType
+    framework: Framework
+    entrypoint: str


Why do we need entrypoint? Can't we just specify it in the TrainingRuntime?

We don't always do this:

trainer/manifests/base/runtimes/torch_distributed.yaml

Line 21 in d1f8d49

command:

As you can see, that allows users to kick off a simple TrainJob to inspect what is "inside the runtime":

train(
    runtime=Runtime(name="torch-distributed")
)

It might be useful if users want to see the installed packages.

Electronic-Waste · 2025-03-20T12:37:49Z

sdk/kubeflow/trainer/utils/utils.py

+        raise Exception(f"Runtime doesn't have trainer container {replicated_jobs}")
+
+    # Extract image name from the container image to get appropriate Trainer.
+    image_name = trainer_container.image.split(":")[0]


Should we get trainer object based on framework.type instead?

We can't get the framework type from the TrainingRuntime. As we discussed in the Slack channel, we decided not to add labels to the TrainingRuntime.

However, adding a map (container image -> [trainer_type, entrypoint, framework]) here is probably not a best practice, since users may want to specify their custom image. We should consider changing the TrainingRuntime APIs and add fields providing infos like trainer_type, entrypoint, framework.

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

coveralls · 2025-03-20T14:16:27Z

Pull Request Test Coverage Report for Build 13975854568

Details

9 of 10 (90.0%) changed or added relevant lines in 3 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage remained the same at 64.313%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
pkg/runtime/framework/plugins/jobset/builder.go	0	1	0.0%

Totals
Change from base Build 13971194955:	0.0%
Covered Lines:	1676
Relevant Lines:	2606

💛 - Coveralls

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

andreyvelich · 2025-03-20T16:31:02Z

/hold cancel

andreyvelich · 2025-03-20T16:32:25Z

@Electronic-Waste @tenzen-y @astefanutti @akshaychitneni If you are happy with the changes, we can merge it,
so that we can address the follow-up changes in the next PRs.

Electronic-Waste

@andreyvelich LGTM! Thanks for this. Let's move forward.

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

tenzen-y · 2025-03-20T17:16:34Z

pkg/runtime/framework/plugins/mpi/mpi.go

@@ -175,7 +175,7 @@ func (m *MPI) EnforceMLPolicy(info *runtime.Info, trainJob *trainer.TrainJob) er
 						WithMountPath(*info.RuntimePolicy.MLPolicySource.MPI.SSHAuthMountPath),
 				}...,
 			)
-			if ps.Name == constants.JobLauncher && container.Name == constants.ContainerLauncher {
+			if ps.Name == constants.JobLauncher && container.Name == constants.Node {


Suggested change

-			if ps.Name == constants.JobLauncher && container.Name == constants.Node {
+			if ps.Name == constants.JobLauncher && (container.Name == constants.Node || container.Name == constants.ContainerLauncher) {

This should also be considered for runLauncherAsNode: false.
Ideally, we would check whether runLauncherAsNode is set here, but for now we can just also accept the launcher container.

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

tenzen-y

Thank you
Feel free to merge this one

/lgtm
/approve
/hold

google-oss-prow · 2025-03-20T17:50:53Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Electronic-Waste, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [tenzen-y]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

andreyvelich · 2025-03-20T18:07:02Z

Thanks everyone for the review!
/hold cancel

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* feat(doc): add Runtime API design in KEP-2401.
  Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): fix typo error.
  Signed-off-by: Electronic-Waste <2690692950@qq.com>
* chore(doc): update the implementation history.
  Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): rename model to pretrained_model.
  Signed-off-by: Electronic-Waste <2690692950@qq.com>
* chore(doc): update runtime class according to the review.
  Signed-off-by: Electronic-Waste <2690692950@qq.com>
* chore(doc): update the runtimes design according to PR #2545
  Signed-off-by: Electronic-Waste <2690692950@qq.com>
* chore(doc): update train() API according to PR #2545
  Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): update runtime_ref field.
  Signed-off-by: Electronic-Waste <2690692950@qq.com>

---------

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* feat(sdk): Support MPI-based TrainJobs
  Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Refactor list_runtimes
  Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Fix example
  Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Add Runtime Trainer object
  Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Update for new Runtime object
  Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Implement get_runtime API
  Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Fix Torch example
  Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Remove un-unsed consts
  Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Update func args
  Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Update SDK constants
  Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Change to 16Gi
  Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Fix container name for MPI
  Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Keep launcher container for MPI
  Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

google-oss-prow bot added the do-not-merge/work-in-progress label Mar 19, 2025

google-oss-prow bot requested review from jinchihe and kuizhiqing March 19, 2025 03:59

google-oss-prow bot added the size/L label Mar 19, 2025

google-oss-prow bot added size/XL and removed size/L labels Mar 19, 2025

andreyvelich force-pushed the sdk-ancestor-updates branch from b6ad8fa to 796645a Compare March 19, 2025 12:10

andreyvelich commented Mar 19, 2025

google-oss-prow bot added size/XXL and removed size/XL labels Mar 19, 2025

google-oss-prow bot added the do-not-merge/hold label Mar 19, 2025

andreyvelich changed the title ~~[WIP] feat(sdk): Support MPI-based TrainJobs~~ feat(sdk): Support MPI-based TrainJobs Mar 19, 2025

google-oss-prow bot removed the do-not-merge/work-in-progress label Mar 19, 2025

akshaychitneni reviewed Mar 20, 2025

Electronic-Waste reviewed Mar 20, 2025

andreyvelich added 9 commits March 20, 2025 14:03

feat(sdk): Support MPI-based TrainJobs

151a1f5

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

Refactor list_runtimes

292ce8f

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

Fix example

85791e8

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

Add Runtime Trainer object

6ab6f27

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

Update for new Runtime object

6d715b8

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

Implement get_runtime API

2551383

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

Fix Torch example

9fb61b2

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

Remove un-unsed consts

00e3239

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

Update func args

8e9a642

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

andreyvelich force-pushed the sdk-ancestor-updates branch from 22d8936 to 8e9a642 Compare March 20, 2025 14:04

Update SDK constants

914a412

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

Change to 16Gi

99a7b81

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

google-oss-prow bot removed the do-not-merge/hold label Mar 20, 2025

Electronic-Waste approved these changes Mar 20, 2025

Fix container name for MPI

6fe525f

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

tenzen-y reviewed Mar 20, 2025

Keep launcher container for MPI

0ab7798

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

tenzen-y reviewed Mar 20, 2025

google-oss-prow bot added the do-not-merge/hold label Mar 20, 2025

google-oss-prow bot assigned tenzen-y Mar 20, 2025

google-oss-prow bot added the lgtm label Mar 20, 2025

google-oss-prow bot added the approved label Mar 20, 2025

google-oss-prow bot removed the do-not-merge/hold label Mar 20, 2025

google-oss-prow bot merged commit 4b0c294 into kubeflow:master Mar 20, 2025
16 checks passed

andreyvelich deleted the sdk-ancestor-updates branch March 20, 2025 18:19

Electronic-Waste added a commit to Electronic-Waste/training-operator that referenced this pull request Mar 24, 2025

chore(doc): update the runtimes design according to PR kubeflow#2545

0f9d1be

Signed-off-by: Electronic-Waste <2690692950@qq.com>

Electronic-Waste added a commit to Electronic-Waste/training-operator that referenced this pull request Mar 24, 2025

chore(doc): update train() API according to PR kubeflow#2545

095c426

Signed-off-by: Electronic-Waste <2690692950@qq.com>

Electronic-Waste mentioned this pull request Mar 24, 2025

feat(doc): add Runtime API design in KEP-2401. #2501

Merged

1 task

