feat(runtimes): Support DeepSpeed Runtime with OpenMPI by andreyvelich · Pull Request #2559 · kubeflow/trainer · GitHub

feat(runtimes): Support DeepSpeed Runtime with OpenMPI #2559


Merged
merged 4 commits into kubeflow:master from andreyvelich:deepspeed-runtime
Mar 24, 2025

Conversation

@andreyvelich (Member) commented Mar 22, 2025

Fixes: #2517

This is the implementation of the DeepSpeed runtime in Kubeflow Trainer V2.

I added the T5 fine-tuning example with DeepSpeed and OpenMPI. The notebook uses 8 x V100 GPUs across 2 MPI nodes.
I hope we can run this notebook on OKE once we have resources as part of: #2432

/hold for review

cc @kubeflow/wg-training-leads @Electronic-Waste @astefanutti @kuizhiqing @seanlaii @saileshd1402 @deepanker13 @shravan-achar @akshaychitneni @Syulin7 @franciscojavierarceo @kannon92 @chasecadet @StefanoFioravanzo @jaiakash @ahg-g @vsoch
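
For readers following along, here is a minimal sketch of how a user could point a TrainJob at this runtime. The runtime name deepspeed-distributed comes from this PR; the rest of the spec reflects the Trainer V2 TrainJob API shape at the time and may differ from the final CRD:

apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: t5-fine-tuning
spec:
  runtimeRef:
    name: deepspeed-distributed   # ClusterTrainingRuntime defined in this PR
  trainer:
    numNodes: 2                   # matches the 2 MPI nodes in the notebook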

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

@coveralls commented Mar 22, 2025

Pull Request Test Coverage Report for Build 14041190783

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 17 unchanged lines in 4 files lost coverage.
  • Overall coverage increased (+0.6%) to 64.979%

Files with Coverage Reduction                      New Missed Lines   %
pkg/runtime/framework/plugins/jobset/jobset.go     1                  24.03%
pkg/util/testing/wrapper.go                        2                  99.04%
pkg/runtime/core/clustertrainingruntime.go         3                  37.93%
pkg/runtime/framework/plugins/jobset/builder.go    11                 0.0%
Totals Coverage Status
Change from base Build 13991199064: 0.6%
Covered Lines: 1694
Relevant Lines: 2607

💛 - Coveralls

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
cmake g++ gcc \
wget vim \
openssh-client openssh-server libcap2-bin \
libopenmpi-dev openmpi-bin
Member Author

I downgraded the OpenMPI version since 5.0 is not required for DeepSpeed to work.
That should speed up the image build time.
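
For illustration, a sketch of what the downgrade amounts to in the Dockerfile: installing the distro-packaged OpenMPI 4.1.x on Ubuntu 22.04 instead of compiling 5.0 from source. The package list is taken from the diff above; the exact RUN layout in the PR may differ:

# Sketch: distro OpenMPI 4.1.x avoids a slow from-source 5.0 build.
RUN apt-get update && apt-get install -y --no-install-recommends \
    cmake g++ gcc \
    wget vim \
    openssh-client openssh-server libcap2-bin \
    libopenmpi-dev openmpi-bin \
    && rm -rf /var/lib/apt/lists/*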

@Electronic-Waste (Member) left a comment

@andreyvelich Thanks for this. I left my initial comments for you.

Member

Since DeepSpeed belongs to CustomTrainer in the SDK, shall we rename this dir to cmd/trainers/deepspeed?

Member Author

I think this is not actually a trainer, but a runtime, since the image doesn't implement the training code.
I can imagine that someone might want to create an actual training script with DeepSpeed and expose the configuration users want to adjust.
In that case, we can put it under trainers.

metadata:
  name: deepspeed-distributed
  labels:
    trainer.kubeflow.org/accelerator: gpu-tesla-v100-16gb
Member

What is this label for? Have we agreed on adding a label like this?

If I recall correctly, we discussed this in Slack and came to the conclusion that we should avoid labels like this.

Member Author

Let me remove it for now, so we can discuss this label later.

spec:
  containers:
    - name: node
      image: docker.io/andreyvelichkevich/deepspeed-runtime
Member

Suggested change
image: docker.io/andreyvelichkevich/deepspeed-runtime
image: ghcr.io/kubeflow/trainer/deepspeed-runtime

Shall we change this image to a GitHub-hosted one?

spec:
  containers:
    - name: node
      image: docker.io/andreyvelichkevich/deepspeed-runtime
Member

Suggested change
image: docker.io/andreyvelichkevich/deepspeed-runtime
image: ghcr.io/kubeflow/trainer/deepspeed-runtime

Comment on lines 4 to 6
- torch_distributed.yaml
- mpi_distributed.yaml
- deepspeed_distributed.yaml
@Electronic-Waste (Member) commented Mar 24, 2025

Suggested change
- torch_distributed.yaml
- mpi_distributed.yaml
- deepspeed_distributed.yaml
- deepspeed_distributed.yaml
- mpi_distributed.yaml
- torch_distributed.yaml

If there are no strict restrictions on the installation order, we'd better list them alphabetically :)
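
For reference, a sketch of how the resources block would read after applying this suggestion; the surrounding kustomization.yaml fields are assumed:

# kustomization.yaml (only the resources field shown)
resources:
  - deepspeed_distributed.yaml
  - mpi_distributed.yaml
  - torch_distributed.yaml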

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@andreyvelich (Member Author)
/hold cancel

@@ -23,10 +23,12 @@ jobs:
        - component-name: dataset-initializer
          dockerfile: cmd/initializers/dataset/Dockerfile
          platforms: linux/amd64,linux/arm64
        - component-name: deepspeed-runtime
Member Author

The image build takes ~40 minutes due to the image size.
Is there any way for us to speed it up?
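
One common lever, sketched here as an assumption rather than what this PR does: enabling the GitHub Actions layer cache in docker/build-push-action, so the heavy CUDA and OpenMPI layers are reused when only the training bits change. The step name and Dockerfile path below are hypothetical:

# Sketch of a cached build step (hypothetical names/paths).
- name: Build deepspeed-runtime image
  uses: docker/build-push-action@v6
  with:
    file: cmd/runtimes/deepspeed/Dockerfile  # hypothetical path
    platforms: linux/amd64,linux/arm64
    cache-from: type=gha
    cache-to: type=gha,mode=max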

@@ -0,0 +1,42 @@
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04
Member

Suggested change
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04
FROM mpioperator/base:v0.6.0 AS mpi
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04

I want to avoid managing the sshd config file in this repository.

Member Author

Good suggestion!

Comment on lines 19 to 27
# Disable StrictHostKeyChecking to Allow OpenSSH to talk to containers without asking for it.
# TrainJob controller mounts the .ssh folder from a Secret.
# Disable UserKnownHostsFile to avoid write permissions on .ssh folder.
# Disabling StrictModes avoids directory and files read permission checks.
RUN sed -i "s/[ #]\(.*StrictHostKeyChecking \).*/ \1no/g" /etc/ssh/ssh_config
RUN echo " UserKnownHostsFile /dev/null" >>/etc/ssh/ssh_config
RUN sed -i "s/[ #]\(.*Port \).*/ \1${PORT}/g" /etc/ssh/ssh_config
RUN sed -i "s/#\(StrictModes \).*/\1no/g" /etc/ssh/sshd_config
RUN sed -i "s/#\(Port \).*/\1${PORT}/g" /etc/ssh/sshd_config
Member

Suggested change
# Disable StrictHostKeyChecking to Allow OpenSSH to talk to containers without asking for it.
# TrainJob controller mounts the .ssh folder from a Secret.
# Disable UserKnownHostsFile to avoid write permissions on .ssh folder.
# Disabling StrictModes avoids directory and files read permission checks.
RUN sed -i "s/[ #]\(.*StrictHostKeyChecking \).*/ \1no/g" /etc/ssh/ssh_config
RUN echo " UserKnownHostsFile /dev/null" >>/etc/ssh/ssh_config
RUN sed -i "s/[ #]\(.*Port \).*/ \1${PORT}/g" /etc/ssh/ssh_config
RUN sed -i "s/#\(StrictModes \).*/\1no/g" /etc/ssh/sshd_config
RUN sed -i "s/#\(Port \).*/\1${PORT}/g" /etc/ssh/sshd_config

RUN useradd -m mpiuser
WORKDIR /home/mpiuser

# Configurations for running sshd as non-root.
Member

Suggested change
# Configurations for running sshd as non-root.
# Configurations for running sshd as non-root.
COPY --from=mpi /home/mpiuser/.sshd_config /home/mpiuser/.sshd_config
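
For context, here is how such a copied config is typically consumed in the mpi-operator image pattern; the entrypoint below is a sketch of that pattern, not necessarily what this PR ships:

# Sketch: run sshd in the foreground as the non-root mpiuser,
# pointing it at the user-owned config copied from the mpi base image.
USER mpiuser
CMD ["/usr/sbin/sshd", "-De", "-f", "/home/mpiuser/.sshd_config"]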

Member

Removing this file to avoid managing the sshd config file in this repository.

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@tenzen-y (Member) left a comment

Thank you
/lgtm
/approve

/hold


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@andreyvelich (Member Author)

Thanks everyone for the review 🚀
/hold cancel

@google-oss-prow google-oss-prow bot merged commit aa0e289 into kubeflow:master Mar 24, 2025
17 checks passed
@andreyvelich andreyvelich deleted the deepspeed-runtime branch March 24, 2025 19:35
szaher pushed a commit to szaher/sdk that referenced this pull request Jun 4, 2025
…ner#2559)

* feat(runtimes): Support DeepSpeed Runtime

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Downgrade OpenMPI to 4.0 version

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix the runtime spec

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Reuse sshd config from MPI operator

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Development

Successfully merging this pull request may close these issues.

Create DeepSpeed Runtime with Kubeflow Trainer
4 participants