feat(runtimes): Support DeepSpeed Runtime with OpenMPI by andreyvelich · Pull Request #2559 · kubeflow/trainer · GitHub

feat(runtimes): Support DeepSpeed Runtime with OpenMPI #2559


Merged
merged 4 commits into kubeflow:master from andreyvelich:deepspeed-runtime
Mar 24, 2025

Conversation

@andreyvelich (Member) commented Mar 22, 2025

Fixes: #2517

This is the implementation of the DeepSpeed runtime in Kubeflow Trainer V2.

I added the T5 fine-tuning example with DeepSpeed and OpenMPI. The notebook uses 8 x V100 GPUs across 2 MPI nodes.
I hope we can run this notebook on OKE once we have resources as part of: #2432

/hold for review

cc @kubeflow/wg-training-leads @Electronic-Waste @astefanutti @kuizhiqing @seanlaii @saileshd1402 @deepanker13 @shravan-achar @akshaychitneni @Syulin7 @franciscojavierarceo @kannon92 @chasecadet @StefanoFioravanzo @jaiakash @ahg-g @vsoch
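
For readers following along, here is a minimal sketch of how a user could point a TrainJob at this runtime. The runtime name deepspeed-distributed comes from this PR; the rest of the spec reflects the Trainer V2 TrainJob API shape at the time and may differ from the final CRD:

apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: t5-fine-tuning
spec:
  runtimeRef:
    name: deepspeed-distributed   # ClusterTrainingRuntime defined in this PR
  trainer:
    numNodes: 2                   # matches the 2 MPI nodes in the notebook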

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

@coveralls commented Mar 22, 2025

Pull Request Test Coverage Report for Build 14041190783

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 17 unchanged lines in 4 files lost coverage.
  • Overall coverage increased (+0.6%) to 64.979%

Files with Coverage Reduction                      New Missed Lines   %
pkg/runtime/framework/plugins/jobset/jobset.go     1                  24.03%
pkg/util/testing/wrapper.go                        2                  99.04%
pkg/runtime/core/clustertrainingruntime.go         3                  37.93%
pkg/runtime/framework/plugins/jobset/builder.go    11                 0.0%
Totals Coverage Status
Change from base Build 13991199064: 0.6%
Covered Lines: 1694
Relevant Lines: 2607

💛 - Coveralls

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
cmake g++ gcc \
wget vim \
openssh-client openssh-server libcap2-bin \
libopenmpi-dev openmpi-bin
Member Author

I downgraded the OpenMPI version since 5.0 is not required for DeepSpeed to work.
That should speed up the image build time.
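
For illustration, a sketch of what the downgrade amounts to in the Dockerfile: installing the distro-packaged OpenMPI 4.1.x on Ubuntu 22.04 instead of compiling 5.0 from source. The package list is taken from the diff above; the exact RUN layout in the PR may differ:

# Sketch: distro OpenMPI 4.1.x avoids a slow from-source 5.0 build.
RUN apt-get update && apt-get install -y --no-install-recommends \
    cmake g++ gcc \
    wget vim \
    openssh-client openssh-server libcap2-bin \
    libopenmpi-dev openmpi-bin \
    && rm -rf /var/lib/apt/lists/*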

@Electronic-Waste (Member) left a comment

@andreyvelich Thanks for this. I left my initial comments for you.

Member

Since DeepSpeed belongs to CustomTrainer in the SDK, shall we rename this dir to cmd/trainers/deepspeed?

Member Author

I think this is not actually a trainer, but a runtime, since the image doesn't implement the training code.
I can imagine that someone might want to create an actual training script with DeepSpeed and expose the configuration users want to adjust.
In that case, we can put it under trainers.

metadata:
  name: deepspeed-distributed
  labels:
    trainer.kubeflow.org/accelerator: gpu-tesla-v100-16gb
Member

What is this label for? Have we agreed on adding a label like this?

If I recall correctly, we discussed this in Slack and came to the conclusion that we should avoid labels like this.

Member Author

Let me remove it for now, so we can discuss this label later.

spec:
  containers:
    - name: node
      image: docker.io/andreyvelichkevich/deepspeed-runtime
Member

Suggested change
image: docker.io/andreyvelichkevich/deepspeed-runtime
image: ghcr.io/kubeflow/trainer/deepspeed-runtime

Shall we change this image to a GitHub-hosted one?

spec:
  containers:
    - name: node
      image: docker.io/andreyvelichkevich/deepspeed-runtime
Member

Suggested change
image: docker.io/andreyvelichkevich/deepspeed-runtime
image: ghcr.io/kubeflow/trainer/deepspeed-runtime

Comment on lines 4 to 6
- torch_distributed.yaml
- mpi_distributed.yaml
- deepspeed_distributed.yaml
@Electronic-Waste (Member) commented Mar 24, 2025

Suggested change
- torch_distributed.yaml
- mpi_distributed.yaml
- deepspeed_distributed.yaml
- deepspeed_distributed.yaml
- mpi_distributed.yaml
- torch_distributed.yaml

If there are no strict restrictions on the installation order, we'd better list them alphabetically :)
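
For reference, a sketch of how the resources block would read after applying this suggestion; the surrounding kustomization.yaml fields are assumed:

# kustomization.yaml (only the resources field shown)
resources:
  - deepspeed_distributed.yaml
  - mpi_distributed.yaml
  - torch_distributed.yaml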

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@andreyvelich (Member Author)
/hold cancel

@@ -23,10 +23,12 @@ jobs:
        - component-name: dataset-initializer
          dockerfile: cmd/initializers/dataset/Dockerfile
          platforms: linux/amd64,linux/arm64
        - component-name: deepspeed-runtime
Member Author

The image build takes ~40 minutes due to the image size.
Is there any way for us to speed it up?
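
One common lever, sketched here as an assumption rather than what this PR does: enabling the GitHub Actions layer cache in docker/build-push-action, so the heavy CUDA and OpenMPI layers are reused when only the training bits change. The step name and Dockerfile path below are hypothetical:

# Sketch of a cached build step (hypothetical names/paths).
- name: Build deepspeed-runtime image
  uses: docker/build-push-action@v6
  with:
    file: cmd/runtimes/deepspeed/Dockerfile  # hypothetical path
    platforms: linux/amd64,linux/arm64
    cache-from: type=gha
    cache-to: type=gha,mode=max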

@@ -0,0 +1,42 @@
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04
Member

Suggested change
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04
FROM mpioperator/base:v0.6.0 AS mpi
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04

I want to avoid managing the sshd config file in this repository.

Member Author

Good suggestion!

Comment on lines 19 to 27
# Disable StrictHostKeyChecking to Allow OpenSSH to talk to containers without asking for it.
# TrainJob controller mounts the .ssh folder from a Secret.
# Disable UserKnownHostsFile to avoid write permissions on .ssh folder.
# Disabling StrictModes avoids directory and files read permission checks.
RUN sed -i "s/[ #]\(.*StrictHostKeyChecking \).*/ \1no/g" /etc/ssh/ssh_config
RUN echo " UserKnownHostsFile /dev/null" >>/etc/ssh/ssh_config
RUN sed -i "s/[ #]\(.*Port \).*/ \1${PORT}/g" /etc/ssh/ssh_config
RUN sed -i "s/#\(StrictModes \).*/\1no/g" /etc/ssh/sshd_config
RUN sed -i "s/#\(Port \).*/\1${PORT}/g" /etc/ssh/sshd_config
Member

Suggested change
# Disable StrictHostKeyChecking to Allow OpenSSH to talk to containers without asking for it.
# TrainJob controller mounts the .ssh folder from a Secret.
# Disable UserKnownHostsFile to avoid write permissions on .ssh folder.
# Disabling StrictModes avoids directory and files read permission checks.
RUN sed -i "s/[ #]\(.*StrictHostKeyChecking \).*/ \1no/g" /etc/ssh/ssh_config
RUN echo " UserKnownHostsFile /dev/null" >>/etc/ssh/ssh_config
RUN sed -i "s/[ #]\(.*Port \).*/ \1${PORT}/g" /etc/ssh/ssh_config
RUN sed -i "s/#\(StrictModes \).*/\1no/g" /etc/ssh/sshd_config
RUN sed -i "s/#\(Port \).*/\1${PORT}/g" /etc/ssh/sshd_config

RUN useradd -m mpiuser
WORKDIR /home/mpiuser

# Configurations for running sshd as non-root.
Member

Suggested change
# Configurations for running sshd as non-root.
# Configurations for running sshd as non-root.
COPY --from=mpi /home/mpiuser/.sshd_config /home/mpiuser/.sshd_config
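
For context, here is how such a copied config is typically consumed in the mpi-operator image pattern; the entrypoint below is a sketch of that pattern, not necessarily what this PR ships:

# Sketch: run sshd in the foreground as the non-root mpiuser,
# pointing it at the user-owned config copied from the mpi base image.
USER mpiuser
CMD ["/usr/sbin/sshd", "-De", "-f", "/home/mpiuser/.sshd_config"]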

Member

Removing this file to avoid managing the sshd config file in this repository.

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@tenzen-y (Member) left a comment

Thank you
/lgtm
/approve

/hold


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@andreyvelich (Member Author)

Thanks everyone for the review 🚀
/hold cancel

@google-oss-prow google-oss-prow bot merged commit aa0e289 into kubeflow:master Mar 24, 2025
17 checks passed
@andreyvelich andreyvelich deleted the deepspeed-runtime branch March 24, 2025 19:35
szaher pushed a commit to szaher/sdk that referenced this pull request Jun 4, 2025
…ner#2559)

* feat(runtimes): Support DeepSpeed Runtime

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Downgrade OpenMPI to 4.0 version

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix the runtime spec

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Reuse sshd config from MPI operator

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Development

Successfully merging this pull request may close these issues.

Create DeepSpeed Runtime with Kubeflow Trainer
4 participants