fix(sdk): Using correct entrypoint for mpirun #2552

andreyvelich · 2025-03-20T21:40:07Z

For MPI use cases, the Python file must be available on all Pods: Launcher + Node.
Thus, we should modify the command and args compared to torchrun.

After this change, the container command and args look as follows for mpirun:

args:
- |2-

  read -r -d '' SCRIPT << EOM

  def train_func():
    print("script here")

  train_func()

  EOM
  printf "%s" "$SCRIPT" > "/home/mpiuser/4070280887.py"
  python3 "/home/mpiuser/4070280887.py"
command:
- mpirun
- --hostfile
- /etc/mpi/hostfile
- -x
- LD_LIBRARY_PATH=/usr/local/lib/
- bash
- -c
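
As a rough illustration of how a command and args pair like the one above could be assembled, here is a minimal Python sketch. The helper name `build_mpirun_args` and its parameters are hypothetical and are not taken from the SDK; only the `mpirun` command list and the heredoc layout come from the PR description.

```python
import posixpath

# Command list copied from the PR description.
MPIRUN_COMMAND = [
    "mpirun",
    "--hostfile",
    "/etc/mpi/hostfile",
    "-x",
    "LD_LIBRARY_PATH=/usr/local/lib/",
    "bash",
    "-c",
]


def build_mpirun_args(func_source, func_name, home_dir="/home/mpiuser", file_name="train.py"):
    """Hypothetical helper: wrap a training function in a bash script.

    The script stores the Python source in a shell variable via a heredoc,
    writes it to a file under home_dir, and then runs that file.
    """
    # Append a call to the function so the file is runnable on its own.
    script = f"{func_source.rstrip()}\n\n{func_name}()\n"
    # Container paths are POSIX paths regardless of the client OS.
    path = posixpath.join(home_dir, file_name)
    bash_script = (
        "read -r -d '' SCRIPT << EOM\n"
        f"{script}"
        "EOM\n"
        f'printf "%s" "$SCRIPT" > "{path}"\n'
        f'python3 "{path}"'
    )
    # A single args entry, passed to `bash -c` by the command above.
    return [bash_script]


args = build_mpirun_args(
    'def train_func():\n    print("script here")\n',
    "train_func",
    file_name="4070280887.py",
)
```

The function source is passed in as a string to keep the sketch self-contained; a real client would more likely obtain it from the function object, for example with `inspect.getsource`.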

/assign @tenzen-y @kubeflow/wg-training-leads @Electronic-Waste @astefanutti @saileshd1402

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

coveralls · 2025-03-20T21:44:20Z

Pull Request Test Coverage Report for Build 13990912640

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

0 of 0 changed or added relevant lines in 0 files are covered.
26 unchanged lines in 2 files lost coverage.
Overall coverage increased (+0.09%) to 64.403%

Files with Coverage Reduction	New Missed Lines	%
pkg/util/testing/wrapper.go	9	99.04%
pkg/webhooks/trainingruntime_webhook.go	17	48.57%

Totals
Change from base Build 13976320297:	0.09%
Covered Lines:	1688
Relevant Lines:	2621

💛 - Coveralls


andreyvelich · 2025-03-21T02:18:44Z

cc @vsoch, in case you are interested in how we place the user's code on distributed MPI nodes using the mpirun command and the Python SDK.

tenzen-y

Thank you

/lgtm
/approve
/hold

tenzen-y · 2025-03-21T09:26:23Z

sdk/kubeflow/trainer/utils/utils.py

+        container_command = runtime.trainer.entrypoint
+        python_entrypoint = "python"
+        # mpirun uses file from this location: /home/mpiuser/<FILE_NAME>.py
+        func_file = os.path.join("/home", "mpiuser", func_file)


In the case of a user container image, this mpiuser directory does not exist.
So, could you open an issue to address an arbitrary USER instead of this fixed mpiuser?

It might be tricky for us to support an arbitrary USER name here, since SDK users might not know the structure of the Docker container.
For now, could we say that MPI-based runtimes must define mpiuser to run it?
I am planning to use this user in the DeepSpeed runtime.

For now, yes. We can enforce the USER, mpiuser.
However, in the future, we want to support switching base images, since upstream cannot support all NVIDIA CUDA versions and base images.

Good point, I added an env variable that the user can override in case the directory is different: 39e50f1

Thank you
/lgtm
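
The override added in 39e50f1 can be pictured roughly as follows. This is a sketch only: the environment variable name `MPI_USER_HOME` and the helper `resolve_func_file` are hypothetical stand-ins, since the thread does not name the actual variable. The default directory `/home/mpiuser` is the one shown in the reviewed diff.

```python
import os
import posixpath

# Default location from the reviewed diff: /home/mpiuser/<FILE_NAME>.py
DEFAULT_MPI_USER_HOME = "/home/mpiuser"


def resolve_func_file(func_file, env=None):
    """Hypothetical helper: pick the directory for the generated script.

    Uses the MPI_USER_HOME variable (an assumed name) when it is set,
    otherwise falls back to the fixed mpiuser home directory.
    """
    env = os.environ if env is None else env
    home_dir = env.get("MPI_USER_HOME", DEFAULT_MPI_USER_HOME)
    # Container paths are POSIX paths regardless of the client OS.
    return posixpath.join(home_dir, func_file)


default_path = resolve_func_file("4070280887.py", env={})
custom_path = resolve_func_file("4070280887.py", env={"MPI_USER_HOME": "/home/custom"})
```

With this shape, a runtime image that uses a different user only needs to set one variable rather than patch the SDK.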

google-oss-prow · 2025-03-21T09:26:55Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [tenzen-y]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


andreyvelich · 2025-03-21T11:44:49Z

/hold cancel


* fix(sdk): Using correct entrypoint for mpirun
  Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Fix torchrun entrypoint
  Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Allow to configure mpiuser home dir
  Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
---------
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

fix(sdk): Using correct entrypoint for mpirun

90436ef

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

google-oss-prow bot requested review from jinchihe and kuizhiqing March 20, 2025 21:40

google-oss-prow bot added the size/M label Mar 20, 2025

Fix torchrun entrypoint

30af02e

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

tenzen-y reviewed Mar 21, 2025

google-oss-prow bot added the do-not-merge/hold label Mar 21, 2025

google-oss-prow bot assigned tenzen-y Mar 21, 2025

google-oss-prow bot added the lgtm label Mar 21, 2025

google-oss-prow bot added the approved label Mar 21, 2025

Allow to configure mpiuser home dir

39e50f1

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

google-oss-prow bot added lgtm and removed lgtm labels Mar 21, 2025

google-oss-prow bot removed the do-not-merge/hold label Mar 21, 2025

google-oss-prow bot merged commit 8aa97a4 into kubeflow:master Mar 21, 2025
16 checks passed

andreyvelich deleted the sdk-fix-mpirun branch March 21, 2025 13:03
