8000 fix(sdk): Using correct entrypoint for mpirun by andreyvelich · Pull Request #2552 · kubeflow/trainer · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

fix(sdk): Using correct entrypoint for mpirun #2552

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Mar 21, 2025

Conversation

andreyvelich
Copy link
Member
@andreyvelich andreyvelich commented Mar 20, 2025

For the MPI use-cases the Python file must be available on all Pods: Launcher + Node.
Thus, we should modify the command and args compare to torchrun.

After this change, the container command and args looks as follows for mpirun:

args:
- |2-

  read -r -d '' SCRIPT << EOM

  def train_func():
    print("script here")

  train_func()

  EOM
  printf "%s" "$SCRIPT" > "/home/mpiuser/4070280887.py"
  python3 "/home/mpiuser/4070280887.py"
command:
- mpirun
- --hostfile
- /etc/mpi/hostfile
- -x
- LD_LIBRARY_PATH=/usr/local/lib/
- bash
- -c

/assign @tenzen-y @kubeflow/wg-training-leads @Electronic-Waste @astefanutti @saileshd1402

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@coveralls
Copy link
coveralls commented Mar 20, 2025

Pull Request Test Coverage Report for Build 13990912640

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 26 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.09%) to 64.403%

Files with Coverage Reduction New Missed Lines %
pkg/util/testing/wrapper.go 9 99.04%
pkg/webhooks/trainingruntime_webhook.go 17 48.57%
Totals Coverage Status
Change from base Build 13976320297: 0.09%
Covered Lines: 1688
Relevant Lines: 2621

💛 - Coveralls

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@andreyvelich
Copy link
Member Author

cc @vsoch in case you are interested how do we place user's code into distributed MPI nodes using mpirun command and Python SDK.

Copy link
Member
@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you

/lgtm
/approve
/hold

container_command = runtime.trainer.entrypoint
python_entrypoint = "python"
# mpirun uses file from this location: /home/mpiuser/<FILE_NAME>.py
func_file = os.path.join("/home", "mpiuser", func_file)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of user container image, this mpiuser directory does not exist.
So, could you open an issue to address arbitrary USER instead of this fixed mpiuser?

Copy link
Member Author
@andreyvelich andreyvelich Mar 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be tricky for us to support arbitrary USER name here, since SDK user might don't know the structure of Docker container.
For now, could we say that the MPI-based runtimes must define mpiuser to run it ?
I am planning to use this user in the DeepSpeed runtime.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, yes. We can enforce the USER, mpiuser.
However, in the future, we want to support switching base images since upstream can not support all of NVIDIA CUDA and the base images.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I added the env variable that user can overrides in case directory is different: 39e50f1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you
/lgtm

Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@google-oss-prow google-oss-prow bot added lgtm and removed lgtm labels Mar 21, 2025
@andreyvelich
Copy link
Member Author

/hold cancel

8000

@google-oss-prow google-oss-prow bot merged commit 8aa97a4 into kubeflow:master Mar 21, 2025
16 checks passed
@andreyvelich andreyvelich deleted the sdk-fix-mpirun branch March 21, 2025 13:03
szaher pushed a commit to szaher/sdk that referenced this pull request Jun 4, 2025
* fix(sdk): Using correct entrypoint for mpirun

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix torchrun entrypoint

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Allow to configure mpiuser home dir

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
szaher pushed a commit to szaher/sdk that referenced this pull request Jun 4, 2025
* fix(sdk): Using correct entrypoint for mpirun

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix torchrun entrypoint

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Allow to configure mpiuser home dir

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
szaher pushed a commit to szaher/sdk that referenced this pull request Jun 4, 2025
* fix(sdk): Using correct entrypoint for mpirun

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix torchrun entrypoint

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Allow to configure mpiuser home dir

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
szaher pushed a commit to szaher/sdk that referenced this pull request Jun 5, 2025
* fix(sdk): Using correct entrypoint for mpirun

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix torchrun entrypoint

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Allow to configure mpiuser home dir

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0