KEP-2437: Support Volcano Scheduler in Kubeflow Trainer V2 by Doris-xm · Pull Request #2672 · kubeflow/trainer · GitHub

KEP-2437: Support Volcano Scheduler in Kubeflow Trainer V2 #2672


Open
wants to merge 16 commits into master from KEP-volcano-scheduler
Conversation

Doris-xm

What this PR does / why we need it:

This PR converts the GSoC proposal into a KEP. The detailed description of the project is in Project 10: Support Volcano Scheduler in Kubeflow Trainer.

Which issue(s) this PR fixes:
Part of #2671

Checklist:

  • Docs included if any changes are user facing

Signed-off-by: Xinmin Du <2812493086@qq.com>
@google-oss-prow google-oss-prow bot requested a review from jinchihe June 16, 2025 13:44

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gaocegege for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coveralls
coveralls commented Jun 16, 2025

Pull Request Test Coverage Report for Build 15849813368

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 29.19%

Totals:
  • Change from base Build 15579727901: 0.0%
  • Covered Lines: 897
  • Relevant Lines: 3073

💛 - Coveralls

Signed-off-by: Xinmin Du <2812493086@qq.com>
@Doris-xm Doris-xm force-pushed the KEP-volcano-scheduler branch from 2c6f9c6 to bbc1cee Compare June 16, 2025 14:04
Member
@Electronic-Waste Electronic-Waste left a comment


@Doris-xm Thanks for this great work! I've left my initial comments for you.

/cc @kubeflow/wg-training-leads @astefanutti @rudeigerc


**Kubeflow Trainer** is a core component of the Kubeflow ecosystem, responsible for managing and executing distributed training jobs. In distributed training scenarios, an efficient **scheduling mechanism** is crucial:

- A distributed training job typically involves multiple pods (such as parameter servers and worker nodes) running in coordination. To avoid resource wastage, all pods need to be started at the same time. That’s why **Gang Scheduling** matters.
Member

The PS-Worker paradigm is unique to TensorFlow. Since we decided to remove TF support in Trainer V2, can you replace it with a new example?

REF: https://cloud-native.slack.com/archives/C0742LDFZ4K/p1749951840811039

Author

Thanks for reviewing. I will use the torchrun training process as the example instead.
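To make the gang-scheduling requirement concrete with a torchrun-style example, here is a minimal Go sketch (the type and function names are illustrative only, not actual Trainer APIs): the PodGroup's minMember must equal the total pod count so that the scheduler admits the job only when every pod can start together.

```go
package main

import "fmt"

// replicaSpec is a simplified stand-in for a TrainJob pod set
// (e.g., torchrun nodes); it is NOT a real Trainer API type.
type replicaSpec struct {
	Name     string
	Replicas int32
}

// gangMinMember computes the minMember a PodGroup would need so the
// scheduler admits the job only when all pods can start together.
func gangMinMember(specs []replicaSpec) int32 {
	var total int32
	for _, s := range specs {
		total += s.Replicas
	}
	return total
}

func main() {
	// A torchrun job with 4 nodes: all 4 pods must be gang-scheduled.
	specs := []replicaSpec{{Name: "node", Replicas: 4}}
	fmt.Println(gangMinMember(specs)) // prints 4
}
```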

Comment on lines 48 to 51
As shown in the diagram, users can utilize advanced scheduling in two ways:

1. By specifying the scheduling spec when customizing *ClusterTrainingRuntime* / *TrainRuntime*. Suitable for platform engineers who are familiar with the Kubernetes API and the Volcano scheduler.
2. By choosing a *TrainingRuntime* with a specific scheduling method in the *TrainJob*. Suitable for data scientists who don't need to understand the underlying implementation details.
Member

I think the two ways you mentioned above are not two separate processes. In fact, the platform engineer creates and manages the CTRs/TRs, from which data scientists choose one and apply a TrainJob on top of it. So it's a single process. WDYT? @Doris-xm

Author

Yes, the two ways are not unrelated processes; they are in an upstream-downstream relationship. So how about describing it as a two-stage workflow?

Member

FYR, you can describe it in temporal order, like:

  1. First, Platform Engineers...
  2. Then, Data Scientists will choose and...

For the detailed expression, you could refer to: #2437 (comment)

@google-oss-prow google-oss-prow bot requested review from a team and astefanutti June 23, 2025 09:44

@Electronic-Waste: GitHub didn't allow me to request PR reviews from the following users: rudeigerc.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

@Doris-xm Thanks for this great work! I've left my initial comments for you.

/cc @kubeflow/wg-training-leads @astefanutti @rudeigerc

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Electronic-Waste
Member

@Doris-xm I also recommend changing the title of this PR to either:

  1. KEP-2437: Support Volcano Scheduler in Kubeflow Trainer V2
  2. [GSoC] Project 10: Support Volcano Scheduler in Kubeflow Trainer V2

It will be clearer and neater, I think :)

Doris-xm and others added 11 commits June 24, 2025 16:20
Co-authored-by: Shao Wang <2690692950@qq.com>
Signed-off-by: Du Xinmin <2812493086@qq.com>
Signed-off-by: Xinmin Du <2812493086@qq.com>
@Doris-xm Doris-xm changed the title KEP-2437: Creating Kubeflow Enhancement Proposal KEP-2437: Support Volcano Scheduler in Kubeflow Trainer V2 Jun 24, 2025
Doris-xm added 2 commits June 24, 2025 18:47
Signed-off-by: Xinmin Du <2812493086@qq.com>
Signed-off-by: Xinmin Du <2812493086@qq.com>
@Doris-xm
Author

/rerun-all

Member
@Electronic-Waste Electronic-Waste left a comment


@Doris-xm Thanks for the updates. It looks good to me!

/cc @kubeflow/wg-training-leads @astefanutti @rudeigerc


## Design Details

As shown in the workflow diagram above, the Volcano plugin's work includes:
Member

Suggested change
As shown in the workflow diagram above, the Volcano plugin's work includes:
As shown in the workflow diagram above, we decide to implement a runtime plugin for Volcano with the Kubeflow Trainer Pipeline Framework. It will:

We'd better elaborate on what the plugin is in Design Details. WDYT?


As shown in the workflow diagram above, the Volcano plugin's work includes:

- Build PodGroups based on the *Training Runtime* configuration and calculate resource limits (e.g., `MinResource`).
Member

Suggested change
- Build PodGroups based on the *Training Runtime* configuration and calculate resource limits (e.g., `MinResource`).
- **Build PodGroups** based on the *Training Runtime* configuration and calculate resource limits (e.g., `MinResource`).

For emphasis:)
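As a side note on the MinResource calculation mentioned in this item, a simplified sketch of the aggregation might look like the following. Plain integer counts stand in for the real Kubernetes resource.Quantity API, and all names are illustrative assumptions, not actual plugin code.

```go
package main

import "fmt"

// podRequest is a simplified per-pod resource request
// (real code would use k8s.io/apimachinery's resource.Quantity).
type podRequest struct {
	MilliCPU int64 // CPU request in millicores
	MemBytes int64 // memory request in bytes
	Replicas int64 // number of pods with this request
}

// minResources sums requests across all replicas; the result would be
// written to the PodGroup's spec.minResources so Volcano reserves
// enough capacity before starting any pod.
func minResources(reqs []podRequest) (cpu, mem int64) {
	for _, r := range reqs {
		cpu += r.MilliCPU * r.Replicas
		mem += r.MemBytes * r.Replicas
	}
	return cpu, mem
}

func main() {
	// 4 workers, each requesting 500m CPU and 1 GiB of memory.
	cpu, mem := minResources([]podRequest{
		{MilliCPU: 500, MemBytes: 1 << 30, Replicas: 4},
	})
	fmt.Println(cpu, mem) // prints 2000 4294967296
}
```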

As shown in the workflow diagram above, the Volcano plugin's work includes:

- Build PodGroups based on the *Training Runtime* configuration and calculate resource limits (e.g., `MinResource`).
- Manage PodGroups.
Member

Suggested change
- Manage PodGroups.
- **Manage PodGroups**

Same as above

Comment on lines +133 to +134
- Update PodGroups and perform rescheduling when there are changes in cluster resource demands (e.g., changes in `LimitRange`).
- Support scheduling for suspended and resumed training jobs, with special handling of suspended jobs to ensure no new pods are started. (TrainJob may be set to suspend in its configuration or manually paused by the user.)
Member

Suggested change
- Update PodGroups and perform rescheduling when there are changes in cluster resource demands (e.g., changes in `LimitRange`).
- Support scheduling for suspended and resumed training jobs, with special handling of suspended jobs to ensure no new pods are started. (TrainJob may be set to suspend in its configuration or manually paused by the user.)
- Update: Update PodGroups and perform rescheduling when there are changes in cluster resource demands (e.g., changes in `LimitRange`).
- Suspended/Resumed: Support scheduling for suspended and resumed training jobs, with special handling of suspended jobs to ensure no new pods are started. (TrainJob may be set to suspend in its configuration or manually paused by the user.)

Would adding these sub-titles be better? WDYT @Doris-xm

- Manage PodGroups.
- Update PodGroups and perform rescheduling when there are changes in cluster resource demands (e.g., changes in `LimitRange`).
- Support scheduling for suspended and resumed training jobs, with special handling of suspended jobs to ensure no new pods are started. (TrainJob may be set to suspend in its configuration or manually paused by the user.)
- Bind PodGroups to TrainJobs, with their life cycle controlled by the TrainJob. For example, when a TrainJob is deleted, the associated PodGroup is also deleted.
Member

Suggested change
- Bind PodGroups to TrainJobs, with their life cycle controlled by the TrainJob. For example, when a TrainJob is deleted, the associated PodGroup is also deleted.
- **Binding**: Bind PodGroups to TrainJobs, with their life cycle controlled by the TrainJob. For example, when a TrainJob is deleted, the associated PodGroup is also deleted.

- Update PodGroups and perform rescheduling when there are changes in cluster resource demands (e.g., changes in `LimitRange`).
- Support scheduling for suspended and resumed training jobs, with special handling of suspended jobs to ensure no new pods are started. (TrainJob may be set to suspend in its configuration or manually paused by the user.)
- Bind PodGroups to TrainJobs, with their life cycle controlled by the TrainJob. For example, when a TrainJob is deleted, the associated PodGroup is also deleted.
- Integrate with Volcano control components, submitting tasks to the Volcano Scheduler for scheduling.
Member

As you mentioned below, the actual scheduling process is handled by Volcano. So why do we need to "Integrate with Volcano control components" here?

I guess maybe what you mean is "applying PodGroups to the cluster"?

Comment on lines +144 to +147
Specifically, we create a new structure, `VolcanoPodPolicySource`, to store the Volcano scheduling configuration in `pkg/api/trainer/trainingruntime_type.go`. It will be added as an additional option within the `PodGroupPolicySource`, alongside Coscheduling. The key fields to configure are as follows:

* `Queue`: The queue name used in Volcano. Defaults to the “default” queue, which has the lowest weight.
* `PriorityClassName`: If specified, this indicates the PodGroup’s priority. (For example, "system-node-critical" and "system-cluster-critical" are special keywords that indicate the highest priorities, with the former being the highest.) This field is optional.
Member

How about creating a code block to elaborate the detailed design?

```golang
type PodGroupPolicy struct {
	// Configuration for gang-scheduling using various plugins.
	PodGroupPolicySource `json:",inline"`
}

// Only one of its members may be specified.
type PodGroupPolicySource struct {
	// Coscheduling plugin from the Kubernetes scheduler-plugins for gang-scheduling.
	Coscheduling *CoschedulingPodGroupPolicySource `json:"coscheduling,omitempty"`
}

// The number of min members in the PodGroupSpec is always equal to the number of nodes.
type CoschedulingPodGroupPolicySource struct {
	// Time threshold to schedule PodGroup for gang-scheduling.
	// If the scheduling timeout is equal to 0, the default value is used.
	// Defaults to 60 seconds.
	ScheduleTimeoutSeconds *int32 `json:"scheduleTimeoutSeconds,omitempty"`
}
```


The main installation steps are as follows:

1. **Install Volcano** (users must install it beforehand). A deployment YAML file ([volcano-development.yaml](https://raw.githubusercontent.com/volcano-sh/volcano/release-1.10/installer/volcano-development.yaml)) is provided. The key CRDs include *PodGroup*, *Queue*. The main control components include *controller-manager*, *admission*, and *scheduler*.
Member

Shall we add the Volcano manifest to Trainer's manifests?

WDYT? @Doris-xm @andreyvelich @tenzen-y @astefanutti @rudeigerc

The main installation steps are as follows:

1. **Install Volcano** (users must install it beforehand). A deployment YAML file ([volcano-development.yaml](https://raw.githubusercontent.com/volcano-sh/volcano/release-1.10/installer/volcano-development.yaml)) is provided. The key CRDs include *PodGroup*, *Queue*. The main control components include *controller-manager*, *admission*, and *scheduler*.
2. **Modify the RBAC permissions in the manifest.** We should ensure that Trainer has the necessary management rights for the Volcano PodGroup, Queue CRD.
Member

We do not need to modify the RBAC permissions directly in the manifest. In fact, we'll:

  1. Add annotations in the runtime plugin, like:

// +kubebuilder:rbac:groups=scheduling.x-k8s.io,resources=podgroups,verbs=create;get;list;watch;update;patch

  2. Run make manifests to update the RBAC permissions automatically:

trainer/Makefile, lines 114 to 121 in b71a690:

```makefile
# Instructions for code generation.
.PHONY: manifests
manifests: controller-gen ## Generate manifests.
	$(CONTROLLER_GEN) "crd:generateEmbeddedObjectMeta=true" rbac:roleName=kubeflow-trainer-controller-manager webhook \
		paths="./pkg/apis/trainer/v1alpha1/...;./pkg/controller/...;./pkg/runtime/...;./pkg/webhooks/...;./pkg/util/cert/..." \
		output:crd:artifacts:config=manifests/base/crds \
		output:rbac:artifacts:config=manifests/base/rbac \
		output:webhook:artifacts:config=manifests/base/webhook
```


@Electronic-Waste: GitHub didn't allow me to request PR reviews from the following users: rudeigerc.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

@Doris-xm Thanks for the updates. It looks good to me!

/cc @kubeflow/wg-training-leads @astefanutti @rudeigerc

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Electronic-Waste
Member
Electronic-Waste commented Jun 24, 2025

@Doris-xm Btw, you can use /retest to re-trigger only the failed test cases :)

Signed-off-by: Xinmin Du <2812493086@qq.com>