KEP-2437: Support Volcano Scheduler in Kubeflow Trainer V2 by Doris-xm · Pull Request #2672 · kubeflow/trainer · GitHub

KEP-2437: Support Volcano Scheduler in Kubeflow Trainer V2 #2672


Open
wants to merge 16 commits into master from KEP-volcano-scheduler
Conversation

Doris-xm

What this PR does / why we need it:

This PR converts the GSoC proposal into a KEP. The detailed description of the project is in Project 10: Support Volcano Scheduler in Kubeflow Trainer.

Which issue(s) this PR fixes:
Part of #2671

Checklist:

  • Docs included if any changes are user facing

Signed-off-by: Xinmin Du <2812493086@qq.com>
@google-oss-prow google-oss-prow bot requested a review from jinchihe June 16, 2025 13:44

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gaocegege for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coveralls
coveralls commented Jun 16, 2025

Pull Request Test Coverage Report for Build 15849813368

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 29.19%

Totals:
  • Change from base Build 15579727901: 0.0%
  • Covered Lines: 897
  • Relevant Lines: 3073

💛 - Coveralls

Signed-off-by: Xinmin Du <2812493086@qq.com>
@Doris-xm Doris-xm force-pushed the KEP-volcano-scheduler branch from 2c6f9c6 to bbc1cee Compare June 16, 2025 14:04
Member
@Electronic-Waste Electronic-Waste left a comment


@Doris-xm Thanks for this great work! I've left my initial comments for you.

/cc @kubeflow/wg-training-leads @astefanutti @rudeigerc


**Kubeflow Trainer** is a core component of the Kubeflow ecosystem, responsible for managing and executing distributed training jobs. In distributed training scenarios, an efficient **scheduling mechanism** is crucial:

- A distributed training job typically involves multiple pods (such as parameter servers and worker nodes) running in coordination. To avoid resource wastage, all pods need to be started at the same time. That’s why **Gang Scheduling** matters.
Member

The PS-Worker paradigm is unique to TensorFlow. Since we decided to remove TF support in Trainer V2, can you replace it with a new example?

REF: https://cloud-native.slack.com/archives/C0742LDFZ4K/p1749951840811039

Author

Thanks for reviewing. I will use the torchrun training process as the example instead.
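To make the gang-scheduling requirement concrete with a torchrun-style example, here is a minimal Go sketch (the type and function names are illustrative only, not actual Trainer APIs): the PodGroup's minMember must equal the total pod count so that the scheduler admits the job only when every pod can start together.

```go
package main

import "fmt"

// replicaSpec is a simplified stand-in for a TrainJob pod set
// (e.g., torchrun nodes); it is NOT a real Trainer API type.
type replicaSpec struct {
	Name     string
	Replicas int32
}

// gangMinMember computes the minMember a PodGroup would need so the
// scheduler admits the job only when all pods can start together.
func gangMinMember(specs []replicaSpec) int32 {
	var total int32
	for _, s := range specs {
		total += s.Replicas
	}
	return total
}

func main() {
	// A torchrun job with 4 nodes: all 4 pods must be gang-scheduled.
	specs := []replicaSpec{{Name: "node", Replicas: 4}}
	fmt.Println(gangMinMember(specs)) // prints 4
}
```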

Comment on lines 48 to 51
As shown in the diagram, users can utilize advanced scheduling in two ways:

1. By specifying the scheduling spec when customizing *ClusterTrainingRuntime* / *TrainRuntime*. Suitable for platform engineers who are familiar with the Kubernetes API and the Volcano scheduler.
2. By choosing a *TrainingRuntime* with a specific scheduling method in the *TrainJob*. Suitable for data scientists who don't need to understand the underlying implementation details.
Member

I think the two ways you mentioned above are not two separate processes. In fact, the platform engineer creates and manages the CTRs/TRs, from which data scientists choose one and apply a TrainJob on top of it. So it's a single process. WDYT? @Doris-xm

Author

Yes, the two ways are not unrelated processes; they are in an upstream-downstream relationship. So how about describing it as a two-stage workflow?

Member

FYR, you can describe it in temporal order, like:

  1. First, Platform Engineers...
  2. Then, Data Scientists will choose and...

For the detailed expression, you could refer to: #2437 (comment)

@google-oss-prow google-oss-prow bot requested review from a team and astefanutti June 23, 2025 09:44

@Electronic-Waste: GitHub didn't allow me to request PR reviews from the following users: rudeigerc.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

@Doris-xm Thanks for this great work! I've left my initial comments for you.

/cc @kubeflow/wg-training-leads @astefanutti @rudeigerc

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Electronic-Waste
Member

@Doris-xm I also recommend changing the title of this PR to either:

  1. KEP-2437: Support Volcano Scheduler in Kubeflow Trainer V2
  2. [GSoC] Project 10: Support Volcano Scheduler in Kubeflow Trainer V2

It will be clearer and neater, I think :)

Doris-xm and others added 11 commits June 24, 2025 16:20
Co-authored-by: Shao Wang <2690692950@qq.com>
Signed-off-by: Du Xinmin <2812493086@qq.com>
Signed-off-by: Xinmin Du <2812493086@qq.com>
@Doris-xm Doris-xm changed the title KEP-2437: Creating Kubeflow Enhancement Proposal KEP-2437: Support Volcano Scheduler in Kubeflow Trainer V2 Jun 24, 2025
Doris-xm added 2 commits June 24, 2025 18:47
Signed-off-by: Xinmin Du <2812493086@qq.com>
Signed-off-by: Xinmin Du <2812493086@qq.com>
@Doris-xm
Author

/rerun-all

Member
@Electronic-Waste Electronic-Waste left a comment


@Doris-xm Thanks for the updates. It looks good to me!

/cc @kubeflow/wg-training-leads @astefanutti @rudeigerc


## Design Details

As shown in the workflow diagram above, the Volcano plugin's work includes:
Member

Suggested change
As shown in the workflow diagram above, the Volcano plugin's work includes:
As shown in the workflow diagram above, we decide to implement a runtime plugin for Volcano with the Kubeflow Trainer Pipeline Framework. It will:

We'd better elaborate on what the plugin is in Design Details. WDYT?


As shown in the workflow diagram above, the Volcano plugin's work includes:

- Build PodGroups based on the *Training Runtime* configuration and calculate resource limits (e.g., `MinResource`).
Member

Suggested change
- Build PodGroups based on the *Training Runtime* configuration and calculate resource limits (e.g., `MinResource`).
- **Build PodGroups** based on the *Training Runtime* configuration and calculate resource limits (e.g., `MinResource`).

For emphasis:)
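As a side note on the MinResource calculation mentioned in this item, a simplified sketch of the aggregation might look like the following. Plain integer counts stand in for the real Kubernetes resource.Quantity API, and all names are illustrative assumptions, not actual plugin code.

```go
package main

import "fmt"

// podRequest is a simplified per-pod resource request
// (real code would use k8s.io/apimachinery's resource.Quantity).
type podRequest struct {
	MilliCPU int64 // CPU request in millicores
	MemBytes int64 // memory request in bytes
	Replicas int64 // number of pods with this request
}

// minResources sums requests across all replicas; the result would be
// written to the PodGroup's spec.minResources so Volcano reserves
// enough capacity before starting any pod.
func minResources(reqs []podRequest) (cpu, mem int64) {
	for _, r := range reqs {
		cpu += r.MilliCPU * r.Replicas
		mem += r.MemBytes * r.Replicas
	}
	return cpu, mem
}

func main() {
	// 4 workers, each requesting 500m CPU and 1 GiB of memory.
	cpu, mem := minResources([]podRequest{
		{MilliCPU: 500, MemBytes: 1 << 30, Replicas: 4},
	})
	fmt.Println(cpu, mem) // prints 2000 4294967296
}
```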

As shown in the workflow diagram above, the Volcano plugin's work includes:

- Build PodGroups based on the *Training Runtime* configuration and calculate resource limits (e.g., `MinResource`).
- Manage PodGroups.
Member

Suggested change
- Manage PodGroups.
- **Manage PodGroups**

Same as above

Comment on lines +133 to +134
- Update PodGroups and perform rescheduling when there are changes in cluster resource demands (e.g., changes in `LimitRange`).
- Support scheduling for suspended and resumed training jobs, with special handling of suspended jobs to ensure no new pods are started. (TrainJob may be set to suspend in its configuration or manually paused by the user.)
Member

Suggested change
- Update PodGroups and perform rescheduling when there are changes in cluster resource demands (e.g., changes in `LimitRange`).
- Support scheduling for suspended and resumed training jobs, with special handling of suspended jobs to ensure no new pods are started. (TrainJob may be set to suspend in its configuration or manually paused by the user.)
- Update: Update PodGroups and perform rescheduling when there are changes in cluster resource demands (e.g., changes in `LimitRange`).
- Suspended/Resumed: Support scheduling for suspended and resumed training jobs, with special handling of suspended jobs to ensure no new pods are started. (TrainJob may be set to suspend in its configuration or manually paused by the user.)

Would adding these sub-titles be better? WDYT @Doris-xm

- Manage PodGroups.
- Update PodGroups and perform rescheduling when there are changes in cluster resource demands (e.g., changes in `LimitRange`).
- Support scheduling for suspended and resumed training jobs, with special handling of suspended jobs to ensure no new pods are started. (TrainJob may be set to suspend in its configuration or manually paused by the user.)
- Bind PodGroups to TrainJobs, with their life cycle controlled by the TrainJob. For example, when a TrainJob is deleted, the associated PodGroup is also deleted.
Member

Suggested change
- Bind PodGroups to TrainJobs, with their life cycle controlled by the TrainJob. For example, when a TrainJob is deleted, the associated PodGroup is also deleted.
- **Binding**: Bind PodGroups to TrainJobs, with their life cycle controlled by the TrainJob. For example, when a TrainJob is deleted, the associated PodGroup is also deleted.

- Update PodGroups and perform rescheduling when there are changes in cluster resource demands (e.g., changes in `LimitRange`).
- Support scheduling for suspended and resumed training jobs, with special handling of suspended jobs to ensure no new pods are started. (TrainJob may be set to suspend in its configuration or manually paused by the user.)
- Bind PodGroups to TrainJobs, with their life cycle controlled by the TrainJob. For example, when a TrainJob is deleted, the associated PodGroup is also deleted.
- Integrate with Volcano control components, submitting tasks to the Volcano Scheduler for scheduling.
Member

As you mentioned below, the actual scheduling process is handled by Volcano. So why do we need to "Integrate with Volcano control components" here?

I guess maybe what you mean is "applying PodGroups to the cluster"?

Comment on lines +144 to +147
Specifically, we create a new structure, `VolcanoPodPolicySource`, to store the Volcano scheduling configuration in `pkg/api/trainer/trainingruntime_type.go`. It will be added as an additional option within the `PodGroupPolicySource`, alongside Coscheduling. The key fields to configure are as follows:

* `Queue`: The queue name used in Volcano. Defaults to the “default” queue, which has the lowest weight.
* `PriorityClassName`: If specified, this indicates the PodGroup’s priority. (For example, "system-node-critical" and "system-cluster-critical" are special keywords that indicate the highest priorities, with the former being the highest.) This field is optional.
Member

How about creating a code block to elaborate the detailed design?

```golang
type PodGroupPolicy struct {
	// Configuration for gang-scheduling using various plugins.
	PodGroupPolicySource `json:",inline"`
}

// Only one of its members may be specified.
type PodGroupPolicySource struct {
	// Coscheduling plugin from the Kubernetes scheduler-plugins for gang-scheduling.
	Coscheduling *CoschedulingPodGroupPolicySource `json:"coscheduling,omitempty"`
}

// The number of min members in the PodGroupSpec is always equal to the number of nodes.
type CoschedulingPodGroupPolicySource struct {
	// Time threshold to schedule PodGroup for gang-scheduling.
	// If the scheduling timeout is equal to 0, the default value is used.
	// Defaults to 60 seconds.
	ScheduleTimeoutSeconds *int32 `json:"scheduleTimeoutSeconds,omitempty"`
}
```


The main installation steps are as follows:

1. **Install Volcano** (users must install it beforehand). A deployment YAML file ([volcano-development.yaml](https://raw.githubusercontent.com/volcano-sh/volcano/release-1.10/installer/volcano-development.yaml)) is provided. The key CRDs include *PodGroup*, *Queue*. The main control components include *controller-manager*, *admission*, and *scheduler*.
Member

Shall we add the Volcano manifest to Trainer's manifests?

WDYT? @Doris-xm @andreyvelich @tenzen-y @astefanutti @rudeigerc

The main installation steps are as follows:

1. **Install Volcano** (users must install it beforehand). A deployment YAML file ([volcano-development.yaml](https://raw.githubusercontent.com/volcano-sh/volcano/release-1.10/installer/volcano-development.yaml)) is provided. The key CRDs include *PodGroup*, *Queue*. The main control components include *controller-manager*, *admission*, and *scheduler*.
2. **Modify the RBAC permissions in the manifest.** We should ensure that Trainer has the necessary management rights for the Volcano PodGroup, Queue CRD.
Member

We do not need to modify the RBAC permissions directly in the manifest. In fact, we'll:

  1. Add annotations in the runtime plugin, like:

// +kubebuilder:rbac:groups=scheduling.x-k8s.io,resources=podgroups,verbs=create;get;list;watch;update;patch

  2. Run make manifests to update the RBAC permissions automatically:

trainer/Makefile, lines 114 to 121 in b71a690:

```makefile
# Instructions for code generation.
.PHONY: manifests
manifests: controller-gen ## Generate manifests.
	$(CONTROLLER_GEN) "crd:generateEmbeddedObjectMeta=true" rbac:roleName=kubeflow-trainer-controller-manager webhook \
		paths="./pkg/apis/trainer/v1alpha1/...;./pkg/controller/...;./pkg/runtime/...;./pkg/webhooks/...;./pkg/util/cert/..." \
		output:crd:artifacts:config=manifests/base/crds \
		output:rbac:artifacts:config=manifests/base/rbac \
		output:webhook:artifacts:config=manifests/base/webhook
```


@Electronic-Waste: GitHub didn't allow me to request PR reviews from the following users: rudeigerc.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

@Doris-xm Thanks for the updates. It looks good to me!

/cc @kubeflow/wg-training-leads @astefanutti @rudeigerc

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Electronic-Waste
Member
Electronic-Waste commented Jun 24, 2025

@Doris-xm Btw, you can use /retest to re-trigger only the failed test cases :)

Signed-off-by: Xinmin Du <2812493086@qq.com>