KEP-2437: Support Volcano Scheduler in Kubeflow Trainer V2 #2672
Conversation
Signed-off-by: Xinmin Du <2812493086@qq.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Pull Request Test Coverage Report for Build 15849813368
💛 - Coveralls
Signed-off-by: Xinmin Du <2812493086@qq.com>
Force-pushed from 2c6f9c6 to bbc1cee
@Doris-xm Thanks for this great work! I've left my initial comments for you.
/cc @kubeflow/wg-training-leads @astefanutti @rudeigerc
**Kubeflow Trainer** is a core component of the Kubeflow ecosystem, responsible for managing and executing distributed training jobs. In distributed training scenarios, an efficient **scheduling mechanism** is crucial:

- A distributed training job typically involves multiple pods (such as parameter servers and worker nodes) running in coordination. To avoid resource wastage, all pods need to be started at the same time. That's why **Gang Scheduling** matters.
The PS-Worker paradigm is unique to TensorFlow. Since we've decided to remove TF support in Trainer V2, can you replace it with a new example?
REF: https://cloud-native.slack.com/archives/C0742LDFZ4K/p1749951840811039
Thanks for reviewing. I will take the `torchrun` training process as an example.
As shown in the diagram, users can utilize advanced scheduling in two ways:

1. By specifying the scheduling spec when customizing *ClusterTrainingRuntime* / *TrainingRuntime*. Suitable for platform engineers who are familiar with the Kubernetes API and the Volcano scheduler.
2. By choosing a *TrainingRuntime* with a specific scheduling method in the *TrainJob*. Suitable for data scientists who don't need to understand the underlying implementation details.
I think the two ways you mentioned above are not two separate processes. In fact, the platform engineer creates and manages the CTRs/TRs, from which data scientists will choose one and apply a TrainJob over it. So, it's a single process. WDYT? @Doris-xm
Yes. The two ways are not unrelated processes. They are in an upstream and downstream relationship. So how about describing it as a two-stage workflow?
FYR, you can describe it in temporal order, like:
- First, Platform Engineers...
- Then, Data Scientists will choose and...
For the detailed expression, you could refer to: #2437 (comment)
@Electronic-Waste: GitHub didn't allow me to request PR reviews from the following users: rudeigerc. Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@Doris-xm I also recommend that you change the title of this PR to either:
It will be more clear and neat, I think :)
Co-authored-by: Shao Wang <2690692950@qq.com> Signed-off-by: Du Xinmin <2812493086@qq.com>
Signed-off-by: Xinmin Du <2812493086@qq.com>
/rerun-all
@Doris-xm Thanks for the updates. It looks good to me!
/cc @kubeflow/wg-training-leads @astefanutti @rudeigerc
## Design Details
As shown in the workflow diagram above, the Volcano plugin's work includes: |
As shown in the workflow diagram above, the Volcano plugin's work includes:
As shown in the workflow diagram above, we decide to implement a runtime plugin for Volcano with the Kubeflow Trainer Pipeline Framework. It will:
We'd better elaborate "what is the plugin" in Design Details. WDYT?
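For illustration, a minimal sketch of what such a runtime plugin's skeleton could look like. The `Build` method name and the import path are assumptions for this sketch, not the framework's actual API:

```golang
package volcano

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"

	trainer "github.com/kubeflow/trainer/pkg/apis/trainer/v1alpha1" // path assumed
)

// VolcanoPlugin is a hypothetical skeleton for the runtime plugin; the real
// extension points are defined by the Trainer Pipeline Framework.
type VolcanoPlugin struct {
	client client.Client
}

func (v *VolcanoPlugin) Name() string { return "Volcano" }

// Build derives a PodGroup from the runtime configuration and the TrainJob,
// binds it to the TrainJob, and applies it to the cluster.
func (v *VolcanoPlugin) Build(ctx context.Context, job *trainer.TrainJob) error {
	// ...build the PodGroup, set the owner reference, create/update it...
	return nil
}
```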
As shown in the workflow diagram above, the Volcano plugin's work includes:
- Build PodGroups based on the *Training Runtime* configuration and calculate resource limits (e.g., `MinResource`). |
- Build PodGroups based on the *Training Runtime* configuration and calculate resource limits (e.g., `MinResource`).
- **Build PodGroups** based on the *Training Runtime* configuration and calculate resource limits (e.g., `MinResource`).
For emphasis:)
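As a concrete illustration of the `MinResource` calculation mentioned in this bullet, one plausible approach is to sum the requests of every replica. A sketch; the helper name and inputs are hypothetical:

```golang
package volcano

import (
	corev1 "k8s.io/api/core/v1"
)

// minResources sums the container resource requests across all replicas of
// all pod templates, yielding a candidate value for the PodGroup's
// MinResources field.
func minResources(specs []corev1.PodSpec, replicas []int32) corev1.ResourceList {
	total := corev1.ResourceList{}
	for i, spec := range specs {
		for r := int32(0); r < replicas[i]; r++ {
			for _, c := range spec.Containers {
				for name, req := range c.Resources.Requests {
					sum := total[name]
					sum.Add(req) // resource.Quantity addition
					total[name] = sum
				}
			}
		}
	}
	return total
}
```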
As shown in the workflow diagram above, the Volcano plugin's work includes:

- Build PodGroups based on the *Training Runtime* configuration and calculate resource limits (e.g., `MinResource`).
- Manage PodGroups. |
- Manage PodGroups.
- **Manage PodGroups**
Same as above
- Update PodGroups and perform rescheduling when there are changes in cluster resource demands (e.g., changes in `LimitRange`).
- Support scheduling for suspended and resumed training jobs, with special handling of suspended jobs to ensure no new pods are started. (TrainJob may be set to suspend in its configuration or manually paused by the user.)
- Update PodGroups and perform rescheduling when there are changes in cluster resource demands (e.g., changes in `LimitRange`).
- Support scheduling for suspended and resumed training jobs, with special handling of suspended jobs to ensure no new pods are started. (TrainJob may be set to suspend in its configuration or manually paused by the user.)
- Update: Update PodGroups and perform rescheduling when there are changes in cluster resource demands (e.g., changes in `LimitRange`).
- Suspended/Resumed: Support scheduling for suspended and resumed training jobs, with special handling of suspended jobs to ensure no new pods are started. (TrainJob may be set to suspend in its configuration or manually paused by the user.)
Does adding these sub-titles look better? WDYT @Doris-xm
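Continuing the plugin sketch above, the suspended case might be gated in reconciliation like this, assuming TrainJob exposes a `*bool` suspend field in its spec:

```golang
// reconcilePodGroup is a sketch: while the TrainJob is suspended, skip
// building/updating the PodGroup so that no new pods are started.
func (v *VolcanoPlugin) reconcilePodGroup(ctx context.Context, job *trainer.TrainJob) error {
	if job.Spec.Suspend != nil && *job.Spec.Suspend {
		return nil
	}
	// ...normal path: build/update the PodGroup and let Volcano schedule it...
	return nil
}
```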
- Manage PodGroups.
- Update PodGroups and perform rescheduling when there are changes in cluster resource demands (e.g., changes in `LimitRange`).
- Support scheduling for suspended and resumed training jobs, with special handling of suspended jobs to ensure no new pods are started. (TrainJob may be set to suspend in its configuration or manually paused by the user.)
- Bind PodGroups to TrainJobs, with their life cycle controlled by the TrainJob. For example, when a TrainJob is deleted, the associated PodGroup is also deleted. |
- Bind PodGroups to TrainJobs, with their life cycle controlled by the TrainJob. For example, when a TrainJob is deleted, the associated PodGroup is also deleted.
- **Binding**: Bind PodGroups to TrainJobs, with their life cycle controlled by the TrainJob. For example, when a TrainJob is deleted, the associated PodGroup is also deleted.
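A sketch of the binding, using controller-runtime's owner-reference helper so that Kubernetes garbage collection removes the PodGroup together with its TrainJob (the `trainer` import is the same assumed path as in the plugin sketch above):

```golang
import (
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	schedulingv1beta1 "volcano.sh/apis/pkg/apis/scheduling/v1beta1"
)

// bindPodGroup makes the TrainJob the controlling owner of the PodGroup,
// tying the PodGroup's lifecycle to the TrainJob's.
func bindPodGroup(job *trainer.TrainJob, pg *schedulingv1beta1.PodGroup, scheme *runtime.Scheme) error {
	return controllerutil.SetControllerReference(job, pg, scheme)
}
```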
- Update PodGroups and perform rescheduling when there are changes in cluster resource demands (e.g., changes in `LimitRange`).
- Support scheduling for suspended and resumed training jobs, with special handling of suspended jobs to ensure no new pods are started. (TrainJob may be set to suspend in its configuration or manually paused by the user.)
- Bind PodGroups to TrainJobs, with their life cycle controlled by the TrainJob. For example, when a TrainJob is deleted, the associated PodGroup is also deleted.
- Integrate with Volcano control components, submitting tasks to the Volcano Scheduler for scheduling. |
As you mentioned below, the actual scheduling process is handled by Volcano. So why do we need to "Integrate with Volcano control components" here?
I guess what you mean is "applying PodGroups to the cluster"?
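Something like the following sketch, continuing the plugin skeleton above: the plugin only creates the PodGroup object, and the Volcano scheduler, watching PodGroups, performs the actual gang scheduling. Field values here are illustrative:

```golang
import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	schedulingv1beta1 "volcano.sh/apis/pkg/apis/scheduling/v1beta1"
)

// applyPodGroup creates the PodGroup in the cluster; scheduling it is
// entirely Volcano's responsibility.
func (v *VolcanoPlugin) applyPodGroup(ctx context.Context, job *trainer.TrainJob, minMember int32, min corev1.ResourceList) error {
	pg := &schedulingv1beta1.PodGroup{
		ObjectMeta: metav1.ObjectMeta{Name: job.Name, Namespace: job.Namespace},
		Spec: schedulingv1beta1.PodGroupSpec{
			MinMember:    minMember, // all pods must be admitted together
			Queue:        "default", // from VolcanoPodPolicySource.Queue
			MinResources: &min,      // computed as sketched earlier
		},
	}
	if err := v.client.Create(ctx, pg); err != nil && !apierrors.IsAlreadyExists(err) {
		return err
	}
	return nil
}
```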
Specifically, we create a new structure, `VolcanoPodPolicySource`, to store the Volcano scheduling configuration in `pkg/api/trainer/trainingruntime_type.go`. It will be added as an additional option within the `PodGroupPolicySource`, alongside Coscheduling. The key fields to configure are as follows:

* `Queue`: The queue name used in Volcano. Defaults to the "default" queue, which has the lowest weight.
* `PriorityClassName`: If specified, this indicates the PodGroup's priority. (For example, "system-node-critical" and "system-cluster-critical" are special keywords that indicate the highest priorities, with the former being the highest.) This field is optional.
How about creating a code block to elaborate the detailed design?
trainer/docs/proposals/2170-kubeflow-trainer-v2/README.md
Lines 1032 to 1051 in b71a690
```golang
type PodGroupPolicy struct {
	// Configuration for gang-scheduling using various plugins.
	PodGroupPolicySource `json:",inline"`
}

// Only one of its members may be specified.
type PodGroupPolicySource struct {
	// Coscheduling plugin from the Kubernetes scheduler-plugins for gang-scheduling.
	Coscheduling *CoschedulingPodGroupPolicySource `json:"coscheduling,omitempty"`
}

// The number of min members in the PodGroupSpec is always equal to the number of nodes.
type CoschedulingPodGroupPolicySource struct {
	// Time threshold to schedule PodGroup for gang-scheduling.
	// If the scheduling timeout is equal to 0, the default value is used.
	// Defaults to 60 seconds.
	ScheduleTimeoutSeconds *int32 `json:"scheduleTimeoutSeconds,omitempty"`
}
```
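For reference, a minimal sketch of how the Volcano option might extend this union type, using the `VolcanoPodPolicySource` name and the fields described above; the exact shape is open for review:

```golang
// Only one of its members may be specified.
type PodGroupPolicySource struct {
	// Coscheduling plugin from the Kubernetes scheduler-plugins for gang-scheduling.
	Coscheduling *CoschedulingPodGroupPolicySource `json:"coscheduling,omitempty"`

	// Volcano plugin for gang-scheduling via the Volcano scheduler.
	Volcano *VolcanoPodPolicySource `json:"volcano,omitempty"`
}

// VolcanoPodPolicySource configures gang-scheduling through Volcano.
type VolcanoPodPolicySource struct {
	// Queue is the name of the Volcano queue the PodGroup is submitted to.
	// Defaults to the "default" queue, which has the lowest weight.
	Queue *string `json:"queue,omitempty"`

	// PriorityClassName indicates the PodGroup's priority, if specified.
	PriorityClassName *string `json:"priorityClassName,omitempty"`
}
```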
The main installation steps are as follows:

1. **Install Volcano** (users must install it beforehand). A deployment YAML file ([volcano-development.yaml](https://raw.githubusercontent.com/volcano-sh/volcano/release-1.10/installer/volcano-development.yaml)) is provided. The key CRDs include *PodGroup* and *Queue*. The main control components include *controller-manager*, *admission*, and *scheduler*.
Shall we add the volcano manifest to Trainer's manifest?
WDYT? @Doris-xm @andreyvelich @tenzen-y @astefanutti @rudeigerc
The main installation steps are as follows:

1. **Install Volcano** (users must install it beforehand). A deployment YAML file ([volcano-development.yaml](https://raw.githubusercontent.com/volcano-sh/volcano/release-1.10/installer/volcano-development.yaml)) is provided. The key CRDs include *PodGroup* and *Queue*. The main control components include *controller-manager*, *admission*, and *scheduler*.
2. **Modify the RBAC permissions in the manifest.** We should ensure that Trainer has the necessary management rights for the Volcano PodGroup and Queue CRDs.
We do not need to modify the RBAC permissions directly in the manifest. In fact, we'll:
- Add some annotations in the runtime plugin, like:
  `// +kubebuilder:rbac:groups=scheduling.x-k8s.io,resources=podgroups,verbs=create;get;list;watch;update;patch`
- Run `make manifests` to update the RBAC permissions automatically
Lines 114 to 121 in b71a690
# Instructions for code generation.
.PHONY: manifests
manifests: controller-gen ## Generate manifests.
	$(CONTROLLER_GEN) "crd:generateEmbeddedObjectMeta=true" rbac:roleName=kubeflow-trainer-controller-manager webhook \
		paths="./pkg/apis/trainer/v1alpha1/...;./pkg/controller/...;./pkg/runtime/...;./pkg/webhooks/...;./pkg/util/cert/..." \
		output:crd:artifacts:config=manifests/base/crds \
		output:rbac:artifacts:config=manifests/base/rbac \
		output:webhook:artifacts:config=manifests/base/webhook
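By analogy, the Volcano plugin could declare markers for the `scheduling.volcano.sh` API group, and `make manifests` would regenerate the role. A sketch; the exact verbs are to be settled in review:

```golang
// Sketch of kubebuilder RBAC markers for the Volcano plugin.
// +kubebuilder:rbac:groups=scheduling.volcano.sh,resources=podgroups,verbs=create;get;list;watch;update;patch
// +kubebuilder:rbac:groups=scheduling.volcano.sh,resources=queues,verbs=get;list;watch
```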
@Electronic-Waste: GitHub didn't allow me to request PR reviews from the following users: rudeigerc. Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@Doris-xm Btw, you can use
What this PR does / why we need it:
This PR converts the GSoC proposal to a KEP. The detailed description of the project is Project 10: Support Volcano Scheduler in Kubeflow Trainer.
Which issue(s) this PR fixes:
Part of #2671
Checklist: