[RayJob] Support deletion policies based on job status #3731
Conversation
Force-pushed from 0ef4ba4 to 4b19d17
type DeleteResource string

type DeletionPolicy struct {
    OnSuccess *DeletionConfig `json:"on_success"`
Convention for JSON tags is to use camel case: onSuccess, onFailure, etc.
updated
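For illustration, here is the struct with the suggested camelCase tags, matching the updated diff later in this review:

type DeletionPolicy struct {
    OnSuccess *DeletionConfig `json:"onSuccess"`
    OnFailure *DeletionConfig `json:"onFailure"`
}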
type DeletionConfig struct {
    // Valid values are 'DeleteCluster', 'DeleteWorkers', 'DeleteSelf' or 'DeleteNone'.
    // +kubebuilder:validation:XValidation:rule="self in ['DeleteCluster', 'DeleteWorkers', 'DeleteSelf', 'DeleteNone']",message="the deleteResource field value must be either 'DeleteCluster', 'DeleteWorkers', 'DeleteSelf', or 'DeleteNone'"
    DeleteResource *DeleteResource `json:"delete_resource"`
Based on discussions in #3714, I think we want this field to be policy or something similar.
changed
@weizhaowz can you test it manually as well and share results?
    policy = *rayJobInstance.Spec.DeletionPolicy.OnSuccess.Policy
} else if rayJobInstance.Status.JobStatus == rayv1.JobStatusFailed {
    policy = *rayJobInstance.Spec.DeletionPolicy.OnFailure.Policy
}
Add an else here with error logging; there are other job statuses to consider:
kuberay/ray-operator/apis/ray/v1alpha1/rayjob_types.go, lines 14 to 21 in 280902f:

// https://docs.ray.io/en/latest/cluster/running-applications/job-submission/jobs-package-ref.html#jobstatus
const (
    JobStatusNew       JobStatus = ""
    JobStatusPending   JobStatus = "PENDING"
    JobStatusRunning   JobStatus = "RUNNING"
    JobStatusStopped   JobStatus = "STOPPED"
    JobStatusSucceeded JobStatus = "SUCCEEDED"
    JobStatusFailed    JobStatus = "FAILED"
added
} else if rayJobInstance.Status.JobStatus == rayv1.JobStatusFailed {
    policy = *rayJobInstance.Spec.DeletionPolicy.OnFailure.Policy
} else {
    logger.Error(errors.NewBadRequest("the job's status is unrecognized for deletion"), "job status", string(rayJobInstance.Status.JobStatus))
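For context, a hedged sketch of how the terminal-status check could be factored out; isTerminal is a hypothetical helper written for this illustration, not code from this PR:

// isTerminal reports whether a Ray job status is terminal, i.e. the job
// can no longer change state, so a deletion policy can safely be applied.
func isTerminal(status rayv1.JobStatus) bool {
    switch status {
    case rayv1.JobStatusSucceeded, rayv1.JobStatusFailed, rayv1.JobStatusStopped:
        return true
    default:
        // "" (new), PENDING, and RUNNING may still transition on a later reconcile.
        return false
    }
}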
Log info instead and requeue since the job status could change in the next reconcile:
logger.Info("jobStatus not valid for deletion", "jobStatus", rayJobInstance.Status.JobStatus)
return ctrl.Result{RequeueAfter: time.Duration(delta) * time.Second}, nil
Thanks for the context, it's updated. If the current time is already past the shutdown time, we don't need to wait for the status to change, is that correct?
    len(rayJobInstance.Spec.ClusterSelector) == 0 {

    policy := rayv1.DeleteNone
    if rayJobInstance.Status.JobStatus == rayv1.JobStatusSucceeded {
What's the expected behavior if only onSuccess or onFailure is set? What should be the default deletion policy in that case?
In ray-operator/controllers/ray/utils/validation.go, it's validated that both onSuccess and onFailure are set; otherwise it returns an error.
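A minimal sketch of that validation, assuming the type names used elsewhere in this PR and a standard fmt import; the actual function in validation.go may differ:

func validateDeletionStrategy(rayJob *rayv1.RayJob) error {
    ds := rayJob.Spec.DeletionStrategy
    if ds == nil {
        // Nothing to validate when no strategy is set.
        return nil
    }
    if ds.OnSuccess.Policy == nil || ds.OnFailure.Policy == nil {
        return fmt.Errorf("both onSuccess.policy and onFailure.policy must be set when deletionStrategy is specified")
    }
    return nil
}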
@@ -84,13 +84,24 @@ const (
    InteractiveMode JobSubmissionMode = "InteractiveMode" // Don't submit job in KubeRay. Instead, wait for user to submit job and provide the job submission ID.
)

type DeletionPolicy string
type Policy string
Just type Policy string is too generic I think; it should be updated to:

type DeletionPolicyType string

or something like that.
changed
type DeletionPolicy string
type Policy string

type DeletionPolicy struct {
What do you think about this naming instead:
type DeletionStrategy struct {
OnSuccess DeletionPolicy `json:"onSuccess"`
OnFailure DeletionPolicy `json:"onFailure"`
}
Yes, I think strategy is good since we already use policy to name the field.
type DeletionPolicy struct {
    OnSuccess *DeletionConfig `json:"onSuccess"`
    OnFailure *DeletionConfig `json:"onFailure"`
DeletionConfig doesn't need to be a pointer since it's a required field when DeletionPolicy is set.
changed
    OnFailure *DeletionConfig `json:"onFailure"`
}

type DeletionConfig struct {
type DeletionPolicy struct {
Policy DeletionPolicyType `json:"policy"`
}
changed
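Putting the review suggestions together, the API roughly converges on the following shape (reconstructed from the diffs in this thread; exact names and comments in the merged code may differ):

type DeletionPolicyType string

const (
    DeleteCluster DeletionPolicyType = "DeleteCluster"
    DeleteWorkers DeletionPolicyType = "DeleteWorkers"
    DeleteSelf    DeletionPolicyType = "DeleteSelf"
    DeleteNone    DeletionPolicyType = "DeleteNone"
)

// DeletionPolicy wraps a single policy value.
type DeletionPolicy struct {
    Policy *DeletionPolicyType `json:"policy"`
}

// DeletionStrategy selects a deletion policy per terminal job status.
type DeletionStrategy struct {
    OnSuccess DeletionPolicy `json:"onSuccess"`
    OnFailure DeletionPolicy `json:"onFailure"`
}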
    // +optional
    DeletionPolicy *DeletionPolicy `json:"deletionPolicy,omitempty"`
    DeletionStrategy *DeletionStrategy `json:"deletionPolicy,omitempty"`
change json tag to deletionStrategy
changed
@@ -23,7 +23,7 @@ const (
    // alpha: v1.3
    //
    // Enables new deletion policy API in RayJob
    RayJobDeletionPolicy featuregate.Feature = "RayJobDeletionPolicy"
I don't think we can change the feature gate name; leave it as RayJobDeletionPolicy.
reverted
I'm looking forward to trying this out. Can we add or update a sample RayJob YAML showing how to use the various options? It might be helpful for an e2e test if you didn't write one already.
Hi, I've shared the manual test doc with you. The RayJob is from ray-operator/config/samples/ray-job.sample.yaml, and I added the deletionStrategy to the Spec (a sketch of that block follows the test steps below).
Tested; here is the doc, and the results look good.
Manually Test RayJob Deletion Policy

Expect RayJob succeeds and cluster deleted:
1. Create a cluster, enable RayJobDeletionPolicy in helm-chart/kuberay-operator/values.yaml, and then build the KubeRay operator.
2. Confirm the feature is enabled.
3. Run the RayJob rayjob-sample.yaml above.
4. Observe the job.
5. Confirm the cluster is deleted when the job succeeds.

Expect RayJob fails and workers deleted:
1. Repeat steps 1, 2, and 3 from the previous section to deploy the job.
2. Observe the job.
3. Confirm the workers are deleted when the job fails.
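For reference, a hedged sketch of what the added deletionStrategy block might look like in the sample manifest; the field layout is reconstructed from the API types above, and the surrounding spec fields are assumptions based on ray-job.sample.yaml:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  # Deletion behavior is selected per terminal job status.
  deletionStrategy:
    onSuccess:
      policy: DeleteCluster   # remove the whole RayCluster on success
    onFailure:
      policy: DeleteWorkers   # keep the head pod around for debugging on failure
  # ... entrypoint, rayClusterSpec, and the rest of the sample spec unchanged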
Overall LGTM, just some minor comments. @kevin85421 @davidxia PTAL if you have a chance
    // If unset, deletion policy is based on 'spec.shutdownAfterJobFinishes'.
    // This field requires the RayJobDeletionPolicy feature gate to be enabled.
    // +kubebuilder:validation:XValidation:rule="self in ['DeleteCluster', 'DeleteWorkers', 'DeleteSelf', 'DeleteNone']",message="the deletionPolicy field value must be either 'DeleteCluster', 'DeleteWorkers', 'DeleteSelf', or 'DeleteNone'"
    // This field requires the RayJobDeletionStrategy feature gate to be enabled.
This field requires the RayJobDeletionPolicy feature gate to be enabled.
changed
} else if rayJobInstance.Status.JobStatus == rayv1.JobStatusFailed {
    policy = *rayJobInstance.Spec.DeletionStrategy.OnFailure.Policy
} else {
    if shutdownTime.After(nowTime) {
What's the reason for this if? Shouldn't we always check for an invalid job status once we've reached the TTL?
This check should probably go at the top of the enclosing condition on line 387.
Moved it up to line 387, but then the condition on line 407 becomes redundant, so we should remove it. The logic is now: 1. if the shutdown time has not been reached, requeue the RayJob; otherwise, 2. check the job status. If it's an invalid status for deletion, the default deletion policy is DeleteNone, so it breaks at line 402, and therefore we don't need to check whether the shutdown time is reached at line 407. Is that correct? Thanks.
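A hedged sketch of the reordered control flow described above; variable and constant names are taken from the diffs in this review, while the requeue duration and overall shape are assumptions for illustration:

// 1. Shutdown time not reached yet: requeue and re-evaluate on the next reconcile.
if shutdownTime.After(nowTime) {
    return ctrl.Result{RequeueAfter: shutdownTime.Sub(nowTime)}, nil
}

// 2. Shutdown time reached: pick a policy from the terminal job status.
policy := rayv1.DeleteNone
switch rayJobInstance.Status.JobStatus {
case rayv1.JobStatusSucceeded:
    policy = *rayJobInstance.Spec.DeletionStrategy.OnSuccess.Policy
case rayv1.JobStatusFailed:
    policy = *rayJobInstance.Spec.DeletionStrategy.OnFailure.Policy
default:
    // Not a terminal status for deletion; keep DeleteNone and log.
    logger.Info("jobStatus not valid for deletion", "jobStatus", rayJobInstance.Status.JobStatus)
}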
Will do. Busy last week. Will try to test manually today or tomorrow.
Manually tested onSuccess with DeleteWorkers, DeleteCluster, and DeleteSelf. They work as expected!
Thanks @davidxia!
Sorry for the late response; I'm still working on the release process. Thank you for the contribution!
Why are these changes needed?
Support different deletion policies when a RayJob ends in different statuses.
Related issue number
[Feature] Update RayJob DeletionPolicy API to differentiate success/failure scenarios #3714