-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Stop and resume a model [Raw Deployment] #4455
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
312b5b5
to
6a5e48d
Compare
@hdefazio Can you change the references to KServe issues? |
@johnugeorge Just updated! |
e089c6d
to
84abfaf
Compare
/rerun-all |
1 similar comment
/rerun-all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would've structured the feature reconciliation a little bit differently but the Serverless mode was done this way, so I'd stay consistent, but was it considered to have the "stop" be a separate reconciliation path since it's mostly deleting child resources so that not every sub-reconciler needs to be aware of the stop feature (which would also reduce the level of branching each has)?
pkg/controller/v1beta1/inferenceservice/components/predictor.go
Outdated
Show resolved
Hide resolved
pkg/controller/v1beta1/inferenceservice/reconcilers/ingress/httproute_reconciler.go
Outdated
Show resolved
Hide resolved
pkg/controller/v1beta1/inferenceservice/components/predictor.go
Outdated
Show resolved
Hide resolved
pkg/controller/v1beta1/inferenceservice/reconcilers/deployment/deployment_reconciler.go
Outdated
Show resolved
Hide resolved
pkg/controller/v1beta1/inferenceservice/reconcilers/deployment/deployment_reconciler.go
Outdated
Show resolved
Hide resolved
pkg/controller/v1beta1/inferenceservice/reconcilers/hpa/hpa_reconciler.go
Outdated
Show resolved
Hide resolved
pkg/controller/v1beta1/inferenceservice/reconcilers/ingress/httproute_reconciler.go
Outdated
Show resolved
Hide resolved
pkg/controller/v1beta1/inferenceservice/reconcilers/service/service_reconciler.go
Outdated
Show resolved
Hide resolved
pkg/controller/v1beta1/inferenceservice/reconcilers/otel/otel_reconciler.go
Outdated
Show resolved
Hide resolved
In CI job E2E Tests / test-path-based-routing (pull_request) I see
they look like unrelated errors |
ff55e8e
to
e663d00
Compare
pkg/controller/v1beta1/inferenceservice/reconcilers/deployment/deployment_reconciler.go
Outdated
Show resolved
Hide resolved
pkg/controller/v1beta1/inferenceservice/reconcilers/ingress/httproute_reconciler.go
Outdated
Show resolved
Hide resolved
pkg/controller/v1beta1/inferenceservice/reconcilers/ingress/httproute_reconciler.go
Outdated
Show resolved
Hide resolved
pkg/controller/v1beta1/inferenceservice/reconcilers/ingress/kube_ingress_reconciler.go
Outdated
Show resolved
Hide resolved
// Do a dry-run update to avoid diffs generated by default values. | ||
// This will populate our local httpRoute with any default values that are present on the remote version. | ||
if err := r.client.Update(ctx, desired, client.DryRunAll); err != nil { | ||
log.Error(err, "Failed to perform dry-run update for predictor HttpRoute", "name", desired.Name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The removal of the dry-run, wouldn't it have any side effects? I think the cluster sets some defaults, so the controller may keep updating it constantly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed semanticHttpRouteEquals to use DeepDerivative instead of DeepEqual, which should accomplish determining the diff without needing a dry run. I can change it back if you're worried about it though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pierDipi Suggested it to help clean up the functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK sounds good. I did see the change to DeepDerivative. I didn't relate this removal to that change.
// Do a dry-run update to avoid diffs generated by default values. | ||
// This will populate our local httpRoute with any default values that are present on the remote version. | ||
if err := o.client.Update(ctx, desired, client.DryRunAll); err != nil { | ||
log.Error(err, "Failed to perform dry-run update for OTel Collector", "name", desired.Name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar question here, about if the missing dry-run would lead to constantly updating the resource?
pkg/controller/v1beta1/inferenceservice/reconcilers/service/service_reconciler.go
Outdated
Show resolved
Hide resolved
@@ -289,6 +299,11 @@ func (r *ServiceReconciler) Reconcile(ctx context.Context) ([]*corev1.Service, e | |||
opErr = r.client.Create(ctx, svc) | |||
case constants.CheckResultUpdate: | |||
opErr = r.client.Update(ctx, svc) | |||
case constants.CheckResultDelete: | |||
log.Info("Deleting service", "namespace", svc.Namespace, "name", svc.Name) | |||
if svc.GetDeletionTimestamp() == nil { // check if the service was already deleted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No check for owner reference... Similarly, in the deployment_reconciler.
I can suggest to add the check in both places.
pkg/controller/v1beta1/inferenceservice/rawkube_controller_test.go
Outdated
Show resolved
Hide resolved
pkg/controller/v1beta1/inferenceservice/rawkube_controller_test.go
Outdated
Show resolved
Hide resolved
The very large error has its root in a
This is
Notice the last line
I don't think we should disable the timeout. IMO we should simply increase it to allow for the units to run properly. You can do it in the Makefile. I can suggest a 30min timeout given the results on my machine. About the failing tests, the existing ones:
apparently are just failing because the check is out of order, and both give a diff like the following:
suggesting you just need to swap the array you updated here: https://github.com/kserve/kserve/pull/4455/files#r2127716685 and here: https://github.com/kserve/kserve/pull/4455/files#r2127718390. I have to say that the different order looks suspicious, although in practice order doesn't really matter. Finally, the unit failing in rawkube_controller_test.go:2593, suggests that something is buggy. The error reported is that the Route is missing:
I don't know if it the unit is buggy, or if the code is buggy. At the moment, I know that, apparently, the HTTPRoute reconciler is not running. |
pkg/controller/v1beta1/inferenceservice/rawkube_controller_test.go
Outdated
Show resolved
Hide resolved
The problematic unit rawkube_controller_test.go:2593 is failing because the Context is buggy. Inside its containing configs := map[string]string{
"explainers": `[...snip...]`,
"ingress": `[...snip...]`,
"storageInitializer": `[...snip...]`,
} This variable is assigned inside createInferenceServiceConfigMap function, but the assignment is ordinary. Since maps are reference types, no copy is done. Thus, units requiring to turn off Gateway API, like To fix, see comment: https://github.com/kserve/kserve/pull/4455/files#r2130518090 It turns out that @pierDipi was right that Gateway API was configured as disabled. |
de87268
to
2e1ff42
Compare
…it, add initial test case Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
… vars, cleanup Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
…isvc method that does the same thing Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
…or deployment and service Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
2e1ff42
to
2513050
Compare
/rerun-all |
1 similar comment
/rerun-all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work @hdefazio
/approve |
@hdefazio Documenting this feature would be nice ! |
@@ -509,7 +509,7 @@ func createRawTopLevelHTTPRoute(isvc *v1beta1.InferenceService, ingressConfig *v | |||
} | |||
|
|||
func semanticHttpRouteEquals(desired, existing *gatewayapiv1.HTTPRoute) bool { | |||
return equality.Semantic.DeepEqual(desired.Spec, existing.Spec) | |||
return equality.Semantic.DeepDerivative(desired.Spec, existing.Spec) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the docs
DeepDerivative is similar to DeepEqual except that unset fields in a1 are ignored (not compared). This allows us to focus on the fields that matter to the semantic comparison.
The unset fields include a nil pointer and an empty string.
@hdefazio In the backendobjectreference, the Group field is set to empty string (which is default). If it is accidently updated, does the controller pick up the change and update it to the desired spec i.e. set the Group field to an empty string ?
@sivanantha321 Got it, I'll open a task on my side to write some documentation. I'll add it to the next sprint we have in ~2 weeks |
Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
…658) * Stop and resume a model [Raw Deployment] (kserve#4455) Signed-off-by: Hannah DeFazio <h2defazio@gmail.com> * Add missing stopped=false status in test --------- Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
…serve#658) * Stop and resume a model [Raw Deployment] (kserve#4455) Signed-off-by: Hannah DeFazio <h2defazio@gmail.com> * Add missing stopped=false status in test --------- Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
What this PR does / why we need it:
This adds the ability to stop and resume services in raw deployment mode without permanently deleting configurations or incurring delays caused by gradual scaling processes
This PR also makes some minor changes to the work done in #4337
Which issue(s) this PR fixes (optional, in fixes #(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #4207
Type of changes
Please delete options that are not relevant.
Feature/Issue validation/testing:
Please describe the tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.
cluster:
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.15+ff493be
Test setup:
Test:
create a isvc without the annotation
observe that the inference service is Ready
Stop the model
It should remove all objects except the isvc
Observe that the ISVC now has a Stopped condition
Resume the model
Observe that the deployment, and pods are created and ready and that all conditions on the ISVC are Ready
Special notes for your reviewer:
Checklist:
Release note:
Re-running failed tests
/rerun-all
- rerun all failed workflows./rerun-workflow <workflow name>
- rerun a specific failed workflow. Only one workflow name can be specified. Multiple /rerun-workflow commands are allowed per comment.