8000 Stop and resume a model [Raw Deployment] by hdefazio · Pull Request #4455 · kserve/kserve · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Stop and resume a model [Raw Deployment] #4455

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Jun 11, 2025

Conversation

hdefazio
Copy link
Contributor
@hdefazio hdefazio commented May 9, 2025

What this PR does / why we need it:
This adds the ability to stop and resume services in raw deployment mode without permanently deleting configurations or incurring delays caused by gradual scaling processes

This PR also makes some minor changes to the work done in #4337

Which issue(s) this PR fixes (optional, in fixes #(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #4207

Type of changes
Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Feature/Issue validation/testing:

Please describe the tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

cluster:
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.15+ff493be

Test setup:

$ kubectl create ns stop-resume-demo
$ kubectl config set-context --current --namespace stop-resume-demo 
$ kustomize build ./config/runtimes/| sed 's/ClusterServingRuntime/ServingRuntime/g' |kubectl create -n stop-resume-demo -f -

$ podman build -t kserve:stop-resume . 
$ podman push kserve:stop-resume  quay.io/hdefazio/kserve:stop-resume

Edited kserve-controller-manager to use quay.io/hdefazio/kserve:stop-resume

Test:

  1. Normal deployment --> add the annotation

create a isvc without the annotation

$ cat <<EOF|kubectl apply -f -
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    serving.knative.openshift.io/enablePassthrough: 'true'
    sidecar.istio.io/inject: 'true'
    sidecar.istio.io/rewriteAppHTTPProbers: 'true'
    serving.kserve.io/deploymentMode: RawDeployment
  name: "sklearndemo1"
  namespace: stop-resume-demo
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
EOF

observe that the inference service is Ready

$ kubectl get isvc -n stop-resume-demo
NAME           URL                                                                                         READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION            AGE
sklearndemo1   https://sklearndemo1-stop-resume-demo.apps.rosa.han-stop-resume.wtsd.p3.openshiftapps.com   True           100                              sklearndemo1-predictor-00002   2m8s
$ kubectl get pods -n stop-resume-demo
NAME                                                       READY   STATUS    RESTARTS   AGE
sklearndemo1-predictor-00002-deployment-679bd4754f-mwvsv   3/3     Running   0          4m50s

Stop the model

$ kubectl edit isvc/sklearndemo1 -n stop-resume-demo 
Add annotation `serving.kserve.io/stop: 'true'`

It should remove all objects except the isvc

Observe that the ISVC now has a Stopped condition

kubectl describe isvc/sklearndemo1 -n stop-resume-demo 
  1. Stopped model--> set the annotation to false
    Resume the model
$ kubectl edit isvc/sklearndemo1 -n stop-resume-demo 
Set annotation `serving.kserve.io/stop: 'false'`

Observe that the deployment, and pods are created and ready and that all conditions on the ISVC are Ready

  • Logs

Special notes for your reviewer:

  1. Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

Checklist:

  • Have you added unit/e2e tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

Release note:


Re-running failed tests

  • /rerun-all - rerun all failed workflows.
  • /rerun-workflow <workflow name> - rerun a specific failed workflow. Only one workflow name can be specified. Multiple /rerun-workflow commands are allowed per comment.

@hdefazio hdefazio marked this pull request as draft May 9, 2025 02:54
@hdefazio hdefazio force-pushed the feat/raw_stop_resume branch from 312b5b5 to 6a5e48d Compare May 9, 2025 14:45
@hdefazio hdefazio mentioned this pull request May 9, 2025
9 tasks
@johnugeorge
Copy link
Contributor

@hdefazio Can you change the references to KServe issues?

@hdefazio
Copy link
Contributor Author

@johnugeorge Just updated!

@hdefazio hdefazio force-pushed the feat/raw_stop_resume branch from e089c6d to 84abfaf Compare May 19, 2025 20:37
@hdefazio
Copy link
Contributor Author

/rerun-all

1 similar comment
@hdefazio
Copy link
Contributor Author

/rerun-all

Copy link
Member
@pierDipi pierDipi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would've structured the feature reconciliation a little bit differently but the Serverless mode was done this way, so I'd stay consistent, but was it considered to have the "stop" be a separate reconciliation path since it's mostly deleting child resources so that not every sub-reconciler needs to be aware of the stop feature (which would also reduce the level of branching each has)?

@pierDipi
Copy link
Member

In CI job E2E Tests / test-path-based-routing (pull_request) I see

Loading image kserve-localmodel-controller-ac5980e00a94684e3d201e35e739a4b390a456cb
write /manager: no space left on device

in Go / Build (pull_request)

E0519 20:42:24.824750   34583 suite_test.go:52] unable to start control plane itself: failed to start the controlplane. retried 5 times: exec: "etcd": executable file not found in $PATHFailed to start testing panel

they look like unrelated errors

@hdefazio hdefazio marked this pull request as ready for review May 27, 2025 14:17
@hdefazio hdefazio force-pushed the feat/raw_stop_resume branch 3 times, most recently from ff55e8e to e663d00 Compare May 30, 2025 03:57
Comment on lines -556 to -559
// Do a dry-run update to avoid diffs generated by default values.
// This will populate our local httpRoute with any default values that are present on the remote version.
if err := r.client.Update(ctx, desired, client.DryRunAll); err != nil {
log.Error(err, "Failed to perform dry-run update for predictor HttpRoute", "name", desired.Name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The removal of the dry-run, wouldn't it have any side effects? I think the cluster sets some defaults, so the controller may keep updating it constantly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed semanticHttpRouteEquals to use DeepDerivative instead of DeepEqual, which should accomplish determining the diff without needing a dry run. I can change it back if you're worried about it though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pierDipi Suggested it to help clean up the functions

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK sounds good. I did see the change to DeepDerivative. I didn't relate this removal to that change.

Comment on lines -163 to -166
// Do a dry-run update to avoid diffs generated by default values.
// This will populate our local httpRoute with any default values that are present on the remote version.
if err := o.client.Update(ctx, desired, client.DryRunAll); err != nil {
log.Error(err, "Failed to perform dry-run update for OTel Collector", "name", desired.Name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar question here, about if the missing dry-run would lead to constantly updating the resource?

@@ -289,6 +299,11 @@ func (r *ServiceReconciler) Reconcile(ctx context.Context) ([]*corev1.Service, e
opErr = r.client.Create(ctx, svc)
case constants.CheckResultUpdate:
opErr = r.client.Update(ctx, svc)
case constants.CheckResultDelete:
log.Info("Deleting service", "namespace", svc.Namespace, "name", svc.Name)
if svc.GetDeletionTimestamp() == nil { // check if the service was already deleted
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No check for owner reference... Similarly, in the deployment_reconciler.
I can suggest to add the check in both places.

@israel-hdez
Copy link
Contributor

The very large error has its root in a panic reported here: https://github.com/kserve/kserve/actions/runs/15434418706/job/43437934692?pr=4455#step:5:666 :

panic: test timed out after 10m0s

This is go test timing out. I disabled the timeout and I get a much more readable report:

Summarizing 3 Failures:
  [FAIL] v1beta1 inference service controller When creating inference service with raw kube predictor and `serving.kserve.io/stop` [It] Should delete the httproute/service/deployment/hpa when the annotation is updated to true on an existing ISVC
  /home/ehernand/Projects/kserve/pkg/controller/v1beta1/inferenceservice/rawkube_controller_test.go:2593
  [FAIL] v1beta1 inference service controller When creating inference service with raw kube predictor and transformer [It] Should have httproute/service/deployment/hpa created for transformer and predictor
  /home/ehernand/Projects/kserve/pkg/controller/v1beta1/inferenceservice/rawkube_controller_test.go:4686
  [FAIL] v1beta1 inference service controller When creating inference service with raw kube path based routing predictor and transformer [It] Should have ingress/service/deployment/hpa created for transformer and predictor
  /home/ehernand/Projects/kserve/pkg/controller/v1beta1/inferenceservice/rawkube_controller_test.go:7217

Ran 62 of 62 Specs in 1212.018 seconds
FAIL! -- 59 Passed | 3 Failed | 0 Pending | 0 Skipped
--- FAIL: TestV1beta1APIs (1212.02s)

Notice the last line TestV1beta1APIs (1212.02s) reports a full run is taking about 20minutes in my machine, while go test default timeout is 10minutes. From docs:

	-timeout d
	    If a test binary runs longer than duration d, panic.
	    If d is 0, the timeout is disabled.
	    The default is 10 minutes (10m).

I don't think we should disable the timeout. IMO we should simply increase it to allow for the units to run properly. You can do it in the Makefile. I can suggest a 30min timeout given the results on my machine.

About the failing tests, the existing ones:

  • rawkube_controller_test.go:4686
  • rawkube_controller_test.go:7217

apparently are just failing because the check is out of order, and both give a diff like the following:

                              {
      -                                 Type:     "TransformerReady",
      +                                 Type:     "Stopped",
      -                                 Status:   "True",
      +                                 Status:   "False",
                                        Severity: "Info",
                                        ... // 1 ignored and 2 identical fields
                                },
                                {
      -                                 Type:     "Stopped",
      +                                 Type:     "TransformerReady",
      -                                 Status:   "False",
      +                                 Status:   "True",
                                        Severity: "Info",
                                        ... // 1 ignored and 2 identical fields
                                },

suggesting you just need to swap the array you updated here: https://github.com/kserve/kserve/pull/4455/files#r2127716685 and here: https://github.com/kserve/kserve/pull/4455/files#r2127718390. I have to say that the different order looks suspicious, although in practice order doesn't really matter.

Finally, the unit failing in rawkube_controller_test.go:2593, suggests that something is buggy. The error reported is that the Route is missing:

[FAILED] Timed out after 60.000s.
  Expected success, but got an error:
      <*errors.StatusError | 0xc00183ad20>: 
      HTTPRoute.gateway.networking.k8s.io "stop-edit-true-isvc" not found

I don't know if it the unit is buggy, or if the code is buggy. At the moment, I know that, apparently, the HTTPRoute reconciler is not running.

@israel-hdez
Copy link
Contributor
israel-hdez commented Jun 5, 2025

The problematic unit rawkube_controller_test.go:2593 is failing because the Context is buggy.

Inside its containing Context("When creating inference service with raw kube predictor and ``serving.kserve.io/stop``") you have a configs variable:

configs := map[string]string{
			"explainers": `[...snip...]`,
			"ingress": `[...snip...]`,
			"storageInitializer": `[...snip...]`,
		}

This variable is assigned inside createInferenceServiceConfigMap function, but the assignment is ordinary. Since maps are reference types, no copy is done.

Thus, units requiring to turn off Gateway API, like It("Should keep the ingress/service/deployment/keda/otel created when gateway api is disabled and the annotation is set to false") here are mutating the global configs and the mutated config is used in following units.

To fix, see comment: https://github.com/kserve/kserve/pull/4455/files#r2130518090

It turns out that @pierDipi was right that Gateway API was configured as disabled.

@hdefazio hdefazio force-pushed the feat/raw_stop_resume branch 3 times, most recently from de87268 to 2e1ff42 Compare June 11, 2025 13:00
hdefazio added 7 commits June 11, 2025 09:00
…it, add initial test case

Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
… vars, cleanup

Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
hdefazio added 7 commits June 11, 2025 09:00
…isvc method that does the same thing

Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
…or deployment and service

Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
@hdefazio hdefazio force-pushed the feat/raw_stop_resume branch from 2e1ff42 to 2513050 Compare June 11, 2025 13:00
@hdefazio
Copy link
Contributor Author
< 10000 task-lists disabled sortable>

/rerun-all

1 similar comment
@hdefazio
Copy link
Contributor Author

/rerun-all

Copy link
Contributor
@israel-hdez israel-hdez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @hdefazio

@Jooho
Copy link
Contributor
Jooho commented Jun 11, 2025

/approve
/lgtm

@github-actions github-actions bot added the lgtm label Jun 11, 2025
@Jooho Jooho added the approved label Jun 11, 2025
@Jooho Jooho merged commit 6530d1e into kserve:master Jun 11, 2025
94 of 100 checks passed
@sivanantha321
Copy link
Member

@hdefazio Documenting this feature would be nice !

@@ -509,7 +509,7 @@ func createRawTopLevelHTTPRoute(isvc *v1beta1.InferenceService, ingressConfig *v
}

func semanticHttpRouteEquals(desired, existing *gatewayapiv1.HTTPRoute) bool {
return equality.Semantic.DeepEqual(desired.Spec, existing.Spec)
return equality.Semantic.DeepDerivative(desired.Spec, existing.Spec)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the docs

DeepDerivative is similar to DeepEqual except that unset fields in a1 are ignored (not compared). This allows us to focus on the fields that matter to the semantic comparison.

The unset fields include a nil pointer and an empty string.

@hdefazio In the backendobjectreference, the Group field is set to empty string (which is default). If it is accidently updated, does the controller pick up the change and update it to the desired spec i.e. set the Group field to an empty string ?

@hdefazio
Copy link
Contributor Author

@sivanantha321 Got it, I'll open a task on my side to write some documentation. I'll add it to the next sprint we have in ~2 weeks

hdefazio added a commit to hdefazio/kserve that referenced this pull request Jun 12, 2025
Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
hdefazio added a commit to hdefazio/kserve that referenced this pull request Jun 13, 2025
Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
openshift-merge-bot bot pushed a commit to opendatahub-io/kserve that referenced this pull request Jun 13, 2025
…658)

* Stop and resume a model [Raw Deployment] (kserve#4455)

Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>

* Add missing stopped=false status in test

---------

Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
hdefazio added a commit to hdefazio/kserve that referenced this pull request Jun 13, 2025
…serve#658)

* Stop and resume a model [Raw Deployment] (kserve#4455)

Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>

* Add missing stopped=false status in test

---------

Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support stop InferenceService
6 participants
0