Stop and resume a model [Raw Deployment] #4455

hdefazio · 2025-05-09T02:54:22Z

What this PR does / why we need it:
This adds the ability to stop and resume services in raw deployment mode without permanently deleting configurations or incurring delays caused by gradual scaling processes

This PR also makes some minor changes to the work done in #4337

Which issue(s) this PR fixes (optional, in fixes #(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #4207

Type of changes
Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

Feature/Issue validation/testing:

Please describe the tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

cluster:
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.15+ff493be

Test setup:

$ kubectl create ns stop-resume-demo
$ kubectl config set-context --current --namespace stop-resume-demo 
$ kustomize build ./config/runtimes/| sed 's/ClusterServingRuntime/ServingRuntime/g' |kubectl create -n stop-resume-demo -f -

$ podman build -t kserve:stop-resume . 
$ podman push kserve:stop-resume  quay.io/hdefazio/kserve:stop-resume

Edited kserve-controller-manager to use quay.io/hdefazio/kserve:stop-resume

Test:

Normal deployment --> add the annotation

create a isvc without the annotation

$ cat <<EOF|kubectl apply -f -
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    serving.knative.openshift.io/enablePassthrough: 'true'
    sidecar.istio.io/inject: 'true'
    sidecar.istio.io/rewriteAppHTTPProbers: 'true'
    serving.kserve.io/deploymentMode: RawDeployment
  name: "sklearndemo1"
  namespace: stop-resume-demo
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
EOF

observe that the inference service is Ready

$ kubectl get isvc -n stop-resume-demo
NAME           URL                                                                                         READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION            AGE
sklearndemo1   https://sklearndemo1-stop-resume-demo.apps.rosa.han-stop-resume.wtsd.p3.openshiftapps.com   True           100                              sklearndemo1-predictor-00002   2m8s

$ kubectl get pods -n stop-resume-demo
NAME                                                       READY   STATUS    RESTARTS   AGE
sklearndemo1-predictor-00002-deployment-679bd4754f-mwvsv   3/3     Running   0          4m50s

Stop the model

$ kubectl edit isvc/sklearndemo1 -n stop-resume-demo 
Add annotation `serving.kserve.io/stop: 'true'`

It should remove all objects except the isvc

Observe that the ISVC now has a Stopped condition

kubectl describe isvc/sklearndemo1 -n stop-resume-demo

Stopped model--> set the annotation to false
Resume the model

$ kubectl edit isvc/sklearndemo1 -n stop-resume-demo 
Set annotation `serving.kserve.io/stop: 'false'`

Observe that the deployment, and pods are created and ready and that all conditions on the ISVC are Ready

Logs

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

Checklist:

Have you added unit/e2e tests that prove your fix is effective or that this feature works?
Has code been commented, particularly in hard-to-understand areas?
Have you made corresponding changes to the documentation?

Release note:

Re-running failed tests

/rerun-all - rerun all failed workflows.
/rerun-workflow <workflow name> - rerun a specific failed workflow. Only one workflow name can be specified. Multiple /rerun-workflow commands are allowed per comment.

johnugeorge · 2025-05-13T18:01:01Z

@hdefazio Can you change the references to KServe issues?

hdefazio · 2025-05-14T13:48:45Z

@johnugeorge Just updated!

hdefazio · 2025-05-20T18:32:19Z

/rerun-all

hdefazio · 2025-05-20T19:02:51Z

/rerun-all

pierDipi

I would've structured the feature reconciliation a little bit differently but the Serverless mode was done this way, so I'd stay consistent, but was it considered to have the "stop" be a separate reconciliation path since it's mostly deleting child resources so that not every sub-reconciler needs to be aware of the stop feature (which would also reduce the level of branching each has)?

pkg/controller/v1beta1/inferenceservice/components/predictor.go

pkg/controller/v1beta1/inferenceservice/reconcilers/ingress/httproute_reconciler.go

pkg/controller/v1beta1/inferenceservice/components/predictor.go

pkg/controller/v1beta1/inferenceservice/reconcilers/deployment/deployment_reconciler.go

pkg/controller/v1beta1/inferenceservice/reconcilers/hpa/hpa_reconciler.go

pkg/controller/v1beta1/inferenceservice/reconcilers/ingress/httproute_reconciler.go

pkg/controller/v1beta1/inferenceservice/reconcilers/service/service_reconciler.go

pkg/controller/v1beta1/inferenceservice/reconcilers/otel/otel_reconciler.go

pierDipi · 2025-05-21T15:05:47Z

In CI job E2E Tests / test-path-based-routing (pull_request) I see

Loading image kserve-localmodel-controller-ac5980e00a94684e3d201e35e739a4b390a456cb
write /manager: no space left on device

in Go / Build (pull_request)

E0519 20:42:24.824750   34583 suite_test.go:52] unable to start control plane itself: failed to start the controlplane. retried 5 times: exec: "etcd": executable file not found in $PATHFailed to start testing panel

they look like unrelated errors

pkg/controller/v1beta1/inferenceservice/reconcilers/deployment/deployment_reconciler.go

pkg/controller/v1beta1/inferenceservice/reconcilers/ingress/httproute_reconciler.go

pkg/controller/v1beta1/inferenceservice/reconcilers/ingress/kube_ingress_reconciler.go

israel-hdez · 2025-06-04T18:24:38Z

pkg/controller/v1beta1/inferenceservice/reconcilers/ingress/httproute_reconciler.go

-		// Do a dry-run update to avoid diffs generated by default values.
-		// This will populate our local httpRoute with any default values that are present on the remote version.
-		if err := r.client.Update(ctx, desired, client.DryRunAll); err != nil {
-			log.Error(err, "Failed to perform dry-run update for predictor HttpRoute", "name", desired.Name)


The removal of the dry-run, wouldn't it have any side effects? I think the cluster sets some defaults, so the controller may keep updating it constantly.

I changed semanticHttpRouteEquals to use DeepDerivative instead of DeepEqual, which should accomplish determining the diff without needing a dry run. I can change it back if you're worried about it though

@pierDipi Suggested it to help clean up the functions

OK sounds good. I did see the change to DeepDerivative. I didn't relate this removal to that change.

israel-hdez · 2025-06-04T18:29:26Z

pkg/controller/v1beta1/inferenceservice/reconcilers/otel/otel_reconciler.go

-		// Do a dry-run update to avoid diffs generated by default values.
-		// This will populate our local httpRoute with any default values that are present on the remote version.
-		if err := o.client.Update(ctx, desired, client.DryRunAll); err != nil {
-			log.Error(err, "Failed to perform dry-run update for OTel Collector", "name", desired.Name)


Similar question here, about if the missing dry-run would lead to constantly updating the resource?

pkg/controller/v1beta1/inferenceservice/reconcilers/service/service_reconciler.go

israel-hdez · 2025-06-04T18:37:21Z

pkg/controller/v1beta1/inferenceservice/reconcilers/service/service_reconciler.go

@@ -289,6 +299,11 @@ func (r *ServiceReconciler) Reconcile(ctx context.Context) ([]*corev1.Service, e
 			opErr = r.client.Create(ctx, svc)
 		case constants.CheckResultUpdate:
 			opErr = r.client.Update(ctx, svc)
+		case constants.CheckResultDelete:
+			log.Info("Deleting service", "namespace", svc.Namespace, "name", svc.Name)
+			if svc.GetDeletionTimestamp() == nil { // check if the service was already deleted


No check for owner reference... Similarly, in the deployment_reconciler.
I can suggest to add the check in both places.

pkg/controller/v1beta1/inferenceservice/rawkube_controller_test.go

israel-hdez · 2025-06-05T01:09:35Z

The very large error has its root in a panic reported here: https://github.com/kserve/kserve/actions/runs/15434418706/job/43437934692?pr=4455#step:5:666 :

panic: test timed out after 10m0s

This is go test timing out. I disabled the timeout and I get a much more readable report:

Summarizing 3 Failures:
  [FAIL] v1beta1 inference service controller When creating inference service with raw kube predictor and `serving.kserve.io/stop` [It] Should delete the httproute/service/deployment/hpa when the annotation is updated to true on an existing ISVC
  /home/ehernand/Projects/kserve/pkg/controller/v1beta1/inferenceservice/rawkube_controller_test.go:2593
  [FAIL] v1beta1 inference service controller When creating inference service with raw kube predictor and transformer [It] Should have httproute/service/deployment/hpa created for transformer and predictor
  /home/ehernand/Projects/kserve/pkg/controller/v1beta1/inferenceservice/rawkube_controller_test.go:4686
  [FAIL] v1beta1 inference service controller When creating inference service with raw kube path based routing predictor and transformer [It] Should have ingress/service/deployment/hpa created for transformer and predictor
  /home/ehernand/Projects/kserve/pkg/controller/v1beta1/inferenceservice/rawkube_controller_test.go:7217

Ran 62 of 62 Specs in 1212.018 seconds
FAIL! -- 59 Passed | 3 Failed | 0 Pending | 0 Skipped
--- FAIL: TestV1beta1APIs (1212.02s)

Notice the last line TestV1beta1APIs (1212.02s) reports a full run is taking about 20minutes in my machine, while go test default timeout is 10minutes. From docs:

	-timeout d
	    If a test binary runs longer than duration d, panic.
	    If d is 0, the timeout is disabled.
	    The default is 10 minutes (10m).

I don't think we should disable the timeout. IMO we should simply increase it to allow for the units to run properly. You can do it in the Makefile. I can suggest a 30min timeout given the results on my machine.

About the failing tests, the existing ones:

rawkube_controller_test.go:4686
rawkube_controller_test.go:7217

apparently are just failing because the check is out of order, and both give a diff like the following:

                              {
      -                                 Type:     "TransformerReady",
      +                                 Type:     "Stopped",
      -                                 Status:   "True",
      +                                 Status:   "False",
                                        Severity: "Info",
                                        ... // 1 ignored and 2 identical fields
                                },
                                {
      -                                 Type:     "Stopped",
      +                                 Type:     "TransformerReady",
      -                                 Status:   "False",
      +                                 Status:   "True",
                                        Severity: "Info",
                                        ... // 1 ignored and 2 identical fields
                                },

suggesting you just need to swap the array you updated here: https://github.com/kserve/kserve/pull/4455/files#r2127716685 and here: https://github.com/kserve/kserve/pull/4455/files#r2127718390. I have to say that the different order looks suspicious, although in practice order doesn't really matter.

Finally, the unit failing in rawkube_controller_test.go:2593, suggests that something is buggy. The error reported is that the Route is missing:

[FAILED] Timed out after 60.000s.
  Expected success, but got an error:
      <*errors.StatusError | 0xc00183ad20>: 
      HTTPRoute.gateway.networking.k8s.io "stop-edit-true-isvc" not found

I don't know if it the unit is buggy, or if the code is buggy. At the moment, I know that, apparently, the HTTPRoute reconciler is not running.

pkg/controller/v1beta1/inferenceservice/rawkube_controller_test.go

israel-hdez · 2025-06-05T21:44:04Z

The problematic unit rawkube_controller_test.go:2593 is failing because the Context is buggy.

Inside its containing Context("When creating inference service with raw kube predictor and ``serving.kserve.io/stop``") you have a configs variable:

configs := map[string]string{
			"explainers": `[...snip...]`,
			"ingress": `[...snip...]`,
			"storageInitializer": `[...snip...]`,
		}

This variable is assigned inside createInferenceServiceConfigMap function, but the assignment is ordinary. Since maps are reference types, no copy is done.

Thus, units requiring to turn off Gateway API, like It("Should keep the ingress/service/deployment/keda/otel created when gateway api is disabled and the annotation is set to false") here are mutating the global configs and the mutated config is used in following units.

To fix, see comment: https://github.com/kserve/kserve/pull/4455/files#r2130518090

It turns out that @pierDipi was right that Gateway API was configured as disabled.

…it, add initial test case Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>

Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>

… vars, cleanup Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>

…isvc method that does the same thing Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>

Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>

…or deployment and service Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>

Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>

hdefazio · 2025-06-11T13:53:50Z

< 10000 task-lists disabled sortable>

/rerun-all

hdefazio · 2025-06-11T14:08:58Z

/rerun-all

israel-hdez

Great work @hdefazio

Jooho · 2025-06-11T21:45:54Z

/approve
/lgtm

sivanantha321 · 2025-06-12T06:29:34Z

@hdefazio Documenting this feature would be nice !

sivanantha321 · 2025-06-12T07:09:10Z

pkg/controller/v1beta1/inferenceservice/reconcilers/ingress/httproute_reconciler.go

@@ -509,7 +509,7 @@ func createRawTopLevelHTTPRoute(isvc *v1beta1.InferenceService, ingressConfig *v
 }

 func semanticHttpRouteEquals(desired, existing *gatewayapiv1.HTTPRoute) bool {
-	return equality.Semantic.DeepEqual(desired.Spec, existing.Spec)
+	return equality.Semantic.DeepDerivative(desired.Spec, existing.Spec)


From the docs

DeepDerivative is similar to DeepEqual except that unset fields in a1 are ignored (not compared). This allows us to focus on the fields that matter to the semantic comparison.

The unset fields include a nil pointer and an empty string.

@hdefazio In the backendobjectreference, the Group field is set to empty string (which is default). If it is accidently updated, does the controller pick up the change and update it to the desired spec i.e. set the Group field to an empty string ?

hdefazio · 2025-06-12T15:39:35Z

@sivanantha321 Got it, I'll open a task on my side to write some documentation. I'll add it to the next sprint we have in ~2 weeks

Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>

…658) * Stop and resume a model [Raw Deployment] (kserve#4455) Signed-off-by: Hannah DeFazio <h2defazio@gmail.com> * Add missing stopped=false status in test --------- Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>

…serve#658) * Stop and resume a model [Raw Deployment] (kserve#4455) Signed-off-by: Hannah DeFazio <h2defazio@gmail.com> * Add missing stopped=false status in test --------- Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>

hdefazio marked this pull request as draft May 9, 2025 02:54

hdefazio force-pushed the feat/raw_stop_resume branch from 312b5b5 to 6a5e48d Compare May 9, 2025 14:45

hdefazio mentioned this pull request May 9, 2025

Fix issue with precommit hook #4456

Merged

9 tasks

hdefazio force-pushed the feat/raw_stop_resume branch from e089c6d to 84abfaf Compare May 19, 2025 20:37

pierDipi reviewed May 21, 2025

View reviewed changes

hdefazio marked this pull request as ready for review May 27, 2025 14:17

hdefazio force-pushed the feat/raw_stop_resume branch 3 times, most recently from ff55e8e to e663d00 Compare May 30, 2025 03:57

israel-hdez reviewed Jun 4, 2025

View reviewed changes

israel-hdez reviewed Jun 5, 2025

View reviewed changes

pkg/controller/v1beta1/inferenceservice/rawkube_controller_test.go Outdated Show resolved Hide resolved

israel-hdez reviewed Jun 5, 2025

View reviewed changes

pkg/controller/v1beta1/inferenceservice/rawkube_controller_test.go Outdated Show resolved Hide resolved

israel-hdez reviewed Jun 5, 2025

View reviewed changes

pkg/controller/v1beta1/inferenceservice/rawkube_controller_test.go Outdated Show resolved Hide resolved

hdefazio force-pushed the feat/raw_stop_resume branch 3 times, most recently from de87268 to 2e1ff42 Compare June 11, 2025 13:00

hdefazio added 7 commits June 11, 2025 09:00

Move stopped status update code so both serverless and raw can reach …

401bfa3

…it, add initial test case Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>

Delete the deployment and http routes, Add stopped test case

3f75210

Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>

Use consistenly in test cases, fix bug

63c4bb0

Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>

Delete the service and hpa

a4d0812

Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>

Delete the ingress, keda autoscaler, and openTelemetry collector

a72b236

Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>

Formatting

96f6e20

Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>

Formatting: Use constant for reason, change log msg, rename snakecase…

20a0dde

… vars, cleanup Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>

hdefazio added 7 commits June 11, 2025 09:00

Create generic function for checking the stop annotation, remove the …

520bd72

…isvc method that does the same thing Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>

Move the stopped reason constant

3ab6ead

Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>

Add test cases for updating an existing ISVC, fix bad return values f…

0db6713

…or deployment and service Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>

Add stopped condition to expected ISVC status values

419ee83

Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>

Refactor functions with lots of nested if statements

c05df83

Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>

Fix unit tests, cleanup

4c7be5e

Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>

Check for the ownership before forcibly stopping

2513050

Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>

hdefazio force-pushed the feat/raw_stop_resume branch from 2e1ff42 to 2513050 Compare June 11, 2025 13:00

israel-hdez approved these changes Jun 11, 2025

View reviewed changes

github-actions bot added the lgtm label Jun 11, 2025

github-actions bot approved these changes Jun 11, 2025

View reviewed changes

Jooho added the approved label Jun 11, 2025

Jooho merged commit 6530d1e into kserve:master Jun 11, 2025
94 of 100 checks passed

sivanantha321 reviewed Jun 12, 2025

View reviewed changes

hdefazio added a commit to hdefazio/kserve that referenced this pull request Jun 12, 2025

Stop and resume a model [Raw Deployment] (kserve#4455)

a14d4e9

Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>

hdefazio mentioned this pull request Jun 12, 2025

[Cherry pick] Stop and resume a model [Raw Deployment] (#4455) opendatahub-io/kserve#658

Merged

hdefazio added a commit to hdefazio/kserve that referenced this pull request Jun 13, 2025

Stop and resume a model [Raw Deployment] (kserve#4455)

69fd99a

Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>

Stop and resume a model [Raw Deployment] #4455

Stop and resume a model [Raw Deployment] #4455

Uh oh!

Conversation

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!