Use a delaying enqueue handler for drift detection by aruiz14 · Pull Request #3401 · rancher/fleet · GitHub

Use a delaying enqueue handler for drift detection #3401


Merged

merged 5 commits into rancher:main from agent-drift-detection-delay on Mar 4, 2025

Conversation

@aruiz14 (Contributor) commented Feb 27, 2025

Refers to #3402

These changes introduce an artificial delay when enqueuing BundleDeployment reconciliations after Fleet agents detect changes in the monitored deployed resources.
Non-ready resources often produce multiple successive Kubernetes events in a short period of time, which would cause unnecessary noise and stress not only for the BundleDeployment reconciler running in the agent, but also for the upstream Bundle status reconciler, which would also be enqueued to reflect the new state of the associated BundleDeployment.

Adding a short delay aggregates all those events into a single one (every X seconds, depending on the frequency of events and the configured delay) by using the workqueue's TypedDelayingInterface (AddAfter) instead of Add, which enqueues immediately.
The obvious downside is a delay in drift-detection notifications, but it's just a matter of finding the right balance; I believe 5 seconds is a reasonable initial value.
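
For illustration, here is a minimal Go sketch of the approach, not the Fleet code verbatim: delayingEnqueueHandler and enqueue are hypothetical names, while TypedDelayingInterface and AddAfter are the actual client-go workqueue API.

package drift

import (
	"time"

	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/util/workqueue"
)

// defaultDelay is the initial value proposed in this PR.
const defaultDelay = 5 * time.Second

type delayingEnqueueHandler struct {
	queue workqueue.TypedDelayingInterface[types.NamespacedName]
	delay time.Duration
}

// enqueue schedules a reconciliation after the configured delay instead of
// immediately. Further calls for the same key while it is still pending are
// deduplicated by the queue, which is what aggregates event bursts.
func (h *delayingEnqueueHandler) enqueue(key types.NamespacedName) {
	h.queue.AddAfter(key, h.delay) // instead of h.queue.Add(key)
}

Because a delaying queue keeps at most one pending entry per item, every event arriving while the key is still waiting folds into the already-scheduled reconciliation.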

Note: this PR also includes 2 misc changes:

  • Use concrete types for events and handler.
  • Remove unused package variable in summarizer (leftover from Vendor apply.DryRun #2854)

Testing

  • I created a loop that would patch a deployed configmap every second:
for x in $(seq 1 100); do
    # patch the deployed ConfigMap once per second to generate a burst of change events
    kubectl -n testns patch configmap/test-configmap -p '{"data": {"name": "'$x'"}}'
    sleep 1
done

Then observed the timestamps in the agent logs, which show one reconciliation roughly every 5 seconds despite the ConfigMap being patched every second:

...
{"level":"info","ts":"2025-02-27T12:39:46Z","logger":"drift.update-status","msg":"Status not ready according to nonModified and nonReady","controller":"drift-reconciler","namespace":"cluster-fleet-default-downstream-0-0-f4ef75b4942b","name":"testns-testbundle","reconcileID":"62d82242-3e68-4dfb-b38e-c2d094665262","nonModified":false,"nonReady":null}
{"level":"info","ts":"2025-02-27T12:39:51Z","logger":"drift.update-status","msg":"Status not ready according to nonModified and nonReady","controller":"drift-reconciler","namespace":"cluster-fleet-default-downstream-0-0-f4ef75b4942b","name":"testns-testbundle","reconcileID":"f68e06fe-7471-4d1b-a51c-f912ed3ce69d","nonModified":false,"nonReady":null}
{"level":"info","ts":"2025-02-27T12:39:56Z","logger":"drift.update-status","msg":"Status not ready according to nonModified and nonReady","controller":"drift-reconciler","namespace":"cluster-fleet-default-downstream-0-0-f4ef75b4942b","name":"testns-testbundle","reconcileID":"4038d70c-5a9
8000
7-4458-95c2-bfcfdabde4ae","nonModified":false,"nonReady":null}
{"level":"info","ts":"2025-02-27T12:40:01Z","logger":"drift.update-status","msg":"Status not ready according to nonModified and nonReady","controller":"drift-reconciler","namespace":"cluster-fleet-default-downstream-0-0-f4ef75b4942b","name":"testns-testbundle","reconcileID":"41652f56-1b6e-40c9-9c2e-8e2cd8853141","nonModified":false,"nonReady":null}
{"level":"info","ts":"2025-02-27T12:40:06Z","logger":"drift.update-status","msg":"Status not ready according to nonModified and nonReady","controller":"drift-reconciler","namespace":"cluster-fleet-default-downstream-0-0-f4ef75b4942b","name":"testns-testbundle","reconcileID":"0963cc6e-31f1-4a1d-bd88-d40e419b9998","nonModified":false,"nonReady":null}
{"level":"info","ts":"2025-02-27T12:40:12Z","logger":"drift.update-status","msg":"Status not ready according to nonModified and nonReady","controller":"drift-reconciler","namespace":"cluster-fleet-default-downstream-0-0-f4ef75b4942b","name":"testns-testbundle","reconcileID":"ad7ac5a0-fd70-495f-8d7a-83d095bce1a2","nonModified":false,"nonReady":null}
{"level":"info","ts":"2025-02-27T12:40:17Z","logger":"drift.update-status","msg":"Status not ready according to nonModified and nonReady","controller":"drift-reconciler","namespace":"cluster-fleet-default-downstream-0-0-f4ef75b4942b","name":"testns-testbundle","reconcileID":"fb1bca12-e2a7-4d4e-bc78-fb7d66e3621b","nonModified":false,"nonReady":null}
...

Note: these modifications don't actually cause further updates in the BundleDeployment status, as the status doesn't contain enough detail (e.g. the displayed patch is only the merge-patch needed to correct the drift, not a diff containing the current value).
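
As a standalone illustration of the aggregation mechanism (my own demo, not part of the PR): a typed delaying queue keeps a single pending entry per item, so a burst of AddAfter calls for the same key yields one work item.

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	q := workqueue.NewTypedDelayingQueue[string]()
	defer q.ShutDown()

	// Simulate a burst of five drift events for the same BundleDeployment key.
	for i := 0; i < 5; i++ {
		q.AddAfter("testns/testbundle", 5*time.Second)
		time.Sleep(100 * time.Millisecond)
	}

	time.Sleep(6 * time.Second)  // let the delay elapse
	fmt.Println("len:", q.Len()) // prints 1: the burst collapsed into one item
}

This matches the pattern in the logs above: one reconciliation roughly every 5 seconds, regardless of how many events arrive in between.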

  • In order to simulate BundleDeployment updates, I created a simple Bundle which deploys a sample application, including a frontend Deployment.
    • I added a new condition to the Deployment's status with reason: Error and a custom message, which makes Fleet consider this resource non-ready.
> k get bundledeployments -n cluster-fleet-default-downstream-0-0-f4ef75b4942b   simple-simple -w  | timestamp
[13:05:57] NAME            DEPLOYED   MONITORED   STATUS
...
[13:07:28] simple-simple   True       True        
[13:22:50] simple-simple   True       True        deployment.apps default/frontend error] Foobar., Deployment is available. Replicas: 3
  • Then, while I kept the watch ongoing, I ran a snippet in a separate shell, inside a counter loop like the one above (which supplies `$x`), replacing the custom message for this condition (a client-go equivalent is sketched after this list):
> k -n default get deployment frontend -o json | gsed 's/message": "Foobar.*"/message": "Foobar '$x'"/' | k replace --subresource=status -f -
  • Finally, the BundleDeployment watch shows that it's only updated once every few events:
...
[13:27:16] simple-simple   True       True        deployment.apps default/frontend error] Foobar 5, Deployment is available. Replicas: 3
[13:27:22] simple-simple   True       True        deployment.apps default/frontend error] Foobar 10, Deployment is available. Replicas: 3
[13:27:27] simple-simple   True       True        deployment.apps default/frontend error] Foobar 15, Deployment is available. Replicas: 3
[13:27:32] simple-simple   True       True        deployment.apps default/frontend error] Foobar 20, Deployment is available. Replicas: 3
[13:27:38] simple-simple   True       True        deployment.apps default/frontend error] Foobar 25, Deployment is available. Replicas: 3
[13:27:43] simple-simple   True       True        deployment.apps default/frontend error] Foobar 30, Deployment is available. Replicas: 3
[13:27:49] simple-simple   True       True        deployment.apps default/frontend error] Foobar 35, Deployment is available. Replicas: 3
...
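
For completeness, here is a client-go equivalent of the message-replacement step above. This is an illustrative sketch, not what was actually run; it assumes a kubeconfig at the default location and that the injected condition's reason is Error.

package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	deployments := client.AppsV1().Deployments("default")
	for x := 1; x <= 100; x++ {
		dep, err := deployments.Get(ctx, "frontend", metav1.GetOptions{})
		if err != nil {
			panic(err)
		}
		// Rewrite the injected condition's message, as the gsed pipeline does.
		for i := range dep.Status.Conditions {
			if dep.Status.Conditions[i].Reason == "Error" {
				dep.Status.Conditions[i].Message = fmt.Sprintf("Foobar %d", x)
			}
		}
		// Write through the status subresource, like `kubectl replace --subresource=status`.
		if _, err := deployments.UpdateStatus(ctx, dep, metav1.UpdateOptions{}); err != nil {
			panic(err)
		}
		time.Sleep(time.Second)
	}
}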

@aruiz14 aruiz14 changed the title Use typed event types Use a delaying enqueue handler for drift detection Feb 27, 2025
@aruiz14 aruiz14 requested a review from manno February 27, 2025 11:54
@aruiz14 aruiz14 force-pushed the agent-drift-detection-delay branch from a3b4333 to b57f550 on February 27, 2025 12:23
@aruiz14 aruiz14 marked this pull request as ready for review February 27, 2025 14:54
@aruiz14 aruiz14 requested a review from a team as a code owner February 27, 2025 14:54
aruiz14 and others added 2 commits March 4, 2025 11:31
Co-authored-by: Mario Manno <mario.manno@gmail.com>
@manno manno merged commit 17e45b6 into rancher:main Mar 4, 2025
12 checks passed
@aruiz14 aruiz14 deleted the agent-drift-detection-delay branch March 4, 2025 13:00
weyfonk added a commit to weyfonk/fleet that referenced this pull request Mar 11, 2025
This function became obsolete when using a delaying enqueue handler for
drift detection; see fleet#3401 [1].

[1]: rancher#3401
manno pushed a commit that referenced this pull request Mar 13, 2025
* Remove unused function

This function became obsolete when using a delaying enqueue handler for
drift detection; see fleet#3401 [1].

[1]: #3401

* Modernise output setting in deployer integration tests

This makes use of `SetOut` and `SetErr` instead of `SetOutput`, which is
deprecated (see the sketch after this commit list).

* Remove unused parameter from test helper methods

That parameter is no longer necessary now that our test infrastructure
uses `default` as its namespace.
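
For reference, the SetOut/SetErr change looks roughly like this (a sketch assuming the tests drive cobra commands, which the deprecated SetOutput suggests; newBuffers is a hypothetical helper):

package deployer_test

import (
	"bytes"

	"github.com/spf13/cobra"
)

// newBuffers wires a command's output streams to in-memory buffers.
func newBuffers(cmd *cobra.Command) (stdout, stderr *bytes.Buffer) {
	stdout, stderr = new(bytes.Buffer), new(bytes.Buffer)
	// Deprecated style: cmd.SetOutput(stdout) routed both streams to one writer.
	cmd.SetOut(stdout) // capture stdout separately
	cmd.SetErr(stderr) // capture stderr separately
	return stdout, stderr
}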