Use a delaying enqueue handler for drift detection by aruiz14 · Pull Request #3401 · rancher/fleet · GitHub

Use a delaying enqueue handler for drift detection #3401


Merged

merged 5 commits into rancher:main from agent-drift-detection-delay on Mar 4, 2025

Conversation

@aruiz14 (Contributor) commented Feb 27, 2025

Refers to #3402

These changes introduce an artificial delay when enqueuing BundleDeployment reconciliations after Fleet agents detect changes in the monitored deployed resources.
Non-ready resources often produce multiple successive Kubernetes events in a short period of time, which would cause unnecessary noise and stress not only for the BundleDeployment reconciler running in the agent, but also for the upstream Bundle status reconciler, which would also be enqueued to reflect the new state of the associated BundleDeployment.

Adding a short delay aggregates all those events into a single one (every X seconds, depending on the frequency of events and the configured delay) by using the workqueue's TypedDelayingInterface (AddAfter) instead of Add, which enqueues immediately.
The obvious downside is a delay in drift-detection notifications, but it's just a matter of finding the right balance; I believe 5 seconds is a reasonable initial value.
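
For illustration, here is a minimal Go sketch of the approach, not the Fleet code verbatim: delayingEnqueueHandler and enqueue are hypothetical names, while TypedDelayingInterface and AddAfter are the actual client-go workqueue API.

package drift

import (
	"time"

	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/util/workqueue"
)

// defaultDelay is the initial value proposed in this PR.
const defaultDelay = 5 * time.Second

type delayingEnqueueHandler struct {
	queue workqueue.TypedDelayingInterface[types.NamespacedName]
	delay time.Duration
}

// enqueue schedules a reconciliation after the configured delay instead of
// immediately. Further calls for the same key while it is still pending are
// deduplicated by the queue, which is what aggregates event bursts.
func (h *delayingEnqueueHandler) enqueue(key types.NamespacedName) {
	h.queue.AddAfter(key, h.delay) // instead of h.queue.Add(key)
}

Because a delaying queue keeps at most one pending entry per item, every event arriving while the key is still waiting folds into the already-scheduled reconciliation.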

Note: this PR also includes 2 misc changes:

  • Use concrete types for events and handler.
  • Remove unused package variable in summarizer (leftover from Vendor apply.DryRun #2854)

Testing

  • I created a loop that would patch a deployed configmap every second:
for x in $(seq 1 100); do
    # patch the deployed ConfigMap once per second to generate a burst of change events
    kubectl -n testns patch configmap/test-configmap -p '{"data": {"name": "'$x'"}}'
    sleep 1
done

Then observed the timestamps in the agent logs, which show one reconciliation roughly every 5 seconds despite the ConfigMap being patched every second:

...
{"level":"info","ts":"2025-02-27T12:39:46Z","logger":"drift.update-status","msg":"Status not ready according to nonModified and nonReady","controller":"drift-reconciler","namespace":"cluster-fleet-default-downstream-0-0-f4ef75b4942b","name":"testns-testbundle","reconcileID":"62d82242-3e68-4dfb-b38e-c2d094665262","nonModified":false,"nonReady":null}
{"level":"info","ts":"2025-02-27T12:39:51Z","logger":"drift.update-status","msg":"Status not ready according to nonModified and nonReady","controller":"drift-reconciler","namespace":"cluster-fleet-default-downstream-0-0-f4ef75b4942b","name":"testns-testbundle","reconcileID":"f68e06fe-7471-4d1b-a51c-f912ed3ce69d","nonModified":false,"nonReady":null}
{"level":"info","ts":"2025-02-27T12:39:56Z","logger":"drift.update-status","msg":"Status not ready according to nonModified and nonReady","controller":"drift-reconciler","namespace":"cluster-fleet-default-downstream-0-0-f4ef75b4942b","name":"testns-testbundle","reconcileID":"4038d70c-5a9
8000
7-4458-95c2-bfcfdabde4ae","nonModified":false,"nonReady":null}
{"level":"info","ts":"2025-02-27T12:40:01Z","logger":"drift.update-status","msg":"Status not ready according to nonModified and nonReady","controller":"drift-reconciler","namespace":"cluster-fleet-default-downstream-0-0-f4ef75b4942b","name":"testns-testbundle","reconcileID":"41652f56-1b6e-40c9-9c2e-8e2cd8853141","nonModified":false,"nonReady":null}
{"level":"info","ts":"2025-02-27T12:40:06Z","logger":"drift.update-status","msg":"Status not ready according to nonModified and nonReady","controller":"drift-reconciler","namespace":"cluster-fleet-default-downstream-0-0-f4ef75b4942b","name":"testns-testbundle","reconcileID":"0963cc6e-31f1-4a1d-bd88-d40e419b9998","nonModified":false,"nonReady":null}
{"level":"info","ts":"2025-02-27T12:40:12Z","logger":"drift.update-status","msg":"Status not ready according to nonModified and nonReady","controller":"drift-reconciler","namespace":"cluster-fleet-default-downstream-0-0-f4ef75b4942b","name":"testns-testbundle","reconcileID":"ad7ac5a0-fd70-495f-8d7a-83d095bce1a2","nonModified":false,"nonReady":null}
{"level":"info","ts":"2025-02-27T12:40:17Z","logger":"drift.update-status","msg":"Status not ready according to nonModified and nonReady","controller":"drift-reconciler","namespace":"cluster-fleet-default-downstream-0-0-f4ef75b4942b","name":"testns-testbundle","reconcileID":"fb1bca12-e2a7-4d4e-bc78-fb7d66e3621b","nonModified":false,"nonReady":null}
...

Note: these modifications don't actually cause further updates in the BundleDeployment status, as the status doesn't contain enough detail (e.g. the displayed patch is only the merge-patch needed to correct the drift, not a diff containing the current value).
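
As a standalone illustration of the aggregation mechanism (my own demo, not part of the PR): a typed delaying queue keeps a single pending entry per item, so a burst of AddAfter calls for the same key yields one work item.

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	q := workqueue.NewTypedDelayingQueue[string]()
	defer q.ShutDown()

	// Simulate a burst of five drift events for the same BundleDeployment key.
	for i := 0; i < 5; i++ {
		q.AddAfter("testns/testbundle", 5*time.Second)
		time.Sleep(100 * time.Millisecond)
	}

	time.Sleep(6 * time.Second)  // let the delay elapse
	fmt.Println("len:", q.Len()) // prints 1: the burst collapsed into one item
}

This matches the pattern in the logs above: one reconciliation roughly every 5 seconds, regardless of how many events arrive in between.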

  • In order to simulate BundleDeployment updates, I created a simple Bundle which deploys a sample application, including a frontend Deployment.
    • I added a new condition to the Deployment's status with reason: Error and a custom message, which makes Fleet consider this resource non-ready.
> k get bundledeployments -n cluster-fleet-default-downstream-0-0-f4ef75b4942b   simple-simple -w  | timestamp
[13:05:57] NAME            DEPLOYED   MONITORED   STATUS
...
[13:07:28] simple-simple   True       True        
[13:22:50] simple-simple   True       True        deployment.apps default/frontend error] Foobar., Deployment is available. Replicas: 3
  • Then, while I kept the watch ongoing, I ran a snippet in a separate shell, inside a counter loop like the one above (which supplies `$x`), replacing the custom message for this condition (a client-go equivalent is sketched after this list):
> k -n default get deployment frontend -o json | gsed 's/message": "Foobar.*"/message": "Foobar '$x'"/' | k replace --subresource=status -f -
  • Finally, the BundleDeployment watch shows that it's only updated once every few events:
...
[13:27:16] simple-simple   True       True        deployment.apps default/frontend error] Foobar 5, Deployment is available. Replicas: 3
[13:27:22] simple-simple   True       True        deployment.apps default/frontend error] Foobar 10, Deployment is available. Replicas: 3
[13:27:27] simple-simple   True       True        deployment.apps default/frontend error] Foobar 15, Deployment is available. Replicas: 3
[13:27:32] simple-simple   True       True        deployment.apps default/frontend error] Foobar 20, Deployment is available. Replicas: 3
[13:27:38] simple-simple   True       True        deployment.apps default/frontend error] Foobar 25, Deployment is available. Replicas: 3
[13:27:43] simple-simple   True       True        deployment.apps default/frontend error] Foobar 30, Deployment is available. Replicas: 3
[13:27:49] simple-simple   True       True        deployment.apps default/frontend error] Foobar 35, Deployment is available. Replicas: 3
...
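
For completeness, here is a client-go equivalent of the message-replacement step above. This is an illustrative sketch, not what was actually run; it assumes a kubeconfig at the default location and that the injected condition's reason is Error.

package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	deployments := client.AppsV1().Deployments("default")
	for x := 1; x <= 100; x++ {
		dep, err := deployments.Get(ctx, "frontend", metav1.GetOptions{})
		if err != nil {
			panic(err)
		}
		// Rewrite the injected condition's message, as the gsed pipeline does.
		for i := range dep.Status.Conditions {
			if dep.Status.Conditions[i].Reason == "Error" {
				dep.Status.Conditions[i].Message = fmt.Sprintf("Foobar %d", x)
			}
		}
		// Write through the status subresource, like `kubectl replace --subresource=status`.
		if _, err := deployments.UpdateStatus(ctx, dep, metav1.UpdateOptions{}); err != nil {
			panic(err)
		}
		time.Sleep(time.Second)
	}
}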

@aruiz14 aruiz14 changed the title Use typed event types Use a delaying enqueue handler for drift detection Feb 27, 2025
@aruiz14 aruiz14 requested a review from manno February 27, 2025 11:54
@aruiz14 aruiz14 force-pushed the agent-drift-detection-delay branch from a3b4333 to b57f550 on February 27, 2025 12:23
@aruiz14 aruiz14 marked this pull request as ready for review February 27, 2025 14:54
@aruiz14 aruiz14 requested a review from a team as a code owner February 27, 2025 14:54
aruiz14 and others added 2 commits March 4, 2025 11:31
Co-authored-by: Mario Manno <mario.manno@gmail.com>
@manno manno merged commit 17e45b6 into rancher:main Mar 4, 2025
12 checks passed
@aruiz14 aruiz14 deleted the agent-drift-detection-delay branch March 4, 2025 13:00
weyfonk added a commit to weyfonk/fleet that referenced this pull request Mar 11, 2025
This function became obsolete when using a delaying enqueue handler for
drift detection; see fleet#3401 [1].

[1]: rancher#3401
manno pushed a commit that referenced this pull request Mar 13, 2025
* Remove unused function

This function became obsolete when using a delaying enqueue handler for
drift detection; see fleet#3401 [1].

[1]: #3401

* Modernise output setting in deployer integration tests

This makes use of `SetOut` and `SetErr` instead of `SetOutput`, which is
deprecated (see the sketch after this commit list).

* Remove unused parameter from test helper methods

That parameter is no longer necessary now that our test infrastructure
uses `default` as its namespace.
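
For reference, the SetOut/SetErr change looks roughly like this (a sketch assuming the tests drive cobra commands, which the deprecated SetOutput suggests; newBuffers is a hypothetical helper):

package deployer_test

import (
	"bytes"

	"github.com/spf13/cobra"
)

// newBuffers wires a command's output streams to in-memory buffers.
func newBuffers(cmd *cobra.Command) (stdout, stderr *bytes.Buffer) {
	stdout, stderr = new(bytes.Buffer), new(bytes.Buffer)
	// Deprecated style: cmd.SetOutput(stdout) routed both streams to one writer.
	cmd.SetOut(stdout) // capture stdout separately
	cmd.SetErr(stderr) // capture stderr separately
	return stdout, stderr
}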