Re-authenticate if the session is closed by a concurrent request #1031
Merged
Conversation
Confirmed to be fixing the issue (see the #980 comments). This change was suspected to cause memory leaks, but it was later proven that the leak is not related to this particular change and is reproduced with the main branch, too. Hence, this PR is good to go (after a final quick look, for certainty & safety).
Commits:
- Signed-off-by: Sergey Vasilyev <nolar@nolar.info>
- …closed. Signed-off-by: Sergey Vasilyev <nolar@nolar.info>
- … one by the handler name. Signed-off-by: Sergey Vasilyev <nolar@nolar.info>
- …fety. Otherwise, with the lock & toggle combined, it led to situations where the lock was released after the invalidation, leaving the "current" set empty and only then triggering the new re-authentication. At the same time, the lock was acquired by a parallel task that had already checked the readiness state and got it "on" (before the invalidation), but was waiting on the lock to actually yield the credentials. As a result, the `select()` call failed because the current set of credentials turned out to be empty by that moment. Now, all operations, both populating/invalidating the credentials and checking the readiness state, are done under the same lock/condition. This should ensure the lack of inconsistent states between the toggle & the sets of credentials at different moments in time. Signed-off-by: Sergey Vasilyev <nolar@nolar.info>
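The race this last commit message describes can be sketched roughly like this; the `RacyVault` name and the exact `Event`/`Lock` split are illustrative assumptions, not kopf's actual classes:

```python
import asyncio
from typing import Any, Dict

class RacyVault:
    """The pre-fix shape, illustratively: the readiness toggle and the
    credentials set are guarded by different primitives."""

    def __init__(self) -> None:
        self._ready = asyncio.Event()      # the boolean readiness toggle
        self._lock = asyncio.Lock()        # guards the credentials set
        self._current: Dict[str, Any] = {}

    async def invalidate(self, key: str) -> None:
        async with self._lock:
            self._current.pop(key, None)
            self._ready.clear()            # flipped under the lock...

    async def select(self) -> Any:
        await self._ready.wait()           # ...but checked outside of it!
        # A parallel task can invalidate right here, after the readiness
        # check but before the lock is acquired, emptying the "current" set.
        async with self._lock:
            if not self._current:
                raise LookupError("No credentials left by the moment of selection")
            return next(iter(self._current.values()))
```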
Would it be possible to cut a new release once this is merged? Thanks again for your great work :)
Several concurrent requests can step on each other's toes while catching HTTP 401s: one of them catches the 401 and causes the re-auth normally (by invalidating the APIContext via the Vault), while the others are left with the closed session and cannot even attempt their retried requests to get that 401. Instead, they get a generic RuntimeError("Session is closed") from aiohttp.
Concurrent requests are highly likely if the API previously returned some other errors, so that the requests went into a back-off sleep for a few seconds before retrying.
A proper fix would be to retry with a new session, but we currently have no mechanism to replace the session in a context object (safely). Hence, escalate and try again with new credentials (losing the retry counter as a side effect).
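As a rough illustration of this escalation path (a minimal, hypothetical sketch: `request_with_reauth`, `vault.select()`, `vault.invalidate()`, and `context.credentials` are assumed names, not kopf's actual API):

```python
import aiohttp

async def request_with_reauth(vault, method: str, url: str) -> aiohttp.ClientResponse:
    """Retry a request, re-authenticating on 401s AND on closed sessions."""
    while True:
        key, context = await vault.select()   # wait for usable credentials
        try:
            response = await context.session.request(method, url)
        except RuntimeError as exc:
            if "Session is closed" not in str(exc):
                raise
            # A concurrent request caught the 401 first and closed the shared
            # session; escalate as if this request saw the 401 itself, and
            # retry with new credentials (losing the retry counter).
            await vault.invalidate(key, context.credentials)
            continue
        if response.status == 401:
            # The normal path: this request observed the 401 on its own.
            await vault.invalidate(key, context.credentials)
            continue
        return response
```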
Under the hood, this concurrency caused conflicts in the vault over the state of the current and invalid credentials:
The credentials invalidation happens by the key only, which is the login handler name. It does not pass in the actual credentials object that failed with the 401 and thus must be remembered as broken. As a result, the 2nd, 3rd, and following failed API streams invalidated the valid credentials that were acquired after the 1st failure and re-authentication. Therefore, the subsequent re-authentications that tried to add the same credentials failed, because those credentials were known to be invalid.
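The fix for this first factor can be sketched as follows; `TinyVault` is a toy stand-in with assumed names, not kopf's actual `Vault` API. The failed credentials object itself is recorded as broken, and the current entry is evicted only if it still is that exact object:

```python
import asyncio
from typing import Any, Dict, List

class TinyVault:
    """A toy stand-in for the real vault (hypothetical names)."""

    def __init__(self) -> None:
        self._condition = asyncio.Condition()
        self._current: Dict[str, Any] = {}        # login handler name -> credentials
        self._invalid: Dict[str, List[Any]] = {}  # per key: objects known to be bad

    async def invalidate(self, key: str, credentials: Any) -> None:
        async with self._condition:
            # Remember WHICH object failed, not merely that the key failed:
            self._invalid.setdefault(key, []).append(credentials)
            # Evict the current entry only if it is still that same object,
            # so a late 2nd/3rd failing stream cannot evict the fresh
            # credentials acquired by the re-auth after the 1st failure.
            if self._current.get(key) is credentials:
                del self._current[key]

    async def populate(self, key: str, credentials: Any) -> None:
        async with self._condition:
            # Fresh objects under a previously failed key are accepted;
            # only the exact objects that failed before are rejected.
            if credentials not in self._invalid.get(key, []):
                self._current[key] = credentials
                self._condition.notify_all()   # wake any waiters for readiness
```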
The second contributing factor was the inconsistency between the vault's credentials set (`._current`) and its boolean readiness (`._state`), which were protected by different locks. Specifically, the case was this:

1. One task checks the vault's readiness via `Vault._ready._condition`, gets it ("on"), and proceeds further.
2. Meanwhile, another task invalidates the credentials under `Vault._lock`, leaves the lock, and initiates the re-auth again, leaving the "current" set empty.
3. The first task then arrives at `Vault._lock` and acquires it, only to find the current set of credentials empty, so the `select()` call fails.

Most likely, this was overlooked before, in Python 3.10-3.11 or so: those versions were relatively slow, the sequence of operations managed to be fast enough, and the event loop control was not leaving the coroutines on `await` fast enough. Now everything is fast, so the control leaves the coroutines more often.

Now, all these operations are protected by the same condition object, and there should be no unprotected inconsistencies between the current credentials and the vault's boolean readiness state.
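Continuing the hypothetical `TinyVault` sketch from above, the fix for this second factor amounts to deriving readiness from the credentials set itself and checking it under the very same condition object (again, an assumed sketch, not the real implementation):

```python
from typing import Any, Tuple

class ConsistentVault(TinyVault):  # extends the TinyVault sketch above
    async def select(self) -> Tuple[str, Any]:
        async with self._condition:
            # The readiness check (non-emptiness) and the retrieval are atomic
            # with respect to populate()/invalidate(), which mutate the set
            # under the same condition: the set can no longer be emptied
            # between the check and the moment the credentials are yielded.
            await self._condition.wait_for(lambda: bool(self._current))
            return next(iter(self._current.items()))
```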
TODO: Requires tests to simulate the situation — if the hypothesis is correct.
Presumably fixes #980 #1158