Preserve order of group results with Redis result backend #6218
Conversation
flakes seem happier than the previous PR on my machine. I'll see how CI here goes before I dig into getting the full test suite running locally. I've stripped a commit from the previous PR which changed the version of flake8 used, since it wasn't relevant to this changeset and was probably super outdated anyway. Otherwise, it's pretty much the same as #4858 except for resolving the conflict in …
@georgepsarakis - this is a refresh of #4858 where you raised a question about the changes from using a List to a SortedSet in the Redis backend wrt backward compatibility. Can you revisit this? ref: #4858 (comment)
This pull request introduces 1 alert and fixes 2 when merging 88c7a78 into 9d54c8a - view on LGTM.com
Looks like we didn't change the args the mock accepts for … Edit: Ah, I see that CI stops after a single failure. I'll fix up the mocks here and get that suite running locally before I push more changes here.
This pull request introduces 2 alerts and fixes 1 when merging 00f529d into 9d54c8a - view on LGTM.com
Force-pushed from 00f529d to edab532 (Compare)
This changeset should be good now. I had to correct the behaviour of the sorted set mocks in the backend test suite since they didn't match what Redis would actually do and also returned the score as well as the desired value in a 2-tuple. I've left a fixup commit on the end which should be squashed down onto the first commit in this branch prior to merge. The mock fixups are in their own commit to make it clear what has changed since the previous PR to get things working. Let me know if you would like any further changes prior to merge, @auvipy. I'll keep an eye open for this CI run but I'm hoping it'll go green now.
This pull request fixes 1 alert when merging edab532 into 9d54c8a - view on LGTM.com
```
E   redis.exceptions.ResponseError: WRONGTYPE Operation against a key holding the wrong kind of value

.tox/3.8-integration-redis/lib/python3.8/site-packages/redis/connection.py:756: ResponseError
```
Interesting. IIUC that's blowing up because the integration test is trying to access an element of the result object (or maybe the whole object) as a list. Should be a fairly straightforward fix, I'll try to get to it tomorrow.
This pull request fixes 2 alerts when merging 18b8cac into 9d54c8a - view on LGTM.com
The test suite still uses `lrange()` and `rpush()` to implement its `redis-echo` task chain integration tests, but these are unrelated to the handling of group results and remain unchanged.
This pull request fixes 2 alerts when merging 9504ff6 into 9d54c8a - view on LGTM.com
Codecov Report
```
@@           Coverage Diff            @@
##           master   #6218     +/-  ##
=======================================
+ Coverage   83.57%   83.60%   +0.03%
=======================================
  Files         145      145
  Lines       17375    17416      +41
  Branches     2160     2174      +14
=======================================
+ Hits        14521    14561      +40
- Misses       2631     2632       +1
  Partials      223      223
```
Continue to review full report at Codecov.
Looks like we need a test for …
please do so when you have time.
|
```python
client = self.client
jkey = self.get_key_for_group(gid, '.j')
tkey = self.get_key_for_group(gid, '.t')
result = self.encode_result(result, state)
with client.pipeline() as pipe:
    pipeline = pipe \
        .rpush(jkey, self.encode([1, tid, state, result])) \
        .llen(jkey) \
        .zadd(jkey,
```
@maybe-sybr unfortunately at this point (and Line 449) we break backwards compatibility and exceptions will be raised. I think we need to handle both structures for the transition. What do you think?
For my purposes, I haven't run long term deployments where the Redis instance outlives the Celery one(s). It certainly seems feasible but would either:
- require an extra type check against the backend before we actually build up the right pipeline for the observed type, or
- construct a list or zset based pipeline depending on whether we think the group is zset based or not (perhaps if `group_index is not None`?)
Any preference on the approach?
Thinking about this a bit more, I don't really see how we can reasonably maintain backward compatibility here. Only workers interact with the results backend and use this code, so IIUC we're talking about situations where there's a progressive upgrade of celery workers in a cluster, with some running old code and others running the new code. Or am I misunderstanding and does result resolution on any part of the system also hit the result backend?
The problem with the mixed worker node situation is that old code will unconditionally use operations for lists while new code should optimistically use operations for sorted sets and fall back to lists if it has to interact with older workers. That'll be racy and break if a worker running old code happens to get beaten to the results key by one running new code. I suppose there could be a decision made on something like feature flags (or a naive version check) passed between workers when they mingle but I'm not familiar enough with the clustering protocol to know if that's a reasonable idea.
ping @georgepsarakis - any input on this? AFAICT running code which issues tasks on master and a worker on this branch works fine, and vice versa (with the results being unordered since the worker uses a list to store the group results). So I think what I said in the previous comment about mixed code in clusters of workers is where breakage would occur, and need some input on whether we need some sort of behavioural switch based on something sourced from the mingling process.
Force-pushed from 46e327b to b036164 (Compare)
This pull request introduces 2 alerts and fixes 4 when merging b036164 into d537be4 - view on LGTM.com
I think the Appveyor build may have died/timed out. The red job ran for an hour and seems to have truncated output.
No need to worry about AppVeyor now.
`E TypeError: 'NoneType' object is not subscriptable`
That's interesting @auvipy - I think that test passed on effectively the same changeset previously. Build 8047 (red): Build 8046 (green): Seems like the test might be racy.
I restarted the build
I managed to hit the error running the chord integration tests in …
All the tests are passing now here! Please submit another PR for the unrelated error.
I will wait for clarification or suggestion about keeping BC from George
@thedrow - I've made no changes to address the back-compat concerns. Awaiting a reply from @georgepsarakis since I'm not sure how we can go about solving the issue (per my comments in that thread) given my current understanding of when problems would manifest (multiple workers running mixed versions of celery IIUC). Happy to make a follow-up PR to fix any potential issues as you all see fit.
This should ensure that there is no breakage between workers with the code from celery#6218 and those without, unless the cluster owner specifically opts into the new behaviour.
* Preserve order of return values from groups. Fixes celery#3781.
* Update for zadd arguments changed in redis-py 3
* Use more explicit loop variable name
* Handle group_index not set
* Use zrange instead of zrangebyscore
* test: Fix Redis sorted set mocks in backend tests
* test: Make canvas integration tests use `zrange()`. The test suite still uses `lrange()` and `rpush()` to implement its `redis-echo` task chain integration tests, but these are unrelated to the handling of group results and remain unchanged.
* test: Add unit tests for `group_index` handling
* fix: Add `group_index` to `Context`, chord uplift
* test: Sanity check `Request.group_index` property. This adds a test to make sure the property exists and also changes the property to use the private `_request_dict` rather than the public property.

Co-authored-by: Leo Singer <leo.singer@ligo.org>
Description
This PR is an alternate to #4858 authored by @lpsinger and fixes #3781. It ensures
that the order of results from a canvas group task wrapper matches the order of the
individual encapsulated tasks when using Redis as a results backend.
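The core idea can be illustrated without a live Redis server: each result is stored with its position in the group as its sorted-set score, so reading results back in score order recovers the original task order even when tasks finish out of order. A simplified model of the ZADD/ZRANGE behaviour (an illustration only, not the backend code itself):

```python
def store_result(zset, group_index, payload):
    """Model of ZADD: record (score, member) pairs, where the score is
    the task's position within the group."""
    zset.append((group_index, payload))


def read_results(zset):
    """Model of ZRANGE 0 -1: return members ordered by ascending score."""
    return [payload for _, payload in sorted(zset)]


results = []
# Tasks may finish in any order...
store_result(results, 2, "result-of-task-2")
store_result(results, 0, "result-of-task-0")
store_result(results, 1, "result-of-task-1")
# ...but reading back by score restores the group's declared order.
print(read_results(results))
# ['result-of-task-0', 'result-of-task-1', 'result-of-task-2']
```

With the previous list-based storage (RPUSH/LRANGE), results came back in completion order instead, which is why `group_index` had to be threaded through to the backend.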