8000 Preserve order of group results with Redis result backend by maybe-sybr · Pull Request #6218 · celery/celery · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Preserve order of group results with Redis result backend #6218

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Jul 19, 2020

Conversation

maybe-sybr
Copy link
Contributor

Description

This PR is an alternate to #4858 authored by @lpsinger and fixes #3781. It ensures
that the order of results from a canvas group task wrapper matches the order of the
individual encapsulated tasks when using Redis as a results backend.

@maybe-sybr
Copy link
Contributor Author

flakes seem happier than the previous PR on my machine. I'll see how CI here goes before I dig into getting the full test suite running locally.

I've stripped a commit from the previous PR which changed the version of flake8 used since it wasn't relevant to this changeset and was probably super outdated anyway. Otherwise, it's pretty much the same as #4858 except for some resolving the conflict in celery/backends/redis.py.

@maybe-sybr
Copy link
Contributor Author

@georgepsarakis - this is a refresh of #4858 where you raised a question about the changes from using a List to a SortedSet in the Redis backend wrt backward compatibility. Can you revisit this? ref: #4858 (comment)

@lgtm-com
Copy link
lgtm-com bot commented Jul 10, 2020

This pull request introduces 1 alert and fixes 2 when merging 88c7a78 into 9d54c8a - view on LGTM.com

new alerts:

  • 1 for Non-exception in 'except' clause

fixed alerts:

  • 1 for Non-exception in 'except' clause
  • 1 for Module is imported with 'import' and 'import from'

@maybe-sybr
Copy link
Contributor Author
maybe-sybr commented Jul 10, 2020

1 failed
- t/unit/backends/test_redis.py:591 test_RedisBackend.test_on_chord_part_return

Looks like we didn't change the args the mock accepts for zadd() - I think the fixup commit solves that. I am getting some other unit test failures on my box in that file though which is a bit concerning. We'll see how CI goes this time I guess.

Edit: Ah, I see that CI stops after a single failure. I'll fix up the mocks here and get that suite running locally before I push more changes here.

@lgtm-com
Copy link
lgtm-com bot commented Jul 10, 2020

This pull request introduces 2 alerts and fixes 1 when merging 00f529d into 9d54c8a - view on LGTM.com

new alerts:

  • 2 for Module is imported with 'import' and 'import from'

fixed alerts:

  • 1 for Module is imported with 'import' and 'import from'

@auvipy auvipy added this to the 4.4.x milestone Jul 10, 2020
@auvipy auvipy self-requested a review July 10, 2020 04:32
@maybe-sybr maybe-sybr force-pushed the preserve-group-order branch from 00f529d to edab532 Compare July 10, 2020 05:41
@maybe-sybr
Copy link
Contributor Author

This changeset should be good now. I had to correct the behaviour of the sorted set mocks in the backend test suite since they didn't match what Redis would actually do an also returned the score as well as the desired value in a 2-tuple.

I've left a fixup commit on the end which should be squashed down onto the first commit in this branch prior to merge. The mock fixups are in their own commit to make it clear what has changed since the previous MR to get things working. Let me know if you would like any further changes prior to merge, @auvipy. I'll keep an eye open for this CI run but I'm hoping it'll go green now.

@lgtm-com
Copy link
lgtm-com bot commented Jul 10, 2020

This pull request fixes 1 alert when merging edab532 into 9d54c8a - view on LGTM.com

fixed alerts:

  • 1 for Module is imported with 'import' and 'import from'

@auvipy
Copy link
Member
auvipy commented Jul 10, 2020

E redis.exceptions.ResponseError: WRONGTYPE Operation against a key holding the wrong kind of value

.tox/3.8-integration-redis/lib/python3.8/site-packages/redis/connection.py:756: ResponseError

Copy link
Member
@auvipy auvipy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E redis.exceptions.ResponseError: WRONGTYPE Operation against a key holding the wrong kind of value

.tox/3.8-integration-redis/lib/python3.8/site-packages/redis/connection.py:756: ResponseError

@maybe-sybr
Copy link
Contributor Author

Interesting. IIUC that's blowing up because the integration test is trying to access an element of the result object (or maybe the whole object) as a list. Should be a fairly straightforward fix, I'll try to get to it tomorrow.

@lgtm-com
Copy link
lgtm-com bot commented Jul 12, 2020

This pull request fixes 2 alerts when merging 18b8cac into 9d54c8a - view on LGTM.com

fixed alerts:

  • 1 for Non-exception in 'except' clause
  • 1 for Module is imported with 'import' and 'import from'

@lgtm-com
Copy link
lgtm-com bot commented Jul 12, 2020

This pull request fixes 2 alerts when merging 9504ff6 into 9d54c8a - view on LGTM.com

fixed alerts:

  • 1 for Non-exception in 'except' clause
  • 1 for Module is imported with 'import' and 'import from'

@codecov
Copy link
codecov bot commented Jul 12, 2020

Codecov Report

Merging #6218 into master will increase coverage by 0.03%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6218      +/-   ##
==========================================
+ Coverage   83.57%   83.60%   +0.03%     
==========================================
  Files         145      145              
  Lines       17375    17416      +41     
  Branches     2160     2174      +14     
==========================================
+ Hits        14521    14561      +40     
- Misses       2631     2632       +1     
  Partials      223      223              
Impacted Files Coverage Δ
celery/app/amqp.py 95.06% <ø> (ø)
celery/app/base.py 61.46% <ø> (ø)
celery/utils/abstract.py 19.04% <ø> (ø)
celery/app/task.py 96.19% <100.00%> (+0.01%) ⬆️
celery/backends/redis.py 92.09% <100.00%> (+0.04%) ⬆️
celery/canvas.py 93.07% <100.00%> (+0.05%) ⬆️
celery/worker/request.py 97.20% <100.00%> (+0.02%) ⬆️
celery/bin/celery.py 97.80% <0.00%> (-0.24%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9d54c8a...b036164. Read the comment docs.

@maybe-sybr
Copy link
Contributor Author

Looks like we need a test for group.freeze(group_index=...) and another which sanity tests that the group_index() property exists. That'll fix up the drops in coverage and make codecov happy.

@auvipy
Copy link
Member
auvipy commented Jul 12, 2020

please do so when you have time.

8000

client = self.client
jkey = self.get_key_for_group(gid, '.j')
tkey = self.get_key_for_group(gid, '.t')
result = self.encode_result(result, state)
with client.pipeline() as pipe:
pipeline = pipe \
.rpush(jkey, self.encode([1, tid, state, result])) \
.llen(jkey) \
.zadd(jkey,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maybe-sybr unfortunately at this point (and Line 449) we break backwards compatibility and exceptions will be raised. I think we need to handle both structures for the transition. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my purposes, I haven't run long term deployments where the Redis instance outlives the Celery one(s). It certainly seems feasible but would either:

  • require an extra type check against the backend before we actually build up the right pipeline for the observed type, or
  • construct a list or zset based pipeline depending on whether we think the group is zset based or not (perhaps if group_index is not None?)

Any preference on the approach?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this a bit more, I don't really see how we can reasonably maintain backward compatibility here. Only workers interact with the results backend and use this code, so IIUC we're talking about situations where there's a progressive upgrade of celery workers in a cluster, with some running old code and others running the new code. Or am I misunderstanding and does result resolution on any part of the system also hit the result backend?

The problem with the mixed worker node situation is that old code will unconditionally use operations for lists while new code should optimistically use operations for sorted sets and fall back to lists if it has to interact with older workers. That'll be racy and break if a worker running old code happens to get beaten to the results key by one running new code. I suppose there could be a decision made on something like feature flags (or a naive version check) passed between workers when they mingle but I'm not familiar enough with the clustering protocol to know if that's a reasonable idea.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping @georgepsarakis - any input on this? AFAICT running code which issues tasks on master and a worker on this branch works fine, and vice versa (with the results being unordered since the worker uses a list to store the group results). So I think what I said in the previous comment about mixed code in clusters of workers is where breakage would occur, and need some input on whether we need some sort of behavioural switch based on something sourced from the mingling process.

@maybe-sybr maybe-sybr force-pushed the preserve-group-order branch from 46e327b to b036164 Compare July 13, 2020 03:14
@lgtm-com
Copy link
lgtm-com bot commented Jul 13, 2020

This pull request introduces 2 alerts and fixes 4 when merging b036164 into d537be4 - view on LGTM.com

new alerts:

  • 1 for Non-exception in 'except' clause
  • 1 for Wrong number of arguments in a call

fixed alerts:

  • 3 for Module is imported with 'import' and 'import from'
  • 1 for Non-exception in 'except' clause

@maybe-sybr
Copy link
Contributor Author

I think the Appveyor build may have died/timed out. The red job ran for an hour and seems t have truncated output.

@auvipy
Copy link
Member
auvipy commented Jul 13, 2020

need not to worry about AppVeyor now

@auvipy auvipy closed this Jul 13, 2020
@auvipy auvipy reopened this Jul 13, 2020
@auvipy
Copy link
Member
auvipy commented Jul 14, 2020
      res.children[0].children[0].result[0])

E TypeError: 'NoneType' object is not subscriptable

@maybe-sybr
Copy link
Contributor Author

That's interesting @auvipy - I think that test passed on effectively the same changeset previously.

Build 8047 (red):

Build 8046 (green):

Seems like the test might be racey.

@auvipy auvipy closed this Jul 15, 2020
@auvipy auvipy reopened this Jul 15, 2020
@auvipy
Copy link
Member
auvipy commented Jul 15, 2020

I restarted the build

@maybe-sybr
Copy link
Contributor Author

I managed to hit the error running the chord integration tests in t/integration/test_canvas.py with a redis backend. I'll submit a separate PR if I work out how to fix it since it doesn't seem like it's relevant to this changeset.

@auvipy
Copy link
Member
auvipy commented Jul 15, 2020

all the tests passing now here! please submit another PR for the unrelated error

@auvipy
Copy link
Member
auvipy commented Jul 15, 2020

I will wait for clarification or suggestion about keeping BC from George

@auvipy auvipy merged commit 455e0a0 into celery:master Jul 19, 2020
@maybe-sybr
Copy link
Contributor Author
maybe-sybr commented Jul 20, 2020

@thedrow - I've made no changes to address the back-compat concerns. Awaiting a reply from @georgepsarakis since I'm not sure how we can go about solving the issue (per my comments in that thread) given my current understanding of when problems would manifest (multiple workers running mixed versions of celery IIUC).

Happy to make a follow up PR to fix any potential issues as you all see fit.

@maybe-sybr maybe-sybr deleted the preserve-group-order branch July 27, 2020 23:07
maybe-sybr added a commit to maybe-sybr/celery that referenced this pull request Jul 28, 2020
This should ensure that there is no breakage between workers with the
code from celery#6218 and those without, unless the cluster owner specifically
opts into the new behaviour.
maybe-sybr added a commit to maybe-sybr/celery that referenced this pull request Jul 28, 2020
This should ensure that there is no breakage between workers with the
code from celery#6218 and those without, unless the cluster owner specifically
opts into the new behaviour.
auvipy pushed a commit that referenced this pull request Jul 31, 2020
This should ensure that there is no breakage between workers with the
code from #6218 and those without, unless the cluster owner specifically
opts into the new behaviour.
jeyrce pushed a commit to jeyrce/celery that referenced this pull request Aug 25, 2021
* Preserve order of return values from groups

Fixes celery#3781.

* Update for zadd arguments changed in redis-py 3

* Use more explicit loop variable name

* Handle group_index not set

* Use zrange instead of zrangebyscore

* test: Fix Redis sorted set mocks in backend tests

* test: Make canvas integration tests use `zrange()`

The test suite still uses `lrange()` and `rpush()` to implement its
`redis-echo` task chain integration tests, but these are unrelated to
the handling of group results and remain unchanged.

* test: Add unit tests for `group_index` handling

* fix: Add `group_index` to `Context`, chord uplift

* test: Sanity check `Request.group_index` property

This adds a test to make sure the property exists and also changes the
property to use the private `_request_dict` rather than the public
property.

Co-authored-by: Leo Singer <leo.singer@ligo.org>
jeyrce pushed a commit to jeyrce/celery that referenced this pull request Aug 25, 2021
This should ensure that there is no breakage between workers with the
code from celery#6218 and those without, unless the cluster owner specifically
opts into the new behaviour.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Group tasks' results order getting messed up when using redis
4 participants
0