Fix deferred jobs handling #9297
base: develop
Conversation
Force-pushed from 043ac1c to d86870a
Force-pushed from d86870a to 33643ac
Force-pushed from 7427e7a to 445bc28
Force-pushed from 445bc28 to 00ee82c
/check: ❌ Some checks failed
/check: ❌ Some checks failed
/check: ❌ Some checks failed
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #9297      +/-   ##
===========================================
- Coverage    73.90%   73.80%    -0.10%
===========================================
  Files           435      436        +1
  Lines         45681    45803      +122
  Branches       3924     3925        +1
===========================================
+ Hits          33759    33804       +45
- Misses        11922    11999       +77
Upstream fix: rq/rq#2241
1. The redis.exceptions.ResponseError: wrong number of arguments for 'watch' command error seems to be fixed. I reproduced it on develop and haven't seen such errors during testing on this PR.
2. I could still reproduce the issue with multiple jobs (more than 2) from one user being queued / started in parallel, even without canceling any jobs. With canceling, in turn, I could get 3 more API issues:
2.1. It's possible to get HTTP response body: {"detail":"You don't have permission to perform this action"} when trying to retrieve a request if it was cancelled.
2.2. It's possible to get an empty status in a cancelled request, leading to failing response parsing in the SDK.
2.3. It's possible to get an RQ job parsing error rq.exceptions.NoSuchJobError: Unexpected job format: {'last_heartbeat': b'2025-05-08T15:40:48.317914Z', 'status': b'started', 'started_at': b'2025-05-08T15:40:48.317914Z', 'worker_name': b'390dc20f6f8b40bca0c37fdaa733401d'} when retrieving a cancelled request (see the sketch after this list).
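For 2.3: that parsing error is RQ's Job.fetch() refusing to restore a job hash that has lost its data field, which can apparently happen while a cancelled job is being cleaned up. A minimal sketch of how a request-retrieval endpoint could treat such a hash defensively, assuming only standard RQ / django-rq calls (the function name here is made up for illustration, it is not code from this PR):

```python
import django_rq
from rq.exceptions import NoSuchJobError
from rq.job import Job


def get_request_status(queue_name: str, rq_job_id: str):
    """Fetch an RQ job for a request, treating a half-deleted hash as missing."""
    connection = django_rq.get_connection(queue_name)
    try:
        job = Job.fetch(rq_job_id, connection=connection)
    except NoSuchJobError:
        # While a cancelled job is being cleaned up, its hash can lose the
        # 'data' field, and Job.fetch() then raises "Unexpected job format".
        # Reporting the request as missing avoids leaking that error into
        # the API response.
        return None
    return job.get_status(refresh=False)
```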
I don't think the newly found errors should be addressed in this PR. I'm not sure whether the issue with multiple jobs from one user was supposed to be fixed here, but if so, it seems to deserve a separate PR and a deeper investigation.
Several insights on 2:
- I tested repeated parallel export requests in a loop for 3 jobs, and this probably reproduced all the known errors we have met, even with 1 server and 2 export workers.
- It's possible to get 2 jobs in the started state (or 1 started + 1 queued) from 1 user because of the operation order during job finish in the RQ workers: they first put dependent jobs into the queue and only then report their own job as finished to RQ. This is a kind of false positive.
- It's possible to get multiple jobs from 1 user in the started state if the job fails because the worker crashed or there was an exception in the job. In this case, the started job is kept for some time and then becomes abandoned. Also a false positive.
- If the job takes some time to complete, the errors can be reproduced much more reliably.
- The job enqueueing in the dataset export endpoint seems to be correct overall.
- I could still get hanging deferred jobs, even if their dependency had completed and there was no dependency loop. I'm not sure if the added periodic job resolves this problem entirely (see the sketch after this list).
- The error seems to happen only when jobs finish with multiple workers. Export works fine in most cases; I couldn't reliably pollute the export queue.
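Regarding the hanging deferred jobs mentioned above, a rough sketch of what a periodic cleanup could look like using plain RQ primitives; the queue name is an assumption, and this is not necessarily how the cleanup job added in this PR is implemented:

```python
import django_rq
from rq.exceptions import NoSuchJobError
from rq.job import Job, JobStatus
from rq.registry import DeferredJobRegistry


def requeue_stuck_deferred_jobs(queue_name: str = "export") -> None:
    """Re-enqueue deferred jobs whose dependencies have all finished."""
    queue = django_rq.get_queue(queue_name)
    registry = DeferredJobRegistry(queue=queue)

    for job_id in registry.get_job_ids():
        try:
            job = Job.fetch(job_id, connection=queue.connection)
        except NoSuchJobError:
            continue  # the job may have been deleted in the meantime

        dependencies = job.fetch_dependencies()
        if all(dep.get_status(refresh=True) == JobStatus.FINISHED for dep in dependencies):
            # The dependencies are done, but enqueue_dependents() never picked
            # this job up; move it from the deferred registry into the queue.
            registry.remove(job)
            queue.enqueue_job(job)
```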
Hypotheses:
- The job dependency analysis can potentially be an issue; I couldn't discard this hypothesis so far.
- There is a call to is_deferred, which does a status refresh. Changing this to get_status() doesn't solve the problem immediately.
- Maybe dependencies can be missed if they change state during the job dependency analysis, since the code doesn't get notified about such changes (see the sketch after this list).
- Maybe there is some logic conflict between the dependency pushing into the queue done by RQ workers and jobs with the same names.
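To illustrate the "missed state change" hypothesis: the usual way to make such a check-then-act sequence safe in Redis is an optimistic WATCH transaction, so that a dependency finishing mid-analysis invalidates the decision and forces a retry (this is also where the 'watch' command error comes from, since WATCH requires at least one key). A sketch of that pattern under those assumptions, not the actual code in this PR or in RQ:

```python
from redis import Redis, WatchError
from rq.job import Job, JobStatus


def enqueue_or_defer(connection: Redis, job: Job, dependency_ids: list[str]) -> None:
    """Decide whether `job` can run, retrying if a dependency changes meanwhile."""
    dependency_keys = [Job.key_for(dep_id) for dep_id in dependency_ids]
    with connection.pipeline() as pipe:
        while True:
            try:
                if dependency_keys:
                    # WATCH with no keys fails with
                    # "wrong number of arguments for 'watch' command"
                    pipe.watch(*dependency_keys)
                all_finished = all(
                    Job.fetch(dep_id, connection=connection).get_status(refresh=True)
                    == JobStatus.FINISHED
                    for dep_id in dependency_ids
                )
                pipe.multi()
                # ... enqueue `job` if all_finished, otherwise register it as
                # deferred, issuing the writes through `pipe` so they are
                # discarded if a watched dependency key changed ...
                pipe.execute()
                break
            except WatchError:
                continue  # a dependency changed state during the check; retry
```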
For further analysis, I'd enable job status logging from each worker (adding a job UUID into meta to track different jobs with the same RQ id), then check whether we really have cases where several jobs are incorrectly processed in parallel (taking into account the order of the job-finish operations in RQ workers). So far I have been checking this manually in the Django admin panel's queue pages, which has some delay.
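A minimal sketch of the suggested tracking, assuming only the standard RQ meta API (the field and function names are arbitrary):

```python
import logging
import uuid

logger = logging.getLogger(__name__)


def tag_job_for_tracking(job) -> str:
    """Attach a unique tracking id to an RQ job's meta before it is enqueued."""
    tracking_id = str(uuid.uuid4())
    job.meta["tracking_id"] = tracking_id
    job.save_meta()
    return tracking_id


def log_job_status(job) -> None:
    """Log the current status together with the tracking id, so jobs that
    reuse the same rq id can still be told apart in the worker logs."""
    job.refresh()  # re-read status, meta and worker_name from Redis
    logger.info(
        "rq job %s (tracking_id=%s) is %s on worker %s",
        job.id,
        job.meta.get("tracking_id"),
        job.get_status(refresh=False),
        job.worker_name,
    )
```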
With canceling, in turn, I could get 3 more API issues:
I've created a separate issue for this
I could still get hanging deferred jobs, even if their dependency had completed and there was no dependency loop. I'm not sure if the added periodic job resolves this problem entirely.
Could you please comment on why you have such doubts?
For further analysis, I'd enable job status logging from each worker (adding a job UUID into meta to track different jobs with the same RQ id), then check whether we really have cases where several jobs are incorrectly processed in parallel (taking into account the order of the job-finish operations in RQ workers). So far I have been checking this manually in the Django admin panel's queue pages, which has some delay.
This also deserves a separate issue/PR
Could you please comment on why you have such doubts?
I just didn't test calling the new cleanup function for that situation. I have double-checked it now, and it seems to be working.
Motivation and context
Depends on #9303. The following issue is also fixed:
How can it be tested manually?
Follow these steps on the develop branch and on this PR branch:
- Enable the ONE_RUNNING_JOB_IN_QUEUE_PER_USER env variable
- Call the export func (see the sketch below)
- Check that redis.exceptions.ResponseError: wrong number of arguments for 'watch' command is not raised
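One possible shape for the export calls is a parallel request loop; the host, credentials, task ids, and endpoint path below are assumptions for illustration only and may need to be adjusted to your deployment (or replaced with SDK calls):

```python
# Fire several export requests in parallel, repeatedly, and watch the server
# logs for redis.exceptions.ResponseError.
import concurrent.futures

import requests

CVAT_URL = "http://localhost:8080"
AUTH = ("admin", "password")
TASK_IDS = [1, 2, 3]


def trigger_export(task_id: int) -> int:
    response = requests.post(
        f"{CVAT_URL}/api/tasks/{task_id}/dataset/export",
        params={"format": "CVAT for images 1.1"},
        auth=AUTH,
    )
    return response.status_code


for _ in range(20):  # repeat to make the race more likely
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(TASK_IDS)) as pool:
        print(list(pool.map(trigger_export, TASK_IDS)))
```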
How has this been tested?
Checklist
- I submit my changes into the develop branch
License
- I submit my code changes under the same MIT License that covers the project. Feel free to contact the maintainers if that's a concern.