Fix deferred jobs handling by Marishka17 · Pull Request #9297 · cvat-ai/cvat · GitHub

Fix deferred jobs handling #9297


Open · Marishka17 wants to merge 26 commits into develop from mk/fix_handling_deferred_jobs

Conversation

@Marishka17 Marishka17 (Contributor) commented Apr 2, 2025

Motivation and context

Depends on #9303.
The following issue is also fixed:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/rq/worker.py", line 1443, in perform_job
    self.handle_job_success(job=job, queue=queue, started_job_registry=started_job_registry)
  File "/opt/venv/lib/python3.10/site-packages/rq/worker.py", line 1382, in handle_job_success
    queue.enqueue_dependents(job, pipeline=pipeline)
  File "/opt/venv/lib/python3.10/site-packages/rq/queue.py", line 1203, in enqueue_dependents
    jobs_to_enqueue = [
  File "/opt/venv/lib/python3.10/site-packages/rq/queue.py", line 1209, in <listcomp>
    and dependent_job.dependencies_are_met(
  File "/opt/venv/lib/python3.10/site-packages/rq/job.py", line 1577, in dependencies_are_met
    connection.watch(*[self.key_for(dependency_id) for dependency_id in self._dependency_ids])
  File "/opt/venv/lib/python3.10/site-packages/redis/client.py", line 2138, in watch
    return self.execute_command("WATCH", *names)
  File "/opt/venv/lib/python3.10/site-packages/redis/client.py", line 1901, in execute_command
    return self.immediate_execute_command(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/redis/client.py", line 1940, in immediate_execute_command
    return conn.retry.call_with_retry(
  File "/opt/venv/lib/python3.10/site-packages/redis/retry.py", line 46, in call_with_retry
    return do()
  File "/opt/venv/lib/python3.10/site-packages/redis/client.py", line 1941, in <lambda>
    lambda: self._send_command_parse_response(
  File "/opt/venv/lib/python3.10/site-packages/redis/client.py", line 1246, in _send_command_parse_response
    return self.parse_response(conn, command_name, **options)
  File "/opt/venv/lib/python3.10/site-packages/redis/client.py", line 2060, in parse_response
    result = Redis.parse_response(self, connection, command_name, **options)
  File "/opt/venv/lib/python3.10/site-packages/redis/client.py", line 1286, in parse_response
    response = connection.read_response()
  File "/opt/venv/lib/python3.10/site-packages/redis/connection.py", line 905, in read_response
    raise response
redis.exceptions.ResponseError: wrong number of arguments for 'watch' command
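For context, the error above appears to come from rq calling connection.watch() with zero keys, which can happen when a dependent job ends up with an empty _dependency_ids list (e.g. after its dependency was cancelled and deleted). A minimal sketch, not CVAT code, that reproduces the same server-side error against an assumed local Redis instance:

```python
# Minimal sketch (assumption: a local Redis at the default address).
# WATCH requires at least one key, so rq's connection.watch(*[...]) call
# fails with this exact error when the dependency id list is empty.
import redis

conn = redis.Redis()
pipe = conn.pipeline()

dependency_ids = []  # e.g. a deferred job whose dependency was removed
try:
    pipe.watch(*[f"rq:job:{dep_id}" for dep_id in dependency_ids])
except redis.exceptions.ResponseError as e:
    # redis.exceptions.ResponseError: wrong number of arguments for 'watch' command
    print(e)
```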

How can it be tested manually?
Follow these steps both on the develop branch and on this branch (a hedged repro sketch follows the list):

  1. set the ONE_RUNNING_JOB_IN_QUEUE_PER_USER env variable
  2. make sure that the export process takes some time, e.g., run sleep inside the export function
  3. initiate 3 exports consecutively (X1, X2, X3)
  4. cancel X2 while X1 is running: POST /api/requests/rq_id/cancel
  5. initiate the X2 export one more time (make sure that X1 is still running)
  6. check that X1 finishes successfully and redis.exceptions.ResponseError: wrong number of arguments for 'watch' command is not raised
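A rough sketch of these steps as HTTP calls. The export endpoint path, the response field names, the host, and the credentials below are assumptions for illustration only; the cancel endpoint is the one named in step 4.

```python
# Hypothetical repro sketch: everything except the cancel endpoint is an
# assumption and may not match the actual CVAT API.
import time
import requests

HOST = "http://localhost:8080"   # assumed local CVAT instance
AUTH = ("admin", "admin")        # assumed credentials
TASKS = [1, 2, 3]                # tasks used for exports X1, X2, X3

def start_export(task_id: int) -> str:
    # Assumed export endpoint and response shape; returns an rq_id to track.
    resp = requests.post(
        f"{HOST}/api/tasks/{task_id}/dataset/export",
        params={"format": "CVAT for images 1.1"},
        auth=AUTH,
    )
    resp.raise_for_status()
    return resp.json()["rq_id"]

rq_ids = [start_export(task_id) for task_id in TASKS]  # X1, X2, X3

# Step 4: cancel X2 while X1 is still running.
requests.post(f"{HOST}/api/requests/{rq_ids[1]}/cancel", auth=AUTH).raise_for_status()

# Step 5: re-initiate the X2 export while X1 is still running.
rq_ids[1] = start_export(TASKS[1])

# Step 6: poll X1 and check it finishes; also check worker logs for the 'watch' error.
while True:
    status = requests.get(f"{HOST}/api/requests/{rq_ids[0]}", auth=AUTH).json()["status"]
    if status in ("finished", "failed"):
        print("X1:", status)
        break
    time.sleep(2)
```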

How has this been tested?

Checklist

  • I submit my changes into the develop branch
  • I have created a changelog fragment
  • I have updated the documentation accordingly
  • I have added tests to cover my changes
  • I have linked related issues (see GitHub docs)

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.

@Marishka17 Marishka17 requested a review from SpecLad as a code owner April 2, 2025 11:52
@Marishka17 Marishka17 force-pushed the mk/fix_handling_deferred_jobs branch from 043ac1c to d86870a on April 2, 2025 16:32
@Marishka17 Marishka17 force-pushed the mk/fix_handling_deferred_jobs branch from d86870a to 33643ac on April 2, 2025 16:41
@Marishka17 Marishka17 changed the title [WIP] Fix deferred jobs handling Fix deferred jobs handling Apr 2, 2025
@Marishka17 Marishka17 changed the title Fix deferred jobs handling [do not merge] Fix deferred jobs handling Apr 3, 2025
@Marishka17 Marishka17 requested a review from zhiltsov-max as a code owner April 3, 2025 11:14
@Marishka17 Marishka17 force-pushed the mk/fix_handling_deferred_jobs branch from 7427e7a to 445bc28 on April 3, 2025 11:42
@Marishka17 Marishka17 changed the title [do not merge] Fix deferred jobs handling [Dependent] Fix deferred jobs handling Apr 3, 2025
@Marishka17 Marishka17 force-pushed the mk/fix_handling_deferred_jobs branch from 445bc28 to 00ee82c on April 3, 2025 11:44
@Marishka17 Marishka17 (Contributor, Author) commented:

/check

github-actions bot commented Apr 3, 2025:

❌ Some checks failed
📄 See logs here

@Marishka17 Marishka17 (Contributor, Author) commented:

/check

github-actions bot commented Apr 8, 2025:

❌ Some checks failed
📄 See logs here

@Marishka17 Marishka17 changed the title [Dependent] Fix deferred jobs handling [wip] Fix deferred jobs handling Apr 10, 2025
@Marishka17 Marishka17 (Contributor, Author) commented:

/check

github-actions bot commented Apr 16, 2025:

❌ Some checks failed
📄 See logs here

@Marishka17 Marishka17 changed the title [wip] Fix deferred jobs handling Fix deferred jobs handling May 6, 2025
codecov-commenter commented May 6, 2025

Codecov Report

Attention: Patch coverage is 22.64151% with 41 lines in your changes missing coverage. Please review.

Project coverage is 73.80%. Comparing base (69c8b20) to head (4bfac62).
Report is 10 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9297      +/-   ##
===========================================
- Coverage    73.90%   73.80%   -0.10%     
===========================================
  Files          435      436       +1     
  Lines        45681    45803     +122     
  Branches      3924     3925       +1     
===========================================
+ Hits         33759    33804      +45     
- Misses       11922    11999      +77     
Components Coverage Δ
cvat-ui 77.54% <50.00%> (-0.01%) ⬇️
cvat-server 70.80% <21.56%> (-0.17%) ⬇️

@Marishka17 Marishka17 requested a review from nmanovic as a code owner May 7, 2025 08:56
@Marishka17 Marishka17 (Contributor, Author) commented:

Upstream fix: rq/rq#2241

@Marishka17 Marishka17 removed request for SpecLad and nmanovic May 7, 2025 14:13
@zhiltsov-max zhiltsov-max (Contributor) commented May 9, 2025:

  1. The wrong number of arguments for 'watch' command error seems to be fixed. I reproduced it on develop and haven't seen such errors while testing this PR.
  2. I still could reproduce the issue with multiple jobs (more than 2) from 1 user being queued / started in parallel, even without canceling any jobs. With canceling, in turn, I could get 3 more API issues:
    2.1. It's possible to get HTTP response body: {"detail":"You don't have permission to perform this action"} when trying to retrieve a request if it was cancelled
    2.2. It's possible to get an empty status in a cancelled request, leading to a response parsing failure in the SDK
    2.3. It's possible to get an RQ job parsing error rq.exceptions.NoSuchJobError: Unexpected job format: {'last_heartbeat': b'2025-05-08T15:40:48.317914Z', 'status': b'started', 'started_at': b'2025-05-08T15:40:48.317914Z', 'worker_name': b'390dc20f6f8b40bca0c37fdaa733401d'} while retrieving a cancelled request

I don't think the newly found errors should be addressed in this PR. I'm not sure whether the issue with jobs from one user was supposed to be fixed, but if so, it seems to deserve a separate PR and a deeper investigation of its own.

Several insights on 2:

  • I tested repeated parallel export requests in a loop for 3 jobs and could probably reproduce all the known errors we have met, even with 1 server and 2 export workers.
  • It's possible to get 2 jobs in the started state (or 1 started + 1 queued) from 1 user because of the operation order during job finish in the RQ workers: they first put dependents into the queue and only then report finishing to RQ. This is a kind of false positive (see the sketch after this list).
  • It's possible to get multiple jobs from 1 user in the started state if a job fails because its worker crashed or the job raised an exception. In this case, the started job is kept for some time and then becomes abandoned. Also a false positive.
  • If the job takes some time to be completed, errors can be reproduced much more reliably.
  • The job enqueueing in the dataset export endpoint seems to be correct overall.
  • I still could get hanging deferred jobs, even if their dependency has completed and there was no loop. I'm not sure if the added periodic job could resolve this problem entirely.
  • The error seems to happen only when jobs finish with multiple workers running. Export works fine in most cases; I couldn't reliably pollute the export queue.
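One way to observe the queued/started counts described above without relying on the admin panel is to read the RQ registries directly. A sketch, assuming the export queue is named "export" and that a user id is stored in job.meta["user_id"]; both are assumptions, and the actual CVAT naming may differ:

```python
# Sketch (assumptions: queue name "export", user id in job.meta["user_id"]).
from collections import Counter

import redis
from rq import Queue
from rq.job import Job
from rq.registry import StartedJobRegistry

conn = redis.Redis()  # assumed local Redis
queue = Queue("export", connection=conn)
started = StartedJobRegistry(queue=queue)

# Collect both queued and started jobs and count them per user.
job_ids = queue.get_job_ids() + started.get_job_ids()
jobs = Job.fetch_many(job_ids, connection=conn)

per_user = Counter(job.meta.get("user_id") for job in jobs if job is not None)
for user_id, count in per_user.items():
    if count > 1:
        print(f"user {user_id}: {count} queued/started jobs")
```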

Hypotheses:

  • The job dependency analysis can potentially be an issue; I couldn't rule this hypothesis out so far.
    • There is the call to is_deferred, which does a status refresh. Changing this to get_status() doesn't solve the problem immediately (see the sketch after this list).
    • Maybe dependencies can be missed if they change state during job dependency analysis; the code doesn't get notified about this.
    • Maybe there is some logic conflict between RQ workers pushing dependencies into the queue and jobs with the same names.
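For reference, as far as I can tell from the RQ sources, is_deferred is just a comparison against the refreshed status, while get_status() can also return the cached copy when asked. A sketch with a hypothetical job id:

```python
# Sketch (assumption: "some-rq-job-id" is an existing job in Redis).
import redis
from rq.job import Job, JobStatus

conn = redis.Redis()
job = Job.fetch("some-rq-job-id", connection=conn)

# Both calls below re-read the status from Redis by default; the cached
# value is only used with refresh=False, which is where stale-status races
# could come from.
print(job.is_deferred)                         # refreshed status == DEFERRED?
print(job.get_status() == JobStatus.DEFERRED)  # refreshed explicitly
print(job.get_status(refresh=False))           # cached copy, may be stale
```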

For further analysis, I'd enable job status logging from each worker (add a job UUID into meta for tracking different jobs with the same RQ id), then check whether we really have cases where several jobs are incorrectly processed in parallel (taking into account the order of job finish operations in RQ workers). Currently, I was checking this manually in the Django RQ queues view in the admin panel, and it has some delay. A sketch of such meta tagging follows.
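A possible shape for that kind of tracking, assuming the enqueue call can be wrapped; the helper name and the meta key are hypothetical:

```python
# Sketch: tag each enqueued job with a unique UUID in meta so that different
# job instances reusing the same rq id can be told apart in worker logs.
import logging
import uuid

from rq import Queue

logger = logging.getLogger(__name__)

def enqueue_with_tracking(queue: Queue, func, *args, rq_id: str, **kwargs):
    job = queue.enqueue(
        func,
        *args,
        job_id=rq_id,
        meta={"tracking_uuid": str(uuid.uuid4())},
        **kwargs,
    )
    logger.info(
        "enqueued %s (tracking_uuid=%s, status=%s)",
        job.id, job.meta["tracking_uuid"], job.get_status(),
    )
    return job
```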

@Marishka17 Marishka17 (Contributor, Author) replied:

> With canceling, in turn, I could get 3 more API issues:

I've created a separate issue for this.

> I still could get hanging deferred jobs, even if their dependency has completed and there was no loop. I'm not sure if the added periodic job could resolve this problem entirely.

Could you please comment on why you have such doubts?

> For further analysis, I'd enable job status logging from each worker (add a job UUID into meta for tracking different jobs with same rq id), then find if we really have cases where several jobs are incorrectly processed in parallel (taking into account the order of job finish operations in RQ workers). Currently, I was checking it manually in django queues in the admin panel, and it has some delay.

This also deserves a separate issue/PR.

@zhiltsov-max zhiltsov-max (Contributor) replied:

> Could you please comment on why you have such doubts?

I just hadn't tested calling the new cleanup function in that situation. I double-checked it now, and it seems to be working.

@Marishka17 Marishka17 requested a review from bsekachev as a code owner May 12, 2025 11:48