Fix deferred jobs handling by Marishka17 · Pull Request #9297 · cvat-ai/cvat · GitHub

Fix deferred jobs handling #9297


Open · Marishka17 wants to merge 26 commits into develop from mk/fix_handling_deferred_jobs

Conversation

@Marishka17 Marishka17 (Contributor) commented Apr 2, 2025

Motivation and context

Depends on #9303.
The following issue is also fixed:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/rq/worker.py", line 1443, in perform_job
    self.handle_job_success(job=job, queue=queue, started_job_registry=started_job_registry)
  File "/opt/venv/lib/python3.10/site-packages/rq/worker.py", line 1382, in handle_job_success
    queue.enqueue_dependents(job, pipeline=pipeline)
  File "/opt/venv/lib/python3.10/site-packages/rq/queue.py", line 1203, in enqueue_dependents
    jobs_to_enqueue = [
  File "/opt/venv/lib/python3.10/site-packages/rq/queue.py", line 1209, in <listcomp>
    and dependent_job.dependencies_are_met(
  File "/opt/venv/lib/python3.10/site-packages/rq/job.py", line 1577, in dependencies_are_met
    connection.watch(*[self.key_for(dependency_id) for dependency_id in self._dependency_ids])
  File "/opt/venv/lib/python3.10/site-packages/redis/client.py", line 2138, in watch
    return self.execute_command("WATCH", *names)
  File "/opt/venv/lib/python3.10/site-packages/redis/client.py", line 1901, in execute_command
    return self.immediate_execute_command(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/redis/client.py", line 1940, in immediate_execute_command
    return conn.retry.call_with_retry(
  File "/opt/venv/lib/python3.10/site-packages/redis/retry.py", line 46, in call_with_retry
    return do()
  File "/opt/venv/lib/python3.10/site-packages/redis/client.py", line 1941, in <lambda>
    lambda: self._send_command_parse_response(
  File "/opt/venv/lib/python3.10/site-packages/redis/client.py", line 1246, in _send_command_parse_response
    return self.parse_response(conn, command_name, **options)
  File "/opt/venv/lib/python3.10/site-packages/redis/client.py", line 2060, in parse_response
    result = Redis.parse_response(self, connection, command_name, **options)
  File "/opt/venv/lib/python3.10/site-packages/redis/client.py", line 1286, in parse_response
    response = connection.read_response()
  File "/opt/venv/lib/python3.10/site-packages/redis/connection.py", line 905, in read_response
    raise response
redis.exceptions.ResponseError: wrong number of arguments for 'watch' command
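For context, the error above appears to come from rq calling connection.watch() with zero keys, which can happen when a dependent job ends up with an empty _dependency_ids list (e.g. after its dependency was cancelled and deleted). A minimal sketch, not CVAT code, that reproduces the same server-side error against an assumed local Redis instance:

```python
# Minimal sketch (assumption: a local Redis at the default address).
# WATCH requires at least one key, so rq's connection.watch(*[...]) call
# fails with this exact error when the dependency id list is empty.
import redis

conn = redis.Redis()
pipe = conn.pipeline()

dependency_ids = []  # e.g. a deferred job whose dependency was removed
try:
    pipe.watch(*[f"rq:job:{dep_id}" for dep_id in dependency_ids])
except redis.exceptions.ResponseError as e:
    # redis.exceptions.ResponseError: wrong number of arguments for 'watch' command
    print(e)
```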

How can it be tested manually?
Follow these steps both on the develop branch and on this branch (a hedged repro sketch follows the list):

  1. set the ONE_RUNNING_JOB_IN_QUEUE_PER_USER env variable
  2. make sure that the export process takes some time, e.g., run sleep inside the export function
  3. initiate 3 exports consecutively (X1, X2, X3)
  4. cancel X2 while X1 is running: POST /api/requests/rq_id/cancel
  5. initiate the X2 export one more time (make sure that X1 is still running)
  6. check that X1 finishes successfully and redis.exceptions.ResponseError: wrong number of arguments for 'watch' command is not raised
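A rough sketch of these steps as HTTP calls. The export endpoint path, the response field names, the host, and the credentials below are assumptions for illustration only; the cancel endpoint is the one named in step 4.

```python
# Hypothetical repro sketch: everything except the cancel endpoint is an
# assumption and may not match the actual CVAT API.
import time
import requests

HOST = "http://localhost:8080"   # assumed local CVAT instance
AUTH = ("admin", "admin")        # assumed credentials
TASKS = [1, 2, 3]                # tasks used for exports X1, X2, X3

def start_export(task_id: int) -> str:
    # Assumed export endpoint and response shape; returns an rq_id to track.
    resp = requests.post(
        f"{HOST}/api/tasks/{task_id}/dataset/export",
        params={"format": "CVAT for images 1.1"},
        auth=AUTH,
    )
    resp.raise_for_status()
    return resp.json()["rq_id"]

rq_ids = [start_export(task_id) for task_id in TASKS]  # X1, X2, X3

# Step 4: cancel X2 while X1 is still running.
requests.post(f"{HOST}/api/requests/{rq_ids[1]}/cancel", auth=AUTH).raise_for_status()

# Step 5: re-initiate the X2 export while X1 is still running.
rq_ids[1] = start_export(TASKS[1])

# Step 6: poll X1 and check it finishes; also check worker logs for the 'watch' error.
while True:
    status = requests.get(f"{HOST}/api/requests/{rq_ids[0]}", auth=AUTH).json()["status"]
    if status in ("finished", "failed"):
        print("X1:", status)
        break
    time.sleep(2)
```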

How has this been tested?

Checklist

  • I submit my changes into the develop branch
  • I have created a changelog fragment
  • I have updated the documentation accordingly
  • I have added tests to cover my changes
  • I have linked related issues (see GitHub docs)

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.

@Marishka17 Marishka17 requested a review from SpecLad as a code owner April 2, 2025 11:52
@Marishka17 Marishka17 force-pushed the mk/fix_handling_deferred_jobs branch from 043ac1c to d86870a on April 2, 2025 16:32
@Marishka17 Marishka17 force-pushed the mk/fix_handling_deferred_jobs branch from d86870a to 33643ac on April 2, 2025 16:41
@Marishka17 Marishka17 changed the title [WIP] Fix deferred jobs handling Fix deferred jobs handling Apr 2, 2025
@Marishka17 Marishka17 changed the title Fix deferred jobs handling [do not merge] Fix deferred jobs handling Apr 3, 2025
@Marishka17 Marishka17 requested a review from zhiltsov-max as a code owner April 3, 2025 11:14
@Marishka17 Marishka17 force-pushed the mk/fix_handling_deferred_jobs branch from 7427e7a to 445bc28 on April 3, 2025 11:42
@Marishka17 Marishka17 changed the title [do not merge] Fix deferred jobs handling [Dependent] Fix deferred jobs handling Apr 3, 2025
@Marishka17 Marishka17 force-pushed the mk/fix_handling_deferred_jobs branch from 445bc28 to 00ee82c on April 3, 2025 11:44
@Marishka17 Marishka17 (Contributor, Author) commented:

/check

github-actions bot commented Apr 3, 2025:

❌ Some checks failed
📄 See logs here

@Marishka17 Marishka17 (Contributor, Author) commented:

/check

github-actions bot commented Apr 8, 2025:

❌ Some checks failed
📄 See logs here

@Marishka17 Marishka17 changed the title [Dependent] Fix deferred jobs handling [wip] Fix deferred jobs handling Apr 10, 2025
@Marishka17 Marishka17 (Contributor, Author) commented:

/check

github-actions bot commented Apr 16, 2025:

❌ Some checks failed
📄 See logs here

@Marishka17 Marishka17 changed the title [wip] Fix deferred jobs handling Fix deferred jobs handling May 6, 2025
codecov-commenter commented May 6, 2025

Codecov Report

Attention: Patch coverage is 22.64151% with 41 lines in your changes missing coverage. Please review.

Project coverage is 73.80%. Comparing base (69c8b20) to head (4bfac62).
Report is 10 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9297      +/-   ##
===========================================
- Coverage    73.90%   73.80%   -0.10%     
===========================================
  Files          435      436       +1     
  Lines        45681    45803     +122     
  Branches      3924     3925       +1     
===========================================
+ Hits         33759    33804      +45     
- Misses       11922    11999      +77     
Components Coverage Δ
cvat-ui 77.54% <50.00%> (-0.01%) ⬇️
cvat-server 70.80% <21.56%> (-0.17%) ⬇️

@Marishka17 Marishka17 requested a review from nmanovic as a code owner May 7, 2025 08:56
@Marishka17 Marishka17 (Contributor, Author) commented:

Upstream fix: rq/rq#2241

@Marishka17 Marishka17 removed request for SpecLad and nmanovic May 7, 2025 14:13
@zhiltsov-max zhiltsov-max (Contributor) commented May 9, 2025:

  1. The wrong number of arguments for 'watch' command error seems to be fixed. I reproduced it on develop and haven't seen such errors while testing this PR.
  2. I still could reproduce the issue with multiple jobs (more than 2) from 1 user being queued / started in parallel, even without canceling any jobs. With canceling, in turn, I could get 3 more API issues:
    2.1. It's possible to get HTTP response body: {"detail":"You don't have permission to perform this action"} when trying to retrieve a request if it was cancelled
    2.2. It's possible to get an empty status in a cancelled request, leading to a response parsing failure in the SDK
    2.3. It's possible to get an RQ job parsing error rq.exceptions.NoSuchJobError: Unexpected job format: {'last_heartbeat': b'2025-05-08T15:40:48.317914Z', 'status': b'started', 'started_at': b'2025-05-08T15:40:48.317914Z', 'worker_name': b'390dc20f6f8b40bca0c37fdaa733401d'} while retrieving a cancelled request

I don't think the newly found errors should be addressed in this PR. I'm not sure whether the issue with jobs from one user was supposed to be fixed, but if so, it seems to deserve a separate PR and a deeper investigation of its own.

Several insights on 2:

  • I tested repeated parallel export requests in a loop for 3 jobs and could probably reproduce all the known errors we have met, even with 1 server and 2 export workers.
  • It's possible to get 2 jobs in the started state (or 1 started + 1 queued) from 1 user because of the operation order during job finish in the RQ workers: they first put dependents into the queue and only then report finishing to RQ. This is a kind of false positive (see the sketch after this list).
  • It's possible to get multiple jobs from 1 user in the started state if a job fails because its worker crashed or the job raised an exception. In this case, the started job is kept for some time and then becomes abandoned. Also a false positive.
  • If the job takes some time to be completed, errors can be reproduced much more reliably.
  • The job enqueueing in the dataset export endpoint seems to be correct overall.
  • I still could get hanging deferred jobs, even if their dependency has completed and there was no loop. I'm not sure if the added periodic job could resolve this problem entirely.
  • The error seems to happen only when jobs finish with multiple workers running. Export works fine in most cases; I couldn't reliably pollute the export queue.
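One way to observe the queued/started counts described above without relying on the admin panel is to read the RQ registries directly. A sketch, assuming the export queue is named "export" and that a user id is stored in job.meta["user_id"]; both are assumptions, and the actual CVAT naming may differ:

```python
# Sketch (assumptions: queue name "export", user id in job.meta["user_id"]).
from collections import Counter

import redis
from rq import Queue
from rq.job import Job
from rq.registry import StartedJobRegistry

conn = redis.Redis()  # assumed local Redis
queue = Queue("export", connection=conn)
started = StartedJobRegistry(queue=queue)

# Collect both queued and started jobs and count them per user.
job_ids = queue.get_job_ids() + started.get_job_ids()
jobs = Job.fetch_many(job_ids, connection=conn)

per_user = Counter(job.meta.get("user_id") for job in jobs if job is not None)
for user_id, count in per_user.items():
    if count > 1:
        print(f"user {user_id}: {count} queued/started jobs")
```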

Hypotheses:

  • The job dependency analysis can potentially be an issue; I couldn't rule this hypothesis out so far.
    • There is the call to is_deferred, which does a status refresh. Changing this to get_status() doesn't solve the problem immediately (see the sketch after this list).
    • Maybe dependencies can be missed if they change state during job dependency analysis; the code doesn't get notified about this.
    • Maybe there is some logic conflict between RQ workers pushing dependencies into the queue and jobs with the same names.
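For reference, as far as I can tell from the RQ sources, is_deferred is just a comparison against the refreshed status, while get_status() can also return the cached copy when asked. A sketch with a hypothetical job id:

```python
# Sketch (assumption: "some-rq-job-id" is an existing job in Redis).
import redis
from rq.job import Job, JobStatus

conn = redis.Redis()
job = Job.fetch("some-rq-job-id", connection=conn)

# Both calls below re-read the status from Redis by default; the cached
# value is only used with refresh=False, which is where stale-status races
# could come from.
print(job.is_deferred)                         # refreshed status == DEFERRED?
print(job.get_status() == JobStatus.DEFERRED)  # refreshed explicitly
print(job.get_status(refresh=False))           # cached copy, may be stale
```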

For further analysis, I'd enable job status logging from each worker (add a job UUID into meta for tracking different jobs with the same RQ id), then check whether we really have cases where several jobs are incorrectly processed in parallel (taking into account the order of job finish operations in RQ workers). Currently, I was checking this manually in the Django RQ queues view in the admin panel, and it has some delay. A sketch of such meta tagging follows.
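A possible shape for that kind of tracking, assuming the enqueue call can be wrapped; the helper name and the meta key are hypothetical:

```python
# Sketch: tag each enqueued job with a unique UUID in meta so that different
# job instances reusing the same rq id can be told apart in worker logs.
import logging
import uuid

from rq import Queue

logger = logging.getLogger(__name__)

def enqueue_with_tracking(queue: Queue, func, *args, rq_id: str, **kwargs):
    job = queue.enqueue(
        func,
        *args,
        job_id=rq_id,
        meta={"tracking_uuid": str(uuid.uuid4())},
        **kwargs,
    )
    logger.info(
        "enqueued %s (tracking_uuid=%s, status=%s)",
        job.id, job.meta["tracking_uuid"], job.get_status(),
    )
    return job
```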

@Marishka17 Marishka17 (Contributor, Author) replied:

> With canceling, in turn, I could get 3 more API issues:

I've created a separate issue for this.

> I still could get hanging deferred jobs, even if their dependency has completed and there was no loop. I'm not sure if the added periodic job could resolve this problem entirely.

Could you please comment on why you have such doubts?

> For further analysis, I'd enable job status logging from each worker (add a job UUID into meta for tracking different jobs with same rq id), then find if we really have cases where several jobs are incorrectly processed in parallel (taking into account the order of job finish operations in RQ workers). Currently, I was checking it manually in django queues in the admin panel, and it has some delay.

This also deserves a separate issue/PR.

@zhiltsov-max zhiltsov-max (Contributor) replied:

> Could you please comment on why you have such doubts?

I just hadn't tested calling the new cleanup function in that situation. I double-checked it now, and it seems to be working.

@Marishka17 Marishka17 requested a review from bsekachev as a code owner May 12, 2025 11:48