"Task stuck in queued" should not count against retries #38304
Comments
@potiuk As a new contributor to Airflow, I would like to work on this issue!
@potiuk
Could I check this issue, @potiuk?
Assigned you
@potiuk Currently, if a task is stuck in the queue for too long, we fail the task. If we keep a separate try_number for a task that failed while queued, we may not know at retry time why the task failed, i.e. whether it failed during the run or because it was stuck in the queue. How do you think we can handle this case? (See airflow/airflow/jobs/scheduler_job_runner.py, lines 1541 to 1576 at 0af5d92.)
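For context, a minimal sketch of the kind of check being discussed; the names, query, and session handling here are simplified stand-ins and do not reproduce the actual `_fail_tasks_stuck_in_queued` implementation referenced above:

```python
# Simplified, hypothetical sketch of a "fail tasks stuck in queued" pass.
from datetime import timedelta

from airflow.models.taskinstance import TaskInstance
from airflow.utils import timezone
from airflow.utils.state import TaskInstanceState


def fail_tasks_stuck_in_queued(session, executor, task_queued_timeout: float) -> None:
    """Fail task instances that have sat in QUEUED longer than the timeout."""
    cutoff = timezone.utcnow() - timedelta(seconds=task_queued_timeout)
    stuck_tis = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.state == TaskInstanceState.QUEUED,
            TaskInstance.queued_dttm < cutoff,
        )
        .all()
    )
    for ti in stuck_tis:
        # Today this goes through the normal failure path, which is why a
        # stuck-in-queued kill ends up consuming one of the task's retries.
        executor.fail(ti.key)
```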
The retry logic is handled here in taskinstance.py: the task is marked failed and then checked for retry eligibility; if it is eligible, it is queued again. (See airflow/airflow/models/taskinstance.py, lines 2992 to 3015 at b6ff085.)
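For reference, a heavily condensed sketch of the eligibility check being described; the real `handle_failure` / `is_eligible_to_retry` code in `taskinstance.py` does considerably more:

```python
# Condensed sketch of the retry decision made when a task instance fails.
from airflow.utils.state import TaskInstanceState


def handle_failure_sketch(ti) -> None:
    # is_eligible_to_retry() boils down to: the task has retries configured
    # and the current try_number has not exhausted max_tries.
    if ti.task.retries and ti.try_number <= ti.max_tries:
        # Eligible: the task goes to UP_FOR_RETRY and is queued again later.
        ti.state = TaskInstanceState.UP_FOR_RETRY
    else:
        # Not eligible: the task instance is marked FAILED for good.
        ti.state = TaskInstanceState.FAILED
```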
The thing is that the task_instance is not created yet because ... the task is in the queue. So the whole logic should happen in the scheduler, because it's the scheduler (and, to be precise, the executor) that realizes the task is in the queued state. The handling should be different in the executor, not in the task instance; that's the whole complexity of this task.
Got it ... let me check where this is handled on the executor side, where the failed task is moved back to the queue but the attempt is still deducted from try_number.
@potiuk The task_instance would be created with state queued, right? I understand that this should be handled on the executor side, similar to how _fail_tasks_stuck_in_queued is handled. But I can see that TI models are queried to find the queued tasks, which would mean a task_instance object has been created. If I am missing the point you made, please let me know. (See airflow/airflow/jobs/scheduler_job_runner.py, lines 1541 to 1576 at 0af5d92.)
That's a good understanding; it should likely be done somewhere around there.
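To make the direction concrete, here is a rough, hypothetical illustration of the proposed behavior; `stuck_in_queued_attempts` and `max_stuck_in_queued_attempts` are invented names for this sketch and do not exist in Airflow at the time of this discussion:

```python
# Hypothetical sketch: re-queue a task killed for being stuck in queued
# without touching try_number or the task's own `retries` budget.
from airflow.utils.state import TaskInstanceState


def handle_stuck_in_queued(ti, max_stuck_in_queued_attempts: int) -> None:
    attempts = getattr(ti, "stuck_in_queued_attempts", 0)
    if attempts < max_stuck_in_queued_attempts:
        # The task never started, so simply send it back to SCHEDULED;
        # try_number stays untouched because no attempt actually ran.
        ti.stuck_in_queued_attempts = attempts + 1
        ti.state = TaskInstanceState.SCHEDULED
    else:
        # Give up after the configured number of launch attempts.
        ti.state = TaskInstanceState.FAILED
```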
@collinmcnulty I know it has been a long time, but can you share the executor type these task instances were running on?
@Bowrna Mine were running on the Celery executor, but I think the solution ought to be agnostic to the executor used.
@collinmcnulty I have tagged you in #39398 (comment), and we can continue the conversation on this part in that PR.
Description
I think Airflow should have a configurable number of attempts for re-launching a task that was killed because it was stuck in queued for too long. Currently, such re-attempts consume the task's retries, but failing to launch at all is conceptually distinct from the task running and failing.
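As an illustration of the kind of knob being proposed, a sketch of how it might be read alongside the existing `[scheduler] task_queued_timeout` option; `num_stuck_in_queued_retries` is a hypothetical name used only for this example:

```python
# Hypothetical configuration lookup for the proposed setting.
from airflow.configuration import conf

# Existing option: how long a task may sit in QUEUED before being killed.
task_queued_timeout = conf.getfloat("scheduler", "task_queued_timeout")

# Proposed (hypothetical) option: how many times to re-launch a task that
# was killed for being stuck in queued, independently of task retries.
num_stuck_in_queued_retries = conf.getint("scheduler", "num_stuck_in_queued_retries", fallback=2)
```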
Use case/motivation
On a certain task that happens to not be idempotent, an Airflow user intentionally sets retries to zero, because a human needs to examine whether the task can be safely retried or whether manual intervention is necessary. However, if that same task is killed for being stuck in queued, the task never started, so the lack of idempotency does not matter and the task should definitely be re-attempted. Airflow currently does not allow a user to express this set of preferences.
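For example, the kind of task described above might look like this (a minimal illustrative DAG, not taken from the issue report):

```python
# Illustrative example: a non-idempotent task with retries intentionally
# set to 0, so a human must review any failure before re-running it.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="billing_run", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    charge_customers = BashOperator(
        task_id="charge_customers",
        bash_command="run_billing_job --date {{ ds }}",
        # retries=0 on purpose: re-running a partially completed billing job
        # is unsafe. A stuck-in-queued kill, however, never started the job,
        # so re-attempting the launch would be safe.
        retries=0,
    )
```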
Related issues
No response
Are you willing to submit a PR?
Code of Conduct