airflow: Task stuck in "scheduled" or "queued" state, pool has all slots queued, nothing is executing
Apache Airflow version: 2.0.0
Kubernetes version (if you are using kubernetes) (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.14-gke.1600", GitCommit:"7c407f5cc8632f9af5a2657f220963aa7f1c46e7", GitTreeState:"clean", BuildDate:"2020-12-07T09:22:27Z", GoVersion:"go1.13.15b4", Compiler:"gc", Platform:"linux/amd64"}
Environment:
- Cloud provider or hardware configuration: GKE
- OS (e.g. from /etc/os-release):
- Kernel (e.g. uname -a):
- Install tools:
- Others:
- Airflow metadata database is hooked up to a PostgreSQL instance
What happened:
- Airflow 2.0.0 running on the KubernetesExecutor has many tasks stuck in “scheduled” or “queued” state which never get resolved.
- The setup has a default_pool of 16 slots.
- Currently no slots are used (see screenshot), but all slots are queued.
- No work is executed any more. The executor or scheduler is stuck.
- There are many tasks stuck in “scheduled” state.
- Tasks in “scheduled” state say ('Not scheduling since there are %s open slots in pool %s and require %s pool slots', 0, 'default_pool', 1). That is simply not true, because there is nothing running on the cluster and there are always 16 tasks stuck in “queued”.
- There are many tasks stuck in “queued” state.
- Tasks in “queued” state say Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run. That is also not true. Nothing is running on the cluster and Airflow is likely just lying to itself. It seems the KubernetesExecutor and the scheduler easily go out of sync.
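For context on that message, my understanding of the pool arithmetic (a paraphrase, not the exact Airflow source) is that open slots are total slots minus the slots held by both running and queued task instances, which is exactly how a pool can report 0 open slots while nothing is running:

```python
# Paraphrase of the pool accounting (not the exact Airflow source): queued
# task instances occupy pool slots just like running ones, so 16 queued tasks
# exhaust a 16-slot default_pool even though nothing is executing.
def open_slots(total_slots: int, running: int, queued: int) -> int:
    occupied = running + queued
    return total_slots - occupied


print(open_slots(total_slots=16, running=0, queued=16))  # -> 0, hence the log message
```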
What you expected to happen:
- Airflow should resolve scheduled or queued tasks by itself once the pool has available slots
- Airflow should use all available slots in the pool
- It should be possible to clear a couple hundred tasks and expect the system to stay consistent
How to reproduce it:
- Vanilla Airflow 2.0.0 with KubernetesExecutor on Python 3.7.9
- requirements.txt:
  pyodbc==4.0.30
  pycryptodomex==3.9.9
  apache-airflow-providers-google==1.0.0
  apache-airflow-providers-odbc==1.0.0
  apache-airflow-providers-postgres==1.0.0
  apache-airflow-providers-cncf-kubernetes==1.0.0
  apache-airflow-providers-sftp==1.0.0
  apache-airflow-providers-ssh==1.0.0
- The only reliable way to trigger that weird bug is to clear the task state of many tasks at once (> 300 tasks).
Anything else we need to know:
Don’t know; as always, I am happy to help debug this problem. The scheduler/executor seems to go out of sync and never gets back in sync with the state of the world.
We actually planned to upscale our Airflow installation with many more simultaneous tasks. With these severe yet basic scheduling/queuing problems we cannot move forward at all.
Another strange, likely unrelated observation: the scheduler always uses 100% of the CPU, burning it. Even with no scheduled or queued tasks, it is always very busy.
Workaround:
The only workaround for this problem I could find so far is to manually go in, find all tasks in “queued” state and clear them all at once. Without that, the whole cluster/Airflow just stays stuck as it is.
About this issue
- State: closed
- Created 3 years ago
- Reactions: 4
- Comments: 145 (67 by maintainers)
This is happening with the Celery executor as well. I’m using Airflow 2.0.0 with the Celery executor and MySQL, facing a similar issue. Sorry for the basic question, but I’m unable to figure out the manual way to find all tasks in “queued” state and clear them. Can somebody help here?
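Not an official Airflow command, just a minimal sketch of one way to do it via the ORM against the metadata database (assuming you can run it somewhere with the Airflow configuration available; resetting the state to “no status” lets the scheduler consider the tasks again):

```python
# Minimal sketch (not an official Airflow command): list task instances stuck
# in "queued" and reset their state so the scheduler re-examines them.
from airflow.models import TaskInstance
from airflow.utils.session import provide_session
from airflow.utils.state import State


@provide_session
def reset_queued_task_instances(session=None):
    stuck = session.query(TaskInstance).filter(TaskInstance.state == State.QUEUED).all()
    for ti in stuck:
        print(f"resetting {ti.dag_id}.{ti.task_id} @ {ti.execution_date}")
        ti.state = None  # back to "no status"; the scheduler will pick it up again
    session.commit()


if __name__ == "__main__":
    reset_queued_task_instances()
```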
We were running into a similar issue as we have 100+ dags and around a 1000 tasks.
I figured out there is a bug in the celery_executor which I still want to fix myself and contribute.

Summary of the problem: at the start of the scheduler, the celery_executor class instance of the scheduler picks up everything from “dead” schedulers (your previous run). That is (if you run one scheduler) every TaskInstance in the Running, Queued or Scheduled state. Then, once it has verified that such a task is not running (which takes 10 minutes), it clears most of the references but forgets a crucial one, making it such that the scheduler can NEVER start this task anymore. You can still start it via the webserver because that has its own celery_executor class instance.
What we noticed:
[2021-06-14 14:07:31,932] {base_executor.py:152} DEBUG - -62 open slots
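For context, that number comes from the executor’s in-memory bookkeeping, roughly parallelism minus the number of task keys it still believes are running; a paraphrase (not the exact Airflow source):

```python
# Paraphrase of the executor slot bookkeeping (not the exact Airflow source).
# If stale keys are never removed from `running`, open_slots goes negative,
# which is how you get "-62 open slots" while nothing is actually running.
class ExecutorSlots:
    def __init__(self, parallelism: int = 32):
        self.parallelism = parallelism
        self.queued_tasks = {}  # TaskInstanceKey -> queued command
        self.running = set()    # TaskInstanceKeys the executor believes are running

    @property
    def open_slots(self) -> int:
        return self.parallelism - len(self.running)
```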
What you can do to verify whether you have the same issue:
Our fix:
Hope this helps anyone and saves you a couple days of debugging 😃
Back on this. I am currently observing the behaviour again.
I can confirm:
The issue persists with 2.0.1; please update the tag accordingly. The issue is definitely “critical”, as it halts THE ENTIRE Airflow operation…!
I had the same problem today and I think I found the problem.
I’m testing with: Apache Airflow version 2.0.1, Executor: Celery, running locally.

I was testing one DAG and, after changing a few parameters in one of the tasks in the dag file and clearing the tasks, the task got stuck in scheduled state. The problem: the changes that I made broke the task, something like this:
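(The original snippet did not survive here; the following is only a hypothetical reconstruction of that kind of mistake, using BashOperator and an invented argument name.)

```python
# Hypothetical illustration of a change that breaks a task: an argument the
# operator does not accept, which makes the worker refuse to execute it.
from airflow.operators.bash import BashOperator

broken = BashOperator(
    task_id="example_task",
    bash_command="echo hello",
    not_a_real_parameter=True,  # invalid argument -> the task fails to load on the worker
)
```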
So the worker was refusing to execute it because I was passing an invalid argument to the task. The problem is that the worker doesn’t notify the scheduler/webserver (or update the task status to running) that the file is wrong (no alert of a broken DAG was being shown on the Airflow home page).
After updating the task parameter and clearing the task, it ran successfully.
P.S.: This is probably not the same problem the OP is having, but it is related to tasks stuck in scheduled state.
@ephraimbuddy I found the root cause for my problem, and a way to reproduce it. Keep in mind my stack (Airflow + KubernetesExecutor), as this issue has been watered down by many different stacks and situations, ending with the same symptoms.
Steps to reproduce:
Solution:
I will unsubscribe from this issue.
I have not encountered this issue again (Airflow 2.1.2). But the following circumstances made this issue pop up again:
How do I prevent this issue?
I simply make sure the DAG code is 100% clean and loads both in the scheduler and the tasks that start up (using KubernetesExecutor).
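One way to enforce that (my own suggestion, not something prescribed in this thread) is a small test that fails whenever any file in the DAG bundle has import errors:

```python
# Suggested guard (not from this thread): fail CI if any DAG file cannot be
# imported, so a broken DAG never reaches the scheduler or the worker pods.
from airflow.models import DagBag


def test_no_dag_import_errors():
    dag_bag = DagBag(include_examples=False)
    assert dag_bag.import_errors == {}, f"Broken DAG files: {dag_bag.import_errors}"
```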
How do I recover from this issue?
First, I fix the issue that prevents the DAG code from loading. Then I restart the scheduler.
@kaxil we’re still experiencing this issue in version 2.1.1 even after tuning parallelism and pool size to 1024.

Thanks a lot @ashb. This command allowed our team to investigate the issue better and finally helped me figure the problem out. Expect a PR from me this week 😃
Airflow 2.0.2+e494306fb01f3a026e7e2832ca94902e96b526fa (MWAA on AWS)
This happens to us a LOT: a DAG will be running, task instances will be marked as “queued”, but nothing gets moved to “running”.
When this happened today (the first time today), I was able to track down the following error in the scheduler logs:
At some point after the scheduler had that exception, I tried to clear the state of the queued task instances to get them to run. That resulted in the following logs:
This corresponds to this section of code:
My conclusion is that when the scheduler experienced that error, it entered a pathological state: it was running but had bad state in memory. Specifically, the queued task instances were in the queued_tasks or running in-memory cache, and thus any attempts to re-queue those tasks would fail as long as that scheduler process was running, because the tasks would appear to already have been queued and/or running.

Both caches use the TaskInstanceKey, which is made up of dag_id (which we can’t change), task_id (which we can’t change), execution_date (nope, can’t change), and try_number (🎉 we can change this!!).

So to work around this, I created a utility DAG that will find all task instances in a “queued” or “None” state and increment the try_number field. The DAG runs as a single PythonOperator:
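(The original DAG code did not survive the formatting here; the following is only a sketch of what such a utility could look like, with a hypothetical dag_id and conf keys, not the author’s exact code.)

```python
# Sketch of the described workaround (hypothetical names, not the original code):
# bump try_number on task instances stuck in "queued"/None so their
# TaskInstanceKey no longer matches the scheduler's stale in-memory entries.
from datetime import datetime

from sqlalchemy import or_

from airflow import DAG
from airflow.models import TaskInstance
from airflow.operators.python import PythonOperator
from airflow.utils import timezone
from airflow.utils.session import provide_session
from airflow.utils.state import State


@provide_session
def bump_try_number(dag_run=None, session=None, **_):
    conf = dag_run.conf or {}  # e.g. {"dag_id": "my_dag", "execution_date": "2021-09-01T00:00:00+00:00"}
    query = session.query(TaskInstance).filter(
        or_(TaskInstance.state == State.QUEUED, TaskInstance.state.is_(None))
    )
    if conf.get("dag_id"):
        query = query.filter(TaskInstance.dag_id == conf["dag_id"])
    if conf.get("execution_date"):
        query = query.filter(TaskInstance.execution_date == timezone.parse(conf["execution_date"]))
    for ti in query.all():
        ti.try_number += 1  # new TaskInstanceKey, so the stale caches no longer match
    session.commit()


with DAG(
    dag_id="bump_stuck_task_try_number",  # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="bump_try_number", python_callable=bump_try_number)
```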
To use the DAG, trigger a DAG run with a dag_id and execution_date in the run configuration, like the conf shown in the sketch above.

Moments after I shipped this DAG, another DAG got stuck, and I had a chance to see if this utility DAG worked – it did! 😅
Couple of thoughts: regarding the in-memory caches (queued_tasks and running), I don’t have a complete mental model of how the executor and the scheduler are distinct or not. My understanding is that this is scheduler code, but with the scheduler being high-availability (we’re running 3 schedulers), in-memory state seems like something we should be using very judiciously and be flushing and rehydrating from the database regularly.

Thanks @easontm, it’s a bug. I have a PR trying to address that, see https://github.com/apache/airflow/pull/17945
@ephraimbuddy no, I have to revisit this issue again.
I can now confirm that tasks are stuck in “queued” even with the aforementioned issue solved. I no longer have any missing DAGs, but the issue persists.
I now have a complete Airflow installation with 100+ DAGs that gets stuck every day or so. This is a critical issue. Please adjust your priority accordingly. Many other issues pop up with the same symptoms!
@kaxil I have checked, min_file_process_interval is set to 30, however the problem is still there for me.

@SalmonTimo I have a pretty high CPU utilisation (60%), albeit the scheduler settings are default. But why? Does this matter?
––
Same issue, new day: I have Airflow running, the scheduler running, but the whole cluster has 103 scheduled tasks and 3 queued tasks, and nothing is running at all. I highly doubt that min_file_process_interval is the root of the problem. I suggest somebody mark this issue with a higher priority; I do not think that “regularly restarting the scheduler” is a reasonable solution.

–
What we need here is some factual inspection of the Python process. I am no Python expert, however I am proficient and know my way around other VMs (Erlang, Ruby).
Following that stack trace idea, I just learned that Python cannot dump a process (https://stackoverflow.com/a/141826/128351), unfortunately, otherwise I would have provided you with such a process dump of my running “scheduler”.
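For what it’s worth, one possible diagnostic (an assumption on my side, not something established in this thread) is the standard library’s faulthandler: if you can wrap the scheduler entrypoint, a signal will dump every thread’s stack to stderr, which is close to the kind of process dump mentioned above.

```python
# Possible diagnostic (a suggestion, not a confirmed fix): register faulthandler
# in whatever script launches the scheduler, then `kill -USR1 <scheduler pid>`
# prints every thread's stack trace to stderr.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)
```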
I am very happy to provide you with some facts about my stalled scheduler, if you tell me how you would debug such an issue.
What I currently have:
- AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL is set to 30
- AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL is set to 10
What I find striking is the message INFO - Not scheduling since there are 0 open slots in pool mssql_dwh. That is a pool configured for max 3 slots, yet not a single task is running. I fear the bug is that the scheduler might be losing track of running tasks on Kubernetes. Bluntly, I guess there is a bug in the components:

@haninp - this might be (and likely is, because MWAA, which plays a role here, has no 2.2.4 support yet) a completely different issue. It’s not helpful to say “I also have a similar problem” without specifying details and logs.
As a “workaround” (or diagnosis) I suggest you follow the FAQ here: https://airflow.apache.org/docs/apache-airflow/stable/faq.html?highlight=faq#why-is-task-not-getting-scheduled and double-check whether your problem is one of those configuration issues explained there.
If you find you still have a problem, then I invite you to describe it in detail in a separate issue (if it is easily reproducible) or a GitHub Discussion (if you have a problem but are unsure how to reproduce it). Providing as many details as possible, such as your deployment details, logs and circumstances, is crucial to being able to help you. Just stating “I also have this problem” helps no one (including yourself, because you might think you have delegated the problem and it will be solved, but in fact it might be a completely different problem).
I am facing the same issue in airflow-2.2.2 with kubernetes Executor.
Thanks @ephraimbuddy! That wasn’t it, but I did find something that may be issue-worthy. I added some custom logging to scheduler_job._start_queued_dagruns() and noticed that the contents of dag_runs was the same 20 DAGruns in every loop (the default count from config max_dagruns_per_loop_to_schedule) – the single DAG associated with these DAGruns is not the one pictured. The DAGruns in question show up first when ordering by the following code from dagrun.next_dagruns_to_examine():
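(The quoted snippet did not survive here; the selection it refers to works roughly like the paraphrase below, which is not the exact Airflow source.)

```python
# Rough paraphrase of the DagRun selection described above (not the exact
# Airflow source): queued runs are examined oldest-scheduling-decision-first,
# capped per loop, so the same 20 stuck runs can keep shadowing all the others.
from sqlalchemy import nulls_first

from airflow.configuration import conf
from airflow.models import DagRun
from airflow.utils.session import provide_session
from airflow.utils.state import State


@provide_session
def next_queued_dagruns(session=None):
    return (
        session.query(DagRun)
        .filter(DagRun.state == State.QUEUED)
        .order_by(nulls_first(DagRun.last_scheduling_decision), DagRun.execution_date)
        .limit(conf.getint("scheduler", "max_dagruns_per_loop_to_schedule"))
        .all()
    )
```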
This DAG is set to max_active_runs=1, so all 20 examined queued DAGruns do not change state (because there is another one running already). The problem arises because the _start_queued_dagruns() function (AFAIK) doesn’t update last_scheduling_decision, so every time the query is run to get the next DAGruns, the same ones appear (and they continue to not be scheduled if the currently active DAGrun for that DAG takes a long time – and it continues so long as there are more than 20 DAGruns queued).

I think the last_scheduling_decision column needs to be updated somewhere here:

I was able to get around this issue for now by simply increasing the number of DAGruns handled per loop (and telling my users not to queue so many), but perhaps it should still be addressed.
I’m on Airflow 2.1.0 deployed to ECS Fargate – seeing similar issues where there are 128 slots, a small number used, but queued tasks that aren’t ever run. Perhaps relatedly, I also see a task in a none state which I’m not able to track down from the UI. Cluster resources seem fine, using less than 50% CPU and memory.

I also saw an error from Sentry around this time:
We are facing the same issue with the KubernetesExecutor. K8s: 1.19.3. We use the Helm chart from https://airflow-helm.github.io/charts (8.0.6), which refers to Airflow version 2.0.1.
Running on 2.0.2 now, will keep you posted.
@lukas-at-harren – Can you check the Airflow Webserver -> Admin -> Pools and then, in the row with your pool (mssql_dwh), check the Used slots? If you click on the number in Used slots, it should take you to the TaskInstance page showing the currently “running” task instances in that pool.

It is possible that they are not actually running but somehow got into that state in the DB. If you see 3 entries over here, please mark those tasks as success or failed; that should clear your pool.
@mattvonrocketstein I can confirm the issue is still present with autoscaling disabled.
In our case, we dynamically create DAGs, so the MWAA team’s first suggestion was to reduce the load on the scheduler by increasing the DAG refresh interval. It seems to help: we see fewer errors in the logs and tasks get stuck less often, but it didn’t resolve the issue.
Now we are waiting for the second round of suggestions.
Hi All! With release 2.2.5 the scheduling issues have gone away for me. I guess one of the following resolved issues did it.
I am still using mostly SubDags instead of TaskGroups, since the latter makes the tree view incomprehensible. If you have a similar setup, then give 2.2.5 release a try!
Thanks for pointing me in the right direction, @potiuk. We’re planning to continue with our investigation when some resources free up to continue the migration.
Okay, I didn’t know that. In that case, I’m confused why the try_number increased with commit() but didn’t without it. 🤷

@cdibble what’s your setting for AIRFLOW__KUBERNETES__WORKER_PODS_CREATION_BATCH_SIZE? Since your task instances get to the queued state, it’s not a scheduler issue but your executor. Since you’re on Kubernetes, I suggest you increase the above configuration if you haven’t done so.

Sure @val2k. I updated my previous comment to include the full DAG as well as a description of how to provide the DAG run config.
I ran into the same issue; the scheduler’s log:

Stuck task: dwd.dwd_ti_lgc_project. Restarting the scheduler, webserver and executor does not do anything. After clearing the state of the task, the executor does not receive the task.

Version: v2.0.1, Git Version: .release:2.0.1+beb8af5ac6c438c29e2c186145115fb1334a3735
Couple questions to help you further:
@hafid-d / @trucnguyenlam - have you tried separating DAGs into separate pools? The more pools are available, the fewer “lanes” you’ll have that are piled up. Just beware that in doing this you are also somewhat breaking the parallelism - but it does organize your DAGs much better over time!
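A minimal sketch of what that looks like in a DAG file (hypothetical names; the pool itself has to exist already, e.g. created under Admin -> Pools):

```python
# Minimal sketch (hypothetical names): pin a DAG's tasks to their own pool so
# a backlog in one DAG cannot occupy every slot in default_pool.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dwh_load",                 # hypothetical
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    BashOperator(
        task_id="extract",
        bash_command="echo extracting",
        pool="dwh_pool",               # must already exist (Admin -> Pools)
        pool_slots=1,                  # slots this task occupies in that pool
    )
```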
@Vbubblery A fix was included in 2.1.1, but it is not yet clear if that fixes all the cases of this behaviour or not.
@Jorricks Amazing! Shout (here, with a direct ping, or on Airflow slack) if I can help at all.
I have resolved this issue for my environment! I’m not sure if this is the same “bug” as others, or a different issue with similar symptoms. But here we go
In my Airflow Docker image, the entrypoint is just a bootstrap script that accepts webserver or scheduler as arguments and executes the appropriate command. bootstrap.sh:

Previous to #12766, the KubernetesExecutor fed the airflow tasks run (or airflow run in older versions) into the command section of the pod YAML. This works fine for my setup – the command just overrides my Docker’s ENTRYPOINT, the pod executes its given command and terminates on completion. However, this change moved the airflow tasks run issuance to the args section of the YAML.

These new args do not match either webserver or scheduler in bootstrap.sh, therefore the script ends cleanly and so does the pod. Here is my solution, added to the bottom of bootstrap.sh:

Rather than just allow the pod to execute whatever it’s given in args (aka just running exec "$@" without a check), I decided to at least make sure the pod is being fed an airflow tasks run command.

Any solution or workaround to fix this? This makes the scheduler very unreliable and the tasks are stuck in the queued state.

Any update on this? Facing the same issue.

Environment:
I am facing the same issue again. This happens randomly, and even cleaning up the task instances and/or restarting the container does not fix the issue.
airflow-scheduler | [2021-05-24 06:46:52,008] {scheduler_job.py:1105} INFO - Sending TaskInstanceKey(dag_id='airflow_health_checkup', task_id='send_heartbeat', execution_date=datetime.datetime(2021, 5, 24, 4, 21, tzinfo=Timezone('UTC')), try_number=1) to executor with priority 100 and queue airflow_maintenance
airflow-scheduler | [2021-05-24 06:46:52,008] {base_executor.py:85} ERROR - could not queue task TaskInstanceKey(dag_id='airflow_health_checkup', task_id='send_heartbeat', execution_date=datetime.datetime(2021, 5, 24, 4, 21, tzinfo=Timezone('UTC')), try_number=1)
airflow-scheduler | [2021-05-24 06:46:52,008] {scheduler_job.py:1105} INFO - Sending TaskInstanceKey(dag_id='airflow_health_checkup', task_id='send_heartbeat', execution_date=datetime.datetime(2021, 5, 24, 4, 24, tzinfo=Timezone('UTC')), try_number=1) to executor with priority 100 and queue airflow_maintenance
airflow-scheduler | [2021-05-24 06:46:52,009] {base_executor.py:85} ERROR - could not queue task TaskInstanceKey(dag_id='airflow_health_checkup', task_id='send_heartbeat', execution_date=datetime.datetime(2021, 5, 24, 4, 24, tzinfo=Timezone('UTC')), try_number=1)
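For context, that “could not queue task” error is the executor refusing a key it believes it already has; a paraphrase of the guard (not the exact Airflow source):

```python
# Paraphrase of the guard behind "could not queue task" (not the exact source):
# a key already present in queued_tasks or running is rejected, so a stale
# in-memory entry keeps blocking re-queuing until that scheduler restarts.
def queue_command(executor, key, command, priority=1, queue=None):
    if key not in executor.queued_tasks and key not in executor.running:
        executor.queued_tasks[key] = (command, priority, queue, None)
    else:
        executor.log.error("could not queue task %s", key)
```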
The only solution that works is manually deleting the dag… which isn’t a feasible option. This should be a very high priority as it breaks the scheduling.
In my case (Airflow 2.0.2), tasks seem to get stuck in the queued state if they happen to run during a deployment of Airflow (we are using ecs deploy to deploy) 🤔

As many others here, we’re also experiencing 0 slots and tasks scheduled but never run, on Airflow 1.10.14 on Kubernetes. Restarting the scheduler pod did trigger the run for those stuck tasks.
We’ve updated to Airflow 2.0.2 and, after one DAG run, the problem did not resurface. None of the 2,000 or so tasks that are executed by the DAG ended up in a zombie state. We’re going to keep watch over this for the next few days, but the initial results look promising.
@ashb The setup works like this (I am @overbryd, just a different account.)
This scenario opens up the following edge cases/problems:
I can observe the same problem with version 2.0.2:
My current remedy:
My desired solution:
When a DAG/task goes missing while it is queued, it should end up in a failed state.
@ephraimbuddy ok. I have now set AIRFLOW__KUBERNETES__DELETE_WORKER_PODS to False. Let me see if it triggers again.

@ephraimbuddy no, I do not have that issue.
When I try to observe it closely, it always goes like this:
- Tasks fill up the mssql_dwh pool. All of them have the state queued.
- Nothing is running. Nothing is started. The scheduler does not start anything new, because the pool has 0 available slots.
- Some tasks are in queued state. Meanwhile, Kubernetes starts the pods.
- Some tasks stay in queued state. Those are NOT running in Kubernetes.

@kaxil So I think there must be some kind of race condition between the scheduler and the Kubernetes pod startup. Some tasks finish really quickly (successfully so) and the scheduler KEEPS them in
queued
state.

I’m facing the same issue as OP and unfortunately what @renanleme said does not apply to my situation. I read a lot about the
AIRFLOW__SCHEDULER__RUN_DURATION
that @jonathonbattista mentioned to be the solution, but first of all, this config does not appear to be implemented in airflow 2.X and second, as the Scheduler documentation says:I wouldn’t mind restarting the scheduler, but it is not clear for me the reason of the hanging queued tasks. In my environment, it appears to be very random.