airflow: scheduler gets stuck without a trace
Apache Airflow version:
Kubernetes version (if you are using kubernetes) (use kubectl version):
Environment:
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release):
- Kernel (e.g. uname -a):
- Install tools:
- Others:
What happened:
The scheduler gets stuck without a trace or error. When this happens, the CPU usage of the scheduler service is at 100%. No jobs get submitted and everything comes to a halt. It looks like it goes into some kind of infinite loop. The only way I could make it run again is by manually restarting the scheduler service, but after running some tasks it gets stuck again. I’ve tried with both Celery and Local executors but the same issue occurs. I am using the -n 3 parameter while starting the scheduler.
Scheduler configs:
job_heartbeat_sec = 5
scheduler_heartbeat_sec = 5
executor = LocalExecutor
parallelism = 32
Please help. I would be happy to provide any other information needed.
What you expected to happen:
How to reproduce it:
Anything else we need to know:
Moved here from https://issues.apache.org/jira/browse/AIRFLOW-401
About this issue
- State: closed
- Created 4 years ago
- Reactions: 7
- Comments: 71 (43 by maintainers)
We just saw this on 2.0.1 when we added a largish number of new DAGs (we’re adding around 6000 DAGs total, but this seems to lock up when about 200 try to be scheduled at once).
Here are py-spy stack traces from our scheduler:
What I think is happening is that the pipe between the DagFileProcessorAgent and the DagFileProcessorManager is full, and this is causing the scheduler to deadlock.

From what I can see, the DagFileProcessorAgent only pulls data off the pipe in its heartbeat and wait_until_finished functions (https://github.com/apache/airflow/blob/beb8af5ac6c438c29e2c186145115fb1334a3735/airflow/utils/dag_processing.py#L374), and the SchedulerJob is responsible for calling its heartbeat function each scheduler loop (https://github.com/apache/airflow/blob/beb8af5ac6c438c29e2c186145115fb1334a3735/airflow/jobs/scheduler_job.py#L1388). However, the SchedulerJob is blocked from calling heartbeat because it is blocked forever trying to send data to that same full pipe as part of _send_dag_callbacks_to_processor in the _do_scheduling function, causing a deadlock.
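As an aside, here is a minimal sketch of that failure mode using plain multiprocessing primitives (not Airflow's actual classes): a Connection.send() blocks once the OS pipe buffer fills and nothing drains the other end, so two processes that each wait on the other never make progress.

```python
import multiprocessing as mp
import time


def slow_reader(conn):
    # Stands in for a DagFileProcessorManager that is too busy to drain its pipe.
    time.sleep(5)
    conn.recv()  # once the reader finally drains, the blocked sender resumes


if __name__ == "__main__":
    parent_conn, child_conn = mp.Pipe()
    mp.Process(target=slow_reader, args=(child_conn,)).start()

    payload = b"x" * (1 << 20)  # larger than a typical OS pipe buffer (~64 KiB)
    start = time.monotonic()
    parent_conn.send(payload)   # blocks until the other side reads
    print(f"send() was blocked for {time.monotonic() - start:.1f}s")
```

If slow_reader never calls recv() (the situation hypothesized above for the scheduler loop), send() never returns, which is the deadlock described.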
+1 on this issue.
Airflow 2.0.1
CeleryExecutor.
~7000 DAGs; seems to happen under load (when a bunch of DAGs all kick off at midnight).
py-spy dump --pid 132 --locals
py-spy dump --pid 134 --locals
I just wanted to share that the User-Community Airflow Helm Chart now has a mitigation for this issue that will automatically restart the scheduler if no tasks are created within some threshold time.
It’s called the scheduler “Task Creation Check”, but it’s not enabled by default because the “threshold” must be longer than your shortest DAG schedule_interval, which we don’t know unless the user tells us.
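As an illustration only, such a check could look roughly like the sketch below. This is a hypothetical probe, not the chart's actual implementation; the imports, the session handling, and the use of start_date as a proxy for "a task was created recently" are all assumptions.

```python
# Hypothetical liveness check: exit non-zero if no task instance has shown
# recent activity, so a supervisor (k8s liveness probe, systemd, cron watchdog)
# can restart a stuck scheduler.
import sys
from datetime import datetime, timedelta, timezone

from airflow.models import TaskInstance
from airflow.settings import Session

# Must be longer than your shortest DAG schedule_interval, as noted above.
THRESHOLD = timedelta(minutes=30)

session = Session()
latest = (
    session.query(TaskInstance.start_date)
    .filter(TaskInstance.start_date.isnot(None))
    .order_by(TaskInstance.start_date.desc())
    .limit(1)
    .scalar()
)
session.close()

if latest is None or datetime.now(timezone.utc) - latest > THRESHOLD:
    sys.exit(1)  # no recent task activity: have the supervisor restart the scheduler
sys.exit(0)
```

Wired into a Kubernetes livenessProbe or a cron-driven watchdog, a non-zero exit would trigger the automatic scheduler restart described above.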
I’ve got a fix for the case reported by @MatthewRBruce (for 2.0.1) coming in 2.0.2.
We had the same issue with Airflow on Google Cloud until we increased the setting AIRFLOW__CORE__SQL_ALCHEMY_MAX_OVERFLOW. The default value was 5; after changing it to 60, our Airflow server started to perform very well, including on complex DAGs with around 1000 tasks each. Any scale-up was hitting the database's concurrent-connection limit, so the scheduler was not able to keep up.
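For anyone checking the same knobs, a small sketch to confirm the override is actually visible to the process. It assumes Airflow 1.10/2.0–2.2, where these keys live under [core]; newer releases moved them to [database].

```python
# Minimal sanity check that an AIRFLOW__CORE__SQL_ALCHEMY_MAX_OVERFLOW override
# (or an airflow.cfg edit) is picked up by whatever imports Airflow.
from airflow.configuration import conf

print("sql_alchemy_pool_size    =", conf.getint("core", "sql_alchemy_pool_size"))
print("sql_alchemy_max_overflow =", conf.getint("core", "sql_alchemy_max_overflow"))
```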
Hi @ashb I would like to report that we’ve been seeing something similar to this issue in Airflow 2.0.2 recently.
We are using airflow 2.0.2 with a single airflow-scheduler + a few airflow-worker using CeleryExecutor and postgres backend running dozens of dags each with hundreds to a few thousand tasks. Python version is 3.8.7.
Here’s what we saw: airflow-scheduler sometimes stops heartbeating and stops scheduling any tasks. This seems to happen at random times, about once or twice a week. When this happens, the last line in the scheduler log shows the following, i.e. it stopped writing out any log after receiving signal 15. I ran strace on the airflow scheduler process. It did not capture any other process sending it signal 15, so most likely the signal 15 was sent by the scheduler to itself.

When the scheduler was in this state, there was also a child airflow scheduler process shown in ps which was spawned by the main airflow scheduler process. I forgot py-spy dump, but I did use py-spy top to look at the child airflow scheduler process. This was what I saw: it seems to be stuck somewhere in celery_executor.py::_send_tasks_to_celery. This sounds similar to what @milton0825 reported previously, although he mentioned he was using Airflow 1.10.8.

When I manually SIGTERM'd the child airflow scheduler process, it died, and immediately the main airflow scheduler started to heartbeat and schedule tasks again like nothing ever happened. So I suspect the airflow scheduler got stuck somewhere while spawning a child process. But I still don’t understand how it produced "Exiting gracefully upon receiving signal 15" in the log.

One other observation was that when the airflow scheduler was in the stuck state, the DagFileProcessor processes started by the airflow scheduler were still running. I could see them writing out logs to dag_processor_manager.log.
We have a change that correlates with fixing the issue @sylr mentioned here (causation is not yet verified), where many scheduler main processes spawn at the same time and then disappear (which caused an OOM error for us).
The change was the following:
And we run MAX_THREADS=10. Is it possible that reaching pool_size or pool_size + max_overflow caused processes to back up or spawn oddly? Before this change, the scheduler was getting stuck 1-2 times per day; we have not seen the issue in the 6 days since the change.
We do not see the issue of many processes spawning at once anymore like this:
Can anyone else verify this change helps or not?
I’ve anecdotally noticed that once I dropped the -n 25 argument from our scheduler invocation, I haven’t seen this issue come up since. Before, it would crop up every ~10 days or so, and it’s been about a month now without incident.

@sterling-jackson Your use case might be fixed by 2.1.0 (currently in RC stage).
Commenting to track this thread.
@dlamblin why was this closed? My read of the most recent comment was that it described a different issue from the one reported here, and @ashb was pointing out that that bug was fixed, not necessarily the underlying one this issue tracks.
If this issue is also fixed by that pull request, then great. I just want to be sure this issue isn’t being closed by mistake because this is still a huge issue for us.
Hi @ashb @davidcaron, I managed to reproduce this issue consistently with a small reproducing example and traced the problem down to reset_signals() in celery_executor.py. Since it feels like a different issue from the original one reported here, I opened a new issue: https://github.com/apache/airflow/issues/15938

Have been struggling with this since we migrated our lower environments to 2.0. The scheduler works for a couple of days, then stops scheduling, but doesn’t trigger any heartbeat errors. Not sure it’s helpful, but our PROD instance is running smoothly with Airflow 1.10.9 and Python 3.7.8.
Restarting the scheduler brings it back to life after Docker restarts the service.
I have a theory of why the Airflow scheduler may get stuck at CeleryExecutor._send_tasks_to_celery (our scheduler got stuck in a different place 😃).

The size of the return value from send_task_to_executor may be huge, as the traceback is included in case of failure, and it looks like it is a known bug [1] in CPython that huge output can cause a deadlock in multiprocessing.Pool. For example, the following code easily deadlocks on Python 3.6.3:
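A sketch of the pattern being described, i.e. workers returning very large results through a multiprocessing.Pool. This is illustrative only, not the original snippet; the worker count and result size are arbitrary.

```python
import multiprocessing


def produce_huge_result(_):
    # Stands in for send_task_to_executor returning a result that embeds a
    # large traceback: the whole object has to be piped back to the parent.
    return "x" * (1 << 30)  # ~1 GiB per result; adjust to taste (and to RAM)


if __name__ == "__main__":
    with multiprocessing.Pool(2) as pool:
        # Per the report above (and [1]), sufficiently large outputs can hang
        # the pool's result handling on affected Python versions (3.6.3 is cited).
        results = pool.map(produce_huge_result, range(4))
        print(len(results))
```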
[1] https://bugs.python.org/issue35267
any confirmation yet on whether this is fixed in 2.0?
@michaelosthege This behaviour should be fixed in 2.0.0 (now in beta stages) thanks to https://github.com/apache/airflow/pull/10956
If it helps, the last time this happened with debug logging on, the scheduler logged this (ending.log) before freezing forever and never heartbeating again.
All system vitals (disk, CPU, and memory) are absolutely fine whenever the hang happens for us. When the process is stuck, it doesn’t respond to any kill signals except 9 and 11.

I ran strace on the stuck process, and it shows the following:
futex(0x14d9390, FUTEX_WAIT_PRIVATE, 0, NULL
Then I killed the process with kill -11 and loaded the core in gdb; below is the stack trace.

We’ve experienced this issue twice now, with the CPU spiking to 100% and failing to schedule any tasks afterwards. Our config is
Airflow 1.10.6 - Celery - Postgres
running on AWS ECS. I went back into our CloudWatch logs and noticed the following collection of logs at the time the bug occurred:

This would point to the scheduler running out of memory, likely due to log buildup (I added log cleanup tasks retroactively). I’m not sure if this is related to the scheduler getting stuck, though.
We are also facing the same issue with the
Airflow 1.10.4 - Mysql - Celery
combination. We found that the Schedule - DagFileProcessorManager gets hung and we have to kill it to get the scheduler back.