airflow: scheduler gets stuck without a trace
Apache Airflow version:
Kubernetes version (if you are using kubernetes) (use kubectl version):
Environment:
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release):
- Kernel (e.g. uname -a):
- Install tools:
- Others:
What happened:
The scheduler gets stuck without a trace or error. When this happens, the CPU usage of the scheduler service is at 100%. No jobs get submitted and everything comes to a halt. It looks like it goes into some kind of infinite loop. The only way I could make it run again is by manually restarting the scheduler service, but after running some tasks it gets stuck again. I’ve tried with both Celery and Local executors but the same issue occurs. I am using the -n 3 parameter while starting the scheduler.
Scheduler configs:
job_heartbeat_sec = 5
scheduler_heartbeat_sec = 5
executor = LocalExecutor
parallelism = 32
Please help. I would be happy to provide any other information needed.
What you expected to happen:
How to reproduce it:
Anything else we need to know:
Moved here from https://issues.apache.org/jira/browse/AIRFLOW-401
About this issue
- State: closed
- Created 4 years ago
- Reactions: 7
- Comments: 71 (43 by maintainers)
We just saw this on 2.0.1 when we added a largish number of new DAGs (we’re adding around 6000 DAGs total, but this seems to lock up when about 200 try to be scheduled at once).
Here are py-spy stack traces from our scheduler:
What I think is happening is that the pipe between the DagFileProcessorAgent and the DagFileProcessorManager is full, and this is causing the scheduler to deadlock.

From what I can see, the DagFileProcessorAgent only pulls data off the pipe in its heartbeat and wait_until_finished functions (https://github.com/apache/airflow/blob/beb8af5ac6c438c29e2c186145115fb1334a3735/airflow/utils/dag_processing.py#L374), and the SchedulerJob is responsible for calling its heartbeat function each scheduler loop (https://github.com/apache/airflow/blob/beb8af5ac6c438c29e2c186145115fb1334a3735/airflow/jobs/scheduler_job.py#L1388). However, the SchedulerJob is blocked from calling heartbeat because it is blocked forever trying to send data to that same full pipe as part of _send_dag_callbacks_to_processor in the _do_scheduling function, causing a deadlock.
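As an aside, here is a minimal sketch of that failure mode using plain multiprocessing primitives (not Airflow's actual classes): a Connection.send() blocks once the OS pipe buffer fills and nothing drains the other end, so two processes that each wait on the other never make progress.

```python
import multiprocessing as mp
import time


def slow_reader(conn):
    # Stands in for a DagFileProcessorManager that is too busy to drain its pipe.
    time.sleep(5)
    conn.recv()  # once the reader finally drains, the blocked sender resumes


if __name__ == "__main__":
    parent_conn, child_conn = mp.Pipe()
    mp.Process(target=slow_reader, args=(child_conn,)).start()

    payload = b"x" * (1 << 20)  # larger than a typical OS pipe buffer (~64 KiB)
    start = time.monotonic()
    parent_conn.send(payload)   # blocks until the other side reads
    print(f"send() was blocked for {time.monotonic() - start:.1f}s")
```

If slow_reader never calls recv() (the situation hypothesized above for the scheduler loop), send() never returns, which is the deadlock described.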
+1 on this issue.
Airflow 2.0.1
CeleryExecutor.
~7000 DAGs; seems to happen under load (when a bunch of DAGs all kick off at midnight).
py-spy dump --pid 132 --locals
py-spy dump --pid 134 --locals
I just wanted to share that the User-Community Airflow Helm Chart now has a mitigation for this issue that will automatically restart the scheduler if no tasks are created within some threshold time.
It’s called the scheduler “Task Creation Check”, but it’s not enabled by default because the “threshold” must be longer than your shortest DAG schedule_interval, which we don’t know unless the user tells us.
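As an illustration only, such a check could look roughly like the sketch below. This is a hypothetical probe, not the chart's actual implementation; the imports, the session handling, and the use of start_date as a proxy for "a task was created recently" are all assumptions.

```python
# Hypothetical liveness check: exit non-zero if no task instance has shown
# recent activity, so a supervisor (k8s liveness probe, systemd, cron watchdog)
# can restart a stuck scheduler.
import sys
from datetime import datetime, timedelta, timezone

from airflow.models import TaskInstance
from airflow.settings import Session

# Must be longer than your shortest DAG schedule_interval, as noted above.
THRESHOLD = timedelta(minutes=30)

session = Session()
latest = (
    session.query(TaskInstance.start_date)
    .filter(TaskInstance.start_date.isnot(None))
    .order_by(TaskInstance.start_date.desc())
    .limit(1)
    .scalar()
)
session.close()

if latest is None or datetime.now(timezone.utc) - latest > THRESHOLD:
    sys.exit(1)  # no recent task activity: have the supervisor restart the scheduler
sys.exit(0)
```

Wired into a Kubernetes livenessProbe or a cron-driven watchdog, a non-zero exit would trigger the automatic scheduler restart described above.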
I’ve got a fix for the case reported by @MatthewRBruce (for 2.0.1) coming in 2.0.2.
We had the same issue with Airflow on Google Cloud until we increased the setting AIRFLOW__CORE__SQL_ALCHEMY_MAX_OVERFLOW. The default value was 5; after changing it to 60, our Airflow server started to perform very well, including on complex DAGs with around 1000 tasks each. Any scale-up was hitting the database's concurrent-connection limit, so the scheduler was not able to keep up.
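For anyone checking the same knobs, a small sketch to confirm the override is actually visible to the process. It assumes Airflow 1.10/2.0–2.2, where these keys live under [core]; newer releases moved them to [database].

```python
# Minimal sanity check that an AIRFLOW__CORE__SQL_ALCHEMY_MAX_OVERFLOW override
# (or an airflow.cfg edit) is picked up by whatever imports Airflow.
from airflow.configuration import conf

print("sql_alchemy_pool_size    =", conf.getint("core", "sql_alchemy_pool_size"))
print("sql_alchemy_max_overflow =", conf.getint("core", "sql_alchemy_max_overflow"))
```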
Hi @ashb I would like to report that we’ve been seeing something similar to this issue in Airflow 2.0.2 recently.
We are using airflow 2.0.2 with a single airflow-scheduler + a few airflow-worker using CeleryExecutor and postgres backend running dozens of dags each with hundreds to a few thousand tasks. Python version is 3.8.7.
Here’s what we saw: airflow-scheduler sometimes stops heartbeating and stops scheduling any tasks. This seems to happen at random times, about once or twice a week. When this happens, the last line in the scheduler log shows the following, i.e. it stopped writing out any log after receiving signal 15. I ran strace on the airflow scheduler process. It did not capture any other process sending it signal 15, so most likely the signal 15 was sent by the scheduler to itself.

When the scheduler was in this state, there was also a child airflow scheduler process shown in ps which was spawned by the main airflow scheduler process. I forgot py-spy dump, but I did use py-spy top to look at the child airflow scheduler process. This was what I saw: it seems to be stuck somewhere in celery_executor.py::_send_tasks_to_celery. This sounds similar to what @milton0825 reported previously, although he mentioned he was using Airflow 1.10.8.

When I manually SIGTERM'd the child airflow scheduler process, it died, and immediately the main airflow scheduler started to heartbeat and schedule tasks again like nothing ever happened. So I suspect the airflow scheduler got stuck somewhere while spawning a child process. But I still don’t understand how it produced "Exiting gracefully upon receiving signal 15" in the log.

One other observation was that when the airflow scheduler was in the stuck state, the DagFileProcessor processes started by the airflow scheduler were still running. I could see them writing out logs to dag_processor_manager.log.
We have a change that correlates with fixing the issue @sylr mentioned here (causation is not yet verified), where many scheduler main processes spawn at the same time and then disappear (which caused an OOM error for us).
The change was the following:
And we run MAX_THREADS=10. Is it possible that reaching pool_size or pool_size + max_overflow caused processes to back up or spawn oddly? Before this change, the scheduler was getting stuck 1-2 times per day; we have not seen the issue in the 6 days since the change.
We do not see the issue of many processes spawning at once anymore like this:
Can anyone else verify this change helps or not?
I’ve anecdotally noticed that once I dropped the -n 25 argument from our scheduler invocation, I haven’t seen this issue come up since. Before, it would crop up every ~10 days or so, and it’s been about a month now without incident.

@sterling-jackson Your use case might be fixed by 2.1.0 (currently in RC stage).
Commenting to track this thread.
@dlamblin why was this closed? My read of the most recent comment was that it described a different issue from the one reported here, and @ashb was pointing out that that bug was fixed, not necessarily the underlying one this issue tracks.
If this issue is also fixed by that pull request, then great. I just want to be sure this issue isn’t being closed by mistake because this is still a huge issue for us.
Hi @ashb @davidcaron, I managed to reproduce this issue consistently with a small reproducing example and traced the problem down to reset_signals() in celery_executor.py. Since it feels like a different issue from the original one reported here, I opened a new issue: https://github.com/apache/airflow/issues/15938

Have been struggling with this since we migrated our lower environments to 2.0. The scheduler works for a couple of days, then stops scheduling, but doesn’t trigger any heartbeat errors. Not sure it’s helpful, but our PROD instance is running smoothly with Airflow 1.10.9 and Python 3.7.8.
Restarting the scheduler brings it back to life after Docker restarts the service.
I have a theory of why the Airflow scheduler may get stuck at CeleryExecutor._send_tasks_to_celery (our scheduler got stuck in a different place 😃).

The size of the return value from send_task_to_executor may be huge, as the traceback is included in case of failure, and it looks like it is a known bug [1] in CPython that huge output can cause a deadlock in multiprocessing.Pool. For example, the following code easily deadlocks on Python 3.6.3:
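A sketch of the pattern being described, i.e. workers returning very large results through a multiprocessing.Pool. This is illustrative only, not the original snippet; the worker count and result size are arbitrary.

```python
import multiprocessing


def produce_huge_result(_):
    # Stands in for send_task_to_executor returning a result that embeds a
    # large traceback: the whole object has to be piped back to the parent.
    return "x" * (1 << 30)  # ~1 GiB per result; adjust to taste (and to RAM)


if __name__ == "__main__":
    with multiprocessing.Pool(2) as pool:
        # Per the report above (and [1]), sufficiently large outputs can hang
        # the pool's result handling on affected Python versions (3.6.3 is cited).
        results = pool.map(produce_huge_result, range(4))
        print(len(results))
```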
[1] https://bugs.python.org/issue35267
any confirmation yet on whether this is fixed in 2.0?
@michaelosthege This behaviour should be fixed in 2.0.0 (now in beta stages) thanks to https://github.com/apache/airflow/pull/10956
If it helps, the last time this happened with debug logging on, the scheduler logged this (ending.log) before freezing forever and never heartbeating again.
All system vitals (disk, CPU, and memory) are absolutely fine whenever the hang happens for us. When the process is stuck, it doesn’t respond to any kill signals except 9 and 11.

I ran strace on the stuck process, and it shows the following:
futex(0x14d9390, FUTEX_WAIT_PRIVATE, 0, NULL
Then I killed the process with kill -11 and loaded the core in gdb; below is the stack trace.

We’ve experienced this issue twice now, with the CPU spiking to 100% and failing to schedule any tasks afterwards. Our config is
Airflow 1.10.6 - Celery - Postgres
running on AWS ECS. I went back into our CloudWatch logs and noticed the following collection of logs at the time the bug occurred:

This would point to the scheduler running out of memory, likely due to log buildup (I added log cleanup tasks retroactively). I’m not sure if this is related to the scheduler getting stuck, though.
We are also facing the same issue with the
Airflow 1.10.4 - Mysql - Celery
combination. We found that the Schedule - DagFileProcessorManager gets hung and we have to kill it to get the scheduler back.