airflow: The scheduler does not appear to be running. Last heartbeat was received X minutes ago.
Apache Airflow version
2.1.4
Operating System
Linux / Ubuntu Server
Versions of Apache Airflow Providers
apache-airflow-providers-ftp==2.0.1
apache-airflow-providers-http==2.0.1
apache-airflow-providers-imap==2.0.1
apache-airflow-providers-postgres==2.3.0
Deployment
Virtualenv installation
Deployment details
Airflow v2.1.4, Postgres 14, LocalExecutor. Installed with virtualenv / Ansible: https://github.com/idealista/airflow-role
What happened
I run a single BashOperator for a long-running task: we initially have to spend 8+ hours downloading data from the rate-limited data source API, then download small increments each day.
We're only using 3% CPU and 2 GB of memory (out of 64 GB), but the scheduler is unable to run any other simple task at the same time.
Currently only the long task is running and everything else is queued, even though we have resources to spare.
What you expected to happen
I expect my long-running BashOperator task to run, and for Airflow to still have the resources to run other tasks alongside it without them getting blocked like this.
How to reproduce
I run a command with BashOperator (I use it because I have Python, C, and Rust programs being scheduled by Airflow):
bash_command='umask 002 && cd /opt/my_code/ && /opt/my_code/venv/bin/python -m path.to.my.python.namespace'
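For context, here is a minimal sketch of the kind of DAG involved (the dag_id, schedule, and start_date are made up for illustration; the bash_command is the one above):

```python
# Minimal sketch of the setup described above; the dag_id, schedule,
# and start_date are hypothetical, the bash_command is the real one.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="long_download",  # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Long-running download: 8+ hours on the first run against the
    # rate-limited API, small daily increments afterwards.
    download = BashOperator(
        task_id="download",
        bash_command=(
            "umask 002 && cd /opt/my_code/ && "
            "/opt/my_code/venv/bin/python -m path.to.my.python.namespace"
        ),
    )
```

With the LocalExecutor, a single task like this should occupy only one of the parallelism slots; the scheduler should keep heartbeating and queueing other tasks while it runs.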
Configuration:
airflow_executor: LocalExecutor
airflow_database_conn: 'postgresql+psycopg2://airflow:airflow_pass@localhost:5432/airflow'
airflow_database_engine_encoding: utf-8
airflow_database_engine_collation_for_ids:
airflow_database_pool_enabled: True
airflow_database_pool_size: 3
airflow_database_max_overflow: 10
airflow_database_pool_recycle: 2000
airflow_database_pool_pre_ping: True
airflow_database_schema:
airflow_database_connect_args:
airflow_parallelism: 10
airflow_dag_concurrency: 7
airflow_dags_are_paused_at_creation: True
airflow_max_active_runs_per_dag: 16
airflow_load_examples: False
airflow_load_default_connections: False
airflow_plugins_folder: "{{ airflow_app_home }}/plugins"
# [operators]
airflow_operator_default_owner: airflow
airflow_operator_default_cpus: 1
airflow_operator_default_ram: 512
airflow_operator_default_disk: 512
airflow_operator_default_gpus: 0
airflow_default_queue: default
airflow_allow_illegal_arguments: False
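One detail worth checking in the pool settings above (a sketch, under the assumption that these Ansible variables map to the sql_alchemy_* options in airflow.cfg, which in 2.1.x still live under [core]): SQLAlchemy caps each process at pool_size + max_overflow connections, i.e. 3 + 10 = 13 here, while the LocalExecutor with parallelism: 10 runs up to 10 task processes that each open their own connections.

```python
# Sketch: read back the effective pool settings the scheduler sees.
# Assumes Airflow 2.1.x, where the SQLAlchemy options are in [core]
# (they moved to the [database] section in later releases).
from airflow.configuration import conf

pool_size = conf.getint("core", "sql_alchemy_pool_size")        # 3 in the config above
max_overflow = conf.getint("core", "sql_alchemy_max_overflow")  # 10 in the config above
parallelism = conf.getint("core", "parallelism")                # 10 in the config above

# SQLAlchemy allows at most pool_size + max_overflow connections per process.
print(f"max DB connections per process: {pool_size + max_overflow}")
print(f"executor task slots: {parallelism}")
```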
Anything else
This occurs consistently every time, also on 2.1.2.
The other tasks are stuck in the queued state.
When the long-running task finishes, the other tasks resume normally. But I expect to be able to do some parallel execution with the LocalExecutor.
I haven’t tried using pgbouncer.
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project’s Code of Conduct
@t4n1o - (but also others in this thread) a lot of the issues here stem from the difficulty of seeing a clear reproduction of the problem - we have to rely on users like you to spend their time analyzing the settings, configuration, and deployment they have, and to gather enough clues that those who know the code best can make intelligent guesses about what is wrong, even without "full and clear reproduction steps". It's not that developers' time is more valuable than yours: Airflow is developed by more than 2,100 contributors - often people like you. The source code is available and anyone can take a look, and while some people know more about certain parts of the code, you get the software for free, and you cannot "expect" those people to spend a lot of time trying to figure out what's wrong if they have no clear reproduction steps and not enough clues.
Doing that analysis and gathering enough evidence about what you observe is the best thing you can do to pay back for the free software - and it may give those looking here enough clues to fix the problem or to direct you to a solution.
So absolutely - if looking at the code and analyzing it is something you feel you can offer the community as your "pay back", that is fantastic.
The scheduler Job is here: https://github.com/apache/airflow/blob/main/airflow/jobs/scheduler_job.py.
But we can give you more than that: https://www.youtube.com/watch?v=DYC4-xElccE - this is a video from Airflow Summit 2021 where Ash explains how the scheduler works, i.e. what the design assumptions were. It can guide you in understanding what the scheduler job does.
Also, before you dive deep, it might well be that your set of DAGs and the way you structure them is the problem, and you can simply follow our guidelines on fine-tuning your scheduler performance.
So if you want to help - absolutely: do more analysis, look at our guidelines, and, if you feel like it, dive deep into how the scheduler works and look at the code. All of that might be a great way to get more clues and evidence, and even if you aren't able to fix it in a PR yourself, you can give others enough clues to find the root cause and implement a solution.
Since the assumption seems to be that this is an isolated issue, I just want to report that we are seeing it too. After successfully migrating to v2 several months ago, we started seeing this message 3 days ago. It's getting worse every day (it started by reporting "heartbeat not detected in 2 minutes" before resolving; now it's up to 5 minutes).
This is what Airflow is designed for. I think you are just using it wrongly (or have misconfigured it). It is supposed to handle that case perfectly, and it works this way for thousands of users. So it's your configuration/setup/way of using it that is wrong.
No, this is not the case (unless you use the SequentialExecutor, which is only supposed to be used for debugging). Airflow is designed to run multiple parallel tasks at a time. You likely have some problem in your Airflow installation/configuration.
Questions:
- Do you actually have a scheduler running at all? Does it have continuous access to the DB?
- Are you absolutely sure you are not using the SequentialExecutor? What does your airflow info say - can you pastebin its output? (airflow info has a built-in flag, --file-io, to upload it.) Please also make sure you run it in exactly the environment your scheduler runs in. Most likely you run your scheduler with a different configuration than your webserver, and that causes the problem.
- Are you sure you are using Postgres and not SQLite? What does your airflow info say?
- Where is your Python code (non-DAG)? Did you .airflowignore non-DAG files in Airflow's DAG folder?
- Can you upgrade to Airflow 2.2.3 (the latest released version)? It has built-in warnings in the UI in case you use the SequentialExecutor/SQLite.
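For the executor and database questions above, a quick sanity check is to read the effective configuration from the exact environment the scheduler service runs in (a minimal sketch, assuming Airflow 2.1.x where these options live under [core]; run it as the same user and virtualenv as the scheduler):

```python
# Sketch: confirm the executor and metadata DB the scheduler actually uses.
from airflow.configuration import conf

print(conf.get("core", "executor"))          # LocalExecutor, not SequentialExecutor
print(conf.get("core", "sql_alchemy_conn"))  # postgresql+psycopg2://..., not sqlite:///...
```

If these values differ between the webserver's and the scheduler's environments, that mismatch alone can explain the symptoms described above.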
Can you change your DAGs to:
I just ran it on 2.2.3 and I was able to successfully start even 5 parallel runs with no problems with the scheduler.
LocalExecutor:
In this screenshot the scheduler is running 4 copies of the same process/task, because max_active_runs was not set (I subsequently set it to 1, because that's the behaviour I want). As stated above, the issue is that Airflow will not run other DAGs and the scheduler is not responding. (Strangely, the scheduler is apparently also quite happy to run 1 task from 1 DAG in 4 parallel processes.)
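For reference, limiting a DAG to a single concurrent run is a one-line change (a sketch; the dag_id is hypothetical):

```python
# Sketch: cap the DAG at one active run at a time.
from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="long_download",  # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    max_active_runs=1,  # prevents several runs of the same DAG in parallel
) as dag:
    ...
```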
I suspect it's some value in the configuration, or not enough database connections.