airflow: The scheduler does not appear to be running. Last heartbeat was received X minutes ago.

Apache Airflow version

2.1.4

Operating System

Linux / Ubuntu Server

Versions of Apache Airflow Providers

apache-airflow-providers-ftp==2.0.1
apache-airflow-providers-http==2.0.1
apache-airflow-providers-imap==2.0.1
apache-airflow-providers-postgres==2.3.0

Deployment

Virtualenv installation

Deployment details

Airflow v2.1.4, Postgres 14, LocalExecutor. Installed with virtualenv / Ansible - https://github.com/idealista/airflow-role

What happened

I run a single BashOperator for a long-running task: the initial download from the rate-limited data source API takes 8+ hours, after which we download small increments each day.

We’re only using 3% CPU and 2 GB of memory (out of 64 GB) but the scheduler is unable to run any other simple task at the same time.

Currently only the long task is running and everything else is queued, even though we have more resources (see screenshot).

What you expected to happen

I expect my long-running BashOperator task to run, but for Airflow to still have the resources to run other tasks without getting blocked like this.

How to reproduce

I run a command with BashOperator (I use it because I have Python, C, and Rust programs being scheduled by Airflow):

bash_command='umask 002 && cd /opt/my_code/ && /opt/my_code/venv/bin/python -m path.to.my.python.namespace'
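For context, a minimal DAG wrapping this command looks roughly like the sketch below (the DAG id, start date and schedule are placeholders picked for illustration, not the exact production DAG):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="long_running_download",        # placeholder name
    start_date=datetime(2021, 1, 1),       # placeholder date
    schedule_interval=timedelta(days=1),
    catchup=False,
    max_active_runs=1,                     # only one copy of the long task at a time
) as dag:
    download = BashOperator(
        task_id="download_from_rate_limited_api",
        bash_command=(
            "umask 002 && cd /opt/my_code/ && "
            "/opt/my_code/venv/bin/python -m path.to.my.python.namespace"
        ),
    )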

Configuration:

airflow_executor: LocalExecutor
airflow_database_conn: 'postgresql+psycopg2://airflow:airflow_pass@localhost:5432/airflow'
airflow_database_engine_encoding: utf-8
airflow_database_engine_collation_for_ids:
airflow_database_pool_enabled: True
airflow_database_pool_size: 3
airflow_database_max_overflow: 10
airflow_database_pool_recycle: 2000
airflow_database_pool_pre_ping: True
airflow_database_schema:
airflow_database_connect_args:
airflow_parallelism: 10
airflow_dag_concurrency: 7
airflow_dags_are_paused_at_creation: True
airflow_max_active_runs_per_dag: 16
airflow_load_examples: False
airflow_load_default_connections: False
airflow_plugins_folder: "{{ airflow_app_home }}/plugins"

# [operators]
airflow_operator_default_owner: airflow
airflow_operator_default_cpus: 1
airflow_operator_default_ram: 512
airflow_operator_default_disk: 512
airflow_operator_default_gpus: 0
airflow_default_queue: default
airflow_allow_illegal_arguments: False
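These are variables of the idealista Ansible role, which templates them into airflow.cfg. Assuming the role maps the names one-to-one onto the standard Airflow 2.1 options, the database and concurrency settings above correspond roughly to this (a sketch, not the full generated file):

[core]
executor = LocalExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow_pass@localhost:5432/airflow
sql_alchemy_pool_enabled = True
sql_alchemy_pool_size = 3
sql_alchemy_max_overflow = 10
sql_alchemy_pool_recycle = 2000
sql_alchemy_pool_pre_ping = True
parallelism = 10
dag_concurrency = 7
max_active_runs_per_dag = 16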

Anything else

This occurs consistently every time; it also happened on 2.1.2.

The other tasks have this state (see screenshot).

When the long-running task finishes, the other tasks resume normally. But I expect to be able to do some parallel execution with the LocalExecutor.

I haven’t tried using pgbouncer.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 52 (29 by maintainers)

Most upvoted comments

@t4n1o - (but also others in this thread) a lot of the issues here come from the difficulty of seeing a clear reproduction of the problem - we have to rely on users like you to spend their time analyzing the settings, configuration and deployment they have, to perform enough analysis and gather enough clues that those who know the code best can make intelligent guesses about what is wrong even if there are no "full and clear reproduction steps". It's not that development time is valuable. Airflow is developed by more than 2,100 contributors - often people like you. The source code is available and anyone can take a look, and while some people know some parts of the code better than others, you get the software for free, and you cannot "expect" those people to spend a lot of time trying to figure out what's wrong if they have no clear reproduction steps and not enough clues.

Doing the analysis and gathering enough evidence about what you observe is the best thing you can do to pay back for the free software - and it may give those looking here enough clues to fix the problem or to point you towards a solution.

So absolutely - if you feel like looking at the code and analysing it is something you can offer the community as your “pay back” - this is fantastic.

The scheduler Job is here: https://github.com/apache/airflow/blob/main/airflow/jobs/scheduler_job.py.

But we can give you more than that: https://www.youtube.com/watch?v=DYC4-xElccE - this is a video from Airflow Summit 2021 where Ash explains how the scheduler works, i.e. what the design assumptions were. It can guide you in understanding what the scheduler job does.

Also, before you dive deep, it might well be that your set of DAGs and the way you structure them is the problem, and you can simply follow our guidelines on Fine-tuning your scheduler performance.
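For example, a few of the scheduler knobs of this kind live in the [scheduler] section of airflow.cfg; the values below are the illustrative defaults to experiment with, not recommendations:

[scheduler]
# how often (in seconds) each DAG file is re-parsed
min_file_process_interval = 30
# number of processes used to parse DAG files
parsing_processes = 2
# how often (in seconds) the scheduler heartbeats to the metadata DB
scheduler_heartbeat_sec = 5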

So if you want to help - absolutely: do more analysis, look at our guidelines, and if you feel like it, dive deep into how the scheduler works and look at the code. All of that can be a great way to gather more clues and evidence, and even if you aren't able to fix it in a PR yourself, you can give others enough clues to find the root cause and implement a solution.

Since the assumption seems to be that this is an isolated issue, I just want to report that we are seeing it too. After successfully migrating to v2 several months ago, we started seeing this message 3 days ago. It's getting worse every day (it started by reporting "heartbeat not detected in 2 minutes" before resolving itself; now it's up to 5 minutes).

I'm not sure how Airflow is intended to be used, but sometimes people find other use cases for a tool than the ones it was designed for.

We run a task that can take a few hours to collect all the historical data and process it. And then we want the task to run once per day.

This is what Airflow is designed for. I think you are just using it wrongly (or have misconfigured it). It is supposed to handle that case perfectly (and it works this way for thousands of users), so it's your configuration/setup/way of using it that is wrong.

It appears, from my side, that the Airflow webserver UI can't contact the scheduler while the long task is running, and other DAGs can't be run. Perhaps the scheduler wants my code to yield control back to it frequently (once per day of data, for example), but I prefer to let my own code manage the date ranges, because that's where the unit tests are, and all the heavy lifting is in Rust anyway.

No. This is not the case (unless you use the SequentialExecutor, which is only supposed to be used for debugging). Airflow is designed to run multiple parallel tasks at a time. You likely have some problem in your Airflow installation/configuration.

Questions:

  1. Do you actually have the scheduler running at all? Does it have continuous access to the DB?

  2. Are you absolutely sure you are not using the SequentialExecutor? What does your airflow info say - can you pastebin its output? (airflow has a built-in flag to send it to a paste service; see the example commands at the end of this comment.) Please also make sure you run it in exactly the same environment your scheduler runs in. Most likely you run your scheduler with a different configuration than your webserver, and that causes the problem.

  3. Are you sure you are using Postgres and not SQLite? What does your airflow info say?

  4. Where is your Python code (non-DAG)? Did you .airflowignore the non-DAG files in Airflow's DAG folder?

  5. Can you upgrade to Airflow 2.2.3 (latest released)? It has built-in warnings in the UI in case you use the SequentialExecutor/SQLite.

  6. Can you change your DAGs to:

from datetime import timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

default_args = {
    "owner": "t4n1o",
    "depends_on_past": False,
    "email": ["myemail@gmail.com"],
    "email_on_failure": True,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=2),
    # note: to cap concurrent runs, set max_active_runs=1 on the DAG itself;
    # this key is not an operator argument
    "max_active_runs_per_dag": 1,
}

with DAG(
    "Bitmex_Archives_Mirror",
    default_args=default_args,
    description="Mirror the archives from public.bitmex.com",
    schedule_interval=timedelta(days=1),
    start_date=days_ago(2),
    tags=["raw price history"],
    catchup=False,
) as dag:

    t1 = BashOperator(
        task_id="download_csv_gz_archives",
        bash_command="sleep 1000",
    )

    t2 = BashOperator(
        task_id="process_archives_into_daily_csv_files",
        depends_on_past=False,
        bash_command="sleep 1000",
        retries=3,
    )

    t1 >> t2

I just ran it on 2.2.3 and I was able to successfully start even 5 parallel runs, with no problems with the scheduler.

(Screenshot from 2022-01-02 19-57-18)
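For reference, the checks from questions 1-3 can be done from the command line; this is a sketch assuming a standard Airflow 2.x installation, and it should be run under the same user/environment as the scheduler:

# which executor and metadata DB the current environment actually uses
airflow config get-value core executor
airflow config get-value core sql_alchemy_conn

# full environment report; --anonymize hides paths and credentials,
# --file-io uploads the report and prints a shareable link
airflow info --anonymize
airflow info --file-io

# is a scheduler heartbeating against the metadata DB?
airflow jobs check --job-type SchedulerJob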

LocalExecutor: (screenshot)

In this screenshot the scheduler is running 4 copies of the same task, because max_active_runs was not set (I subsequently set it to 1, because that's the behaviour I want).

As stated above, the issue is that Airflow will not run other DAGs and the scheduler is not responding. (Strangely, the scheduler is apparently quite happy to run 1 task from 1 DAG in 4 parallel processes.)

I suspect some value in the configuration is too low, or that there are not enough database connections.
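If it is connections, a quick way to compare what Postgres allows against what Airflow asks for (a sketch using the user/database from the config above; every process that opens a SQLAlchemy engine - scheduler, webserver workers, and each LocalExecutor task process - can use up to sql_alchemy_pool_size + sql_alchemy_max_overflow connections, i.e. 3 + 10 = 13 here):

# Postgres side: maximum allowed connections vs. connections currently
# held against the airflow database
psql -U airflow -d airflow -c "SHOW max_connections;"
psql -U airflow -d airflow -c "SELECT count(*) FROM pg_stat_activity WHERE datname = 'airflow';"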