airflow: Scheduler encounters database update error, then gets stuck in endless loop, yet still shows as healthy

Apache Airflow version

Other Airflow 2 version (please specify below)

What happened

Airflow version: v2.3.3+astro.2.

We’ve encountered this issue twice this year. Something causes the Scheduler to get stuck in an endless loop, yet it shows as healthy even though nothing is being processed.

The most recent occurrence was this week, when the Scheduler encountered a database update error:

sqlalchemy.orm.exc.StaleDataError: UPDATE statement on table 'dag' expected to update 1 row(s); 0 were matched.
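
For readers unfamiliar with this error: it is raised when an UPDATE issued for a single object affects a different number of rows than expected. The following is a minimal sketch of that row-count check using only the standard library. `StaleRowError`, `update_exactly_one`, and the two-column `dag` table are hypothetical stand-ins for illustration, not SQLAlchemy's or Airflow's actual code or schema.

```python
import sqlite3


class StaleRowError(RuntimeError):
    """Hypothetical stand-in for sqlalchemy.orm.exc.StaleDataError."""


def update_exactly_one(conn: sqlite3.Connection, dag_id: str, is_paused: int) -> None:
    """Update one row and fail loudly if the row count is not exactly 1."""
    cur = conn.execute(
        "UPDATE dag SET is_paused = ? WHERE dag_id = ?", (is_paused, dag_id)
    )
    if cur.rowcount != 1:
        raise StaleRowError(
            "UPDATE statement on table 'dag' expected to update 1 row(s); "
            f"{cur.rowcount} were matched."
        )


# Simplified, assumed schema; the real Airflow table has many more columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dag (dag_id TEXT, is_paused INTEGER)")
conn.execute("INSERT INTO dag VALUES ('example', 0)")

update_exactly_one(conn, "example", 1)  # matches one row, no error

try:
    update_exactly_one(conn, "missing", 1)  # matches no rows
except StaleRowError as exc:
    message = str(exc)
```

The point of the sketch is only that the error signals a mismatch between the in-memory state and the rows actually present in the table; it does not by itself explain why the Scheduler kept looping afterwards.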

As a result, the Scheduler logs show that it’s stuck in an endless loop: the same messages repeat over and over.

[Screenshot: Screen_Shot_2022-10-24_at_10_25_21_AM]

Because of this, nothing runs, and the entire Airflow instance is considered down.

In this particular case, the issue was resolved by manually deleting the duplicate row in the dag table.
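
A rough sketch of how such duplicates could be located before deleting anything. The issue does not say which column was duplicated, so this assumes the duplication is on `dag_id`; the in-memory table, its columns, and `find_duplicate_dag_ids` are hypothetical and only mimic the shape of the real table. Against a real metadata database, take a backup first.

```python
import sqlite3
from typing import List, Tuple


def find_duplicate_dag_ids(conn: sqlite3.Connection) -> List[Tuple[str, int]]:
    """Return (dag_id, row_count) for every dag_id that appears more than once."""
    rows = conn.execute(
        "SELECT dag_id, COUNT(*) FROM dag "
        "GROUP BY dag_id HAVING COUNT(*) > 1 "
        "ORDER BY dag_id"
    ).fetchall()
    return [(dag_id, count) for dag_id, count in rows]


# Assumed, simplified stand-in for the metadata table (no uniqueness constraint,
# so that a duplicate can exist for the demonstration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dag (dag_id TEXT, fileloc TEXT)")
conn.executemany(
    "INSERT INTO dag VALUES (?, ?)",
    [
        ("etl_daily", "/dags/etl.py"),
        ("etl_daily", "/dags/etl_copy.py"),
        ("reports", "/dags/reports.py"),
    ],
)

duplicates = find_duplicate_dag_ids(conn)
```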

When we encountered a similar case earlier in the year, the root cause was different and required a different solution (upsizing the workers).

What you think should happen instead

The Scheduler should not crash or get stuck in an endless loop. It should handle exceptional cases gracefully. It should not be reported as healthy if it is crashing continuously or stuck in an endless loop.
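
To illustrate what "healthy" could mean here, below is a minimal sketch of a health check based on actual progress rather than on the process merely being alive. `ProgressWatchdog` and its parameters are hypothetical names for illustration; this is not how Airflow computes its health status.

```python
import time
from typing import Callable


class ProgressWatchdog:
    """Reports unhealthy when no real work has completed for too long."""

    def __init__(
        self,
        max_idle_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_idle = max_idle_seconds
        self._clock = clock
        self._last_progress = clock()

    def record_progress(self) -> None:
        """Call whenever a unit of real work (e.g. a scheduled run) completes."""
        self._last_progress = self._clock()

    def is_healthy(self) -> bool:
        return (self._clock() - self._last_progress) <= self._max_idle


# Demonstration with a fake clock so the example runs instantly.
now = [0.0]
watchdog = ProgressWatchdog(max_idle_seconds=60, clock=lambda: now[0])

now[0] = 30.0
healthy_while_working = watchdog.is_healthy()   # idle for 30s, within the limit

now[0] = 120.0
healthy_after_stall = watchdog.is_healthy()     # idle for 120s, over the limit

watchdog.record_progress()
healthy_after_recovery = watchdog.is_healthy()  # progress was just recorded
```

A loop that keeps logging the same messages would never call `record_progress`, so a check like this would eventually flip to unhealthy even though the process is still running. A real deployment would also need to tell "stuck" apart from "legitimately idle".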

Some strategies for handling this, off the top of my head:

  • The Scheduler should have stricter error handling: when an error is encountered, it should log the error and continue on to the next scheduled DAG.
  • The Scheduler itself should not be allowed to get into an endless loop.
    • Check the logs for repeating message patterns?
    • Keep a count to make sure DAGs are being run?
    • Use logarithmic or exponential backoff when retrying?
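
The first and last ideas above can be sketched together: isolate failures per DAG so one bad record does not block the rest, and retry with capped exponential backoff instead of looping immediately. This is a sketch under stated assumptions, not the Scheduler's actual loop; `process_all`, `backoff_delays`, and the `process_one` callback are hypothetical.

```python
import logging
import time
from typing import Callable, Iterable, List, Sequence

log = logging.getLogger("scheduler_sketch")


def backoff_delays(
    base: float = 1.0, factor: float = 2.0, cap: float = 60.0, retries: int = 4
) -> List[float]:
    """Delays between attempts: base, base*factor, ... each capped at `cap`."""
    return [min(cap, base * factor ** i) for i in range(retries)]


def process_all(
    dag_ids: Iterable[str],
    process_one: Callable[[str], None],
    delays: Sequence[float],
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Process every DAG; return the ids that still failed after all retries."""
    failed = []
    for dag_id in dag_ids:
        # One initial attempt plus one retry per configured delay.
        for attempt in range(len(delays) + 1):
            try:
                process_one(dag_id)
                break
            except Exception:
                log.exception("Processing %s failed (attempt %d)", dag_id, attempt + 1)
                if attempt < len(delays):
                    sleep(delays[attempt])
        else:
            # Retries exhausted: record the failure and move on to the next DAG.
            failed.append(dag_id)
    return failed


def flaky(dag_id: str) -> None:
    """Simulated worker: one DAG always fails, the others succeed."""
    if dag_id == "broken":
        raise RuntimeError("simulated database error")


slept: List[float] = []  # record delays instead of actually sleeping
failed = process_all(["a", "broken", "b"], flaky, backoff_delays(retries=3), sleep=slept.append)
```

In this run the failing DAG is retried three times with growing delays and then skipped, while the DAGs before and after it are still processed.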

How to reproduce

Insert a duplicate row into the dag table. There are probably other ways. Earlier in the year we encountered the same issue when the Workers were not properly upsized.

Operating System

Debian GNU/Linux 11 (bullseye)

Versions of Apache Airflow Providers

apache-airflow-providers-http==2.0.1
apache-airflow-providers-jdbc==2.0.1
simple-salesforce==1.1.0
csvvalidator==1.2
pandas==1.3.5
pre-commit
pylint==2.15
pytest==6.2.5
pyspark==3.3.0
apache-airflow-providers-google==6.4.0

Deployment

Astronomer

Deployment details

Astronomer

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

About this issue

  • State: open
  • Created 2 years ago
  • Comments: 18 (10 by maintainers)

Most upvoted comments

My db is mysql5.7,

That would be enough. https://github.com/apache/airflow/pull/28689 fixes this only for DB backends that support SELECT FOR UPDATE; unfortunately, MySQL 5.7 does not support this.

Potentially someone could find a solution for MySQL 5.7 before its EOL, but to avoid waiting days or months for that, I would recommend upgrading to MySQL 8.0 now. Or, if you can afford to lose all history and recreate everything from scratch, you might choose Postgres as the backend.


And just in case, I would like to remind anyone who finds this issue that MariaDB is not a supported database backend for Airflow.