airflow: Scheduler encounters database update error, then gets stuck in endless loop, yet still shows as healthy
Apache Airflow version
Other Airflow 2 version (please specify below)
What happened
Airflow version: v2.3.3+astro.2
We’ve encountered this issue twice this year. Something causes the Scheduler to get stuck in an endless loop, yet it is still reported as healthy even though nothing is being processed.
The most recent occurrence was this week. The Scheduler encountered a database update error:
sqlalchemy.orm.exc.StaleDataError: UPDATE statement on table 'dag' expected to update 1 row(s); 0 were matched.
As a result, the Scheduler logs show it’s stuck in an endless loop: the same messages repeat over and over.
Because of this, nothing runs, and the entire Airflow instance is considered down.
In this particular case, the issue was resolved by manually deleting the duplicate row in the dag table.
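As a rough illustration of how such a duplicate could be located before deleting it, here is a minimal sketch. It uses an in-memory SQLite table as a stand-in for the metadata database; the simplified two-column schema, the sample DAG ids, and the helper name `find_duplicate_dag_ids` are assumptions for illustration only, not Airflow's real schema or tooling.

```python
import sqlite3


def find_duplicate_dag_ids(conn):
    """Return (dag_id, row_count) pairs for every dag_id that appears more than once."""
    cur = conn.execute(
        "SELECT dag_id, COUNT(*) AS n FROM dag "
        "GROUP BY dag_id HAVING COUNT(*) > 1 ORDER BY dag_id"
    )
    return cur.fetchall()


# Stand-in for the metadata database (assumption): a simplified `dag` table
# without any uniqueness constraint, so a duplicate row can exist.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dag (dag_id TEXT, is_paused INTEGER)")
conn.executemany(
    "INSERT INTO dag VALUES (?, ?)",
    [("etl_daily", 0), ("etl_daily", 0), ("reports", 1)],  # hypothetical DAG ids
)

duplicates = find_duplicate_dag_ids(conn)
print(duplicates)
```

Against a real deployment, the same `GROUP BY ... HAVING COUNT(*) > 1` query would be run with the database's own client, and the rows inspected before anything is deleted.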
When we encountered a similar case earlier in the year, the root cause was different and required a different solution (upsizing the workers).
What you think should happen instead
The Scheduler should not crash or get stuck in an endless loop. It should handle exceptional cases gracefully. It should not be reported as healthy if it is crashing continuously or stuck in an endless loop.
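To make the last point concrete, here is a minimal sketch of a health signal tied to actual progress rather than to the process merely being alive. The class name `ProgressHealth`, the `max_idle_seconds` parameter, and the injectable clock are all hypothetical; this is not how Airflow's health reporting is implemented.

```python
import time


class ProgressHealth:
    """Report healthy only if useful work was recorded recently (illustrative sketch)."""

    def __init__(self, max_idle_seconds, clock=time.monotonic):
        self._max_idle = max_idle_seconds
        self._clock = clock
        self._last_progress = clock()

    def record_progress(self):
        # Call this only after real work succeeds, e.g. a DAG run was scheduled,
        # not on every iteration of the loop.
        self._last_progress = self._clock()

    def is_healthy(self):
        return (self._clock() - self._last_progress) <= self._max_idle


# Demonstration with a fake clock so the example does not need to sleep.
fake_now = [0.0]
health = ProgressHealth(max_idle_seconds=60, clock=lambda: fake_now[0])

fake_now[0] = 30.0
healthy_while_working = health.is_healthy()

fake_now[0] = 120.0  # no progress recorded for 120 seconds
healthy_when_stuck = health.is_healthy()
```

The design choice here is that a loop repeating the same failing step never calls `record_progress()`, so it eventually reports unhealthy even though the process keeps running.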
Some strategies for handling this, off the top of my head:
- The Scheduler should have stricter error handling: when an error is encountered, it should log the error and continue on to the next scheduled DAG.
- The Scheduler itself should not be allowed to get into an endless loop.
- Check the logs for repeating message patterns?
- Keep a count to make sure DAGs are being run?
- Use logarithmic or exponential backoff when retrying?
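Two of the strategies above, logging an error and moving on to the next item, and exponential backoff between retries, could look roughly like the following sketch. The function names `process_all` and `backoff_delay`, the default `base` and `cap` values, and the simulated failure are assumptions for illustration; none of this is taken from the Airflow code base.

```python
import logging

log = logging.getLogger("scheduler_sketch")


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-based): doubles each time, up to `cap`."""
    return min(cap, base * (2 ** attempt))


def process_all(items, handler):
    """Run `handler` on each item; log any failure and continue with the next item."""
    failed = []
    for item in items:
        try:
            handler(item)
        except Exception:
            # One bad item must not stop the rest from being processed.
            log.exception("Processing %s failed; continuing with the next item", item)
            failed.append(item)
    return failed


def flaky_handler(dag_id):
    # Hypothetical handler that fails for one specific item.
    if dag_id == "broken":
        raise RuntimeError("simulated database update error")


failed = process_all(["a", "broken", "c"], flaky_handler)
delays = [backoff_delay(n) for n in range(8)]
```

The cap keeps the delay bounded, so a persistently failing item is retried at a steady, low rate instead of either flooding the logs or being forgotten.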
How to reproduce
Insert a duplicate row into the dag table. There are probably other ways. Earlier in the year we encountered the same issue when the workers were not properly upsized.
Operating System
Debian GNU/Linux 11 (bullseye)
Versions of Apache Airflow Providers
apache-airflow-providers-http==2.0.1
apache-airflow-providers-jdbc==2.0.1
simple-salesforce==1.1.0
csvvalidator==1.2
pandas==1.3.5
pre-commit
pylint==2.15
pytest==6.2.5
pyspark==3.3.0
apache-airflow-providers-google==6.4.0
Deployment
Astronomer
Deployment details
Astronomer
Anything else
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project’s Code of Conduct
About this issue
- State: open
- Created 2 years ago
- Comments: 18 (10 by maintainers)
That would be enough. https://github.com/apache/airflow/pull/28689 is a fix only for DB backends that support SELECT FOR UPDATE; unfortunately, MySQL 5.7 does not support this.
Potentially someone could find a solution for MySQL 5.7 before its EOL, but to avoid waiting days or months for that, I would recommend upgrading to MySQL 8.0 now. Or, if you can afford to lose all history and recreate everything from scratch, you might choose Postgres as the backend.
And just in case, I would like to remind anyone who finds this issue that MariaDB is not a supported database backend for Airflow.