airflow: Scheduler fails to schedule DagRuns due to persistent DAG record lock
Apache Airflow version
Other Airflow 2 version (please specify below)
If “Other Airflow 2 version” selected, which one?
2.7.3
What happened?
We are encountering an issue in our Apache Airflow setup where, after a few successful DagRuns, the scheduler stops scheduling new runs. The scheduler logs indicate:
{scheduler_job_runner.py:1426} INFO - DAG dag-test scheduling was skipped, probably because the DAG record was locked.
This problem persists despite running a single scheduler pod. Notably, reverting the changes from PR #31414 resolves this issue. A similar issue has been discussed on Stack Overflow: "Airflow Kubernetes Executor Scheduling Skipped Because Dag Record Was Locked".
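For context, the message appears to come from the scheduler's row-locking logic: DAG records are selected with row locks and skip_locked, so rows already locked by another transaction are silently skipped rather than waited on. Below is a minimal SQLAlchemy sketch of that general SELECT ... FOR UPDATE SKIP LOCKED pattern; the model, table name, and connection string are illustrative placeholders, not Airflow's actual code.

```python
# Illustrative sketch (not Airflow's actual code) of the
# SELECT ... FOR UPDATE SKIP LOCKED pattern behind this kind of skip:
# rows already locked by another transaction are left out of the result,
# so the caller treats the corresponding DAG as "locked" and moves on.
from sqlalchemy import Column, String, create_engine, select
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class DagRecord(Base):  # hypothetical stand-in for the real DagModel
    __tablename__ = "dag_record"
    dag_id = Column(String, primary_key=True)


# Placeholder DSN for a PostgreSQL metadata database.
engine = create_engine("postgresql+psycopg2://airflow:airflow@localhost/airflow")

with Session(engine) as session:
    # skip_locked=True means: do not block on rows another session has locked;
    # just return the rows that could be locked right now.
    lockable_dags = session.scalars(
        select(DagRecord).with_for_update(skip_locked=True)
    ).all()
    for dag in lockable_dags:
        ...  # schedule runs only for the DAGs whose rows we managed to lock
```

If another session holds the lock on a DAG row and never releases it, that row never shows up in the result, which matches the "scheduling was skipped" behavior we observe.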
What you think should happen instead?
The scheduler should consistently schedule new DagRuns as per DAG configurations, without interruption due to DAG record locks.
How to reproduce
Run Airflow 2.7.3 on Kubernetes; HA is not required. Trigger multiple DagRuns (we have about 10 DAGs that run every minute) and observe scheduler behavior and logs after a few successful runs. The error shows up after a few minutes. A sketch of the kind of DAG we use follows below.
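A minimal sketch of such a DAG (dag_id, task id, and command are placeholders); we run roughly ten DAGs like this, each on a one-minute schedule:

```python
# Illustrative reproduction DAG (dag_id, task id, and command are placeholders).
# We deploy roughly ten DAGs like this, each scheduled every minute.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dag-test",
    start_date=datetime(2023, 1, 1),
    schedule="* * * * *",  # every minute
    catchup=False,
    max_active_runs=1,
) as dag:
    BashOperator(task_id="sleep", bash_command="sleep 10")
```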
Operating System
centos7
Versions of Apache Airflow Providers
apache-airflow-providers-amazon==8.10.0
apache-airflow-providers-apache-hive==6.2.0
apache-airflow-providers-apache-livy==3.6.0
apache-airflow-providers-cncf-kubernetes==7.8.0
apache-airflow-providers-common-sql==1.8.0
apache-airflow-providers-ftp==3.6.0
apache-airflow-providers-google==10.11.0
apache-airflow-providers-http==4.6.0
apache-airflow-providers-imap==3.4.0
apache-airflow-providers-papermill==3.4.0
apache-airflow-providers-postgres==5.7.1
apache-airflow-providers-presto==5.2.1
apache-airflow-providers-salesforce==5.5.0
apache-airflow-providers-snowflake==5.1.0
apache-airflow-providers-sqlite==3.5.0
apache-airflow-providers-trino==5.4.0
Deployment
Other
Deployment details
We have wrappers around the official Airflow Helm chart and Docker images.
Environment:
Airflow Version: 2.7.3
Kubernetes Version: 1.24
Executor: KubernetesExecutor
Database: PostgreSQL (metadata database)
Environment/Infrastructure: Kubernetes cluster running Airflow in Docker containers
Anything else?
Actual Behavior: The scheduler stops scheduling new runs after a few DagRuns, with log messages about the DAG record being locked.
Workaround: Restarting the scheduler pod releases the lock and allows normal scheduling to resume, but this is not viable in production. Reverting the changes in PR #31414 also resolves the issue.
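To see what is actually holding the lock before restarting, a diagnostic sketch against the PostgreSQL metadata database can help. This assumes psycopg2 and uses a placeholder DSN; dag is Airflow's DAG metadata table, and the rest of the query uses standard PostgreSQL catalog views.

```python
# Diagnostic sketch (assumes PostgreSQL and psycopg2): list sessions that
# currently hold or wait for locks on Airflow's "dag" metadata table, to see
# what keeps the DAG record locked between scheduler loops.
import psycopg2

conn = psycopg2.connect("dbname=airflow user=airflow host=localhost")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT a.pid, a.state, l.mode, l.granted, a.query_start, a.query
        FROM pg_locks l
        JOIN pg_stat_activity a ON a.pid = l.pid
        WHERE l.relation = 'dag'::regclass
        ORDER BY a.query_start
        """
    )
    for row in cur.fetchall():
        print(row)
```

In our case, restarting the scheduler pod terminates whichever session holds the lock, which is consistent with scheduling resuming afterwards.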
Questions/Request for Information:
- Under what scenarios is the lock on a DAG record typically not released?
- Are there known issues in Airflow 2.7.3, or specific configurations, that might cause the DAG record to remain locked, thereby preventing new run scheduling?
- Could the changes made in PR #31414 be related to this issue?
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project’s Code of Conduct
About this issue
- State: open
- Created 5 months ago
- Reactions: 10
- Comments: 50 (29 by maintainers)
We’ve had these issues lately and upgraded to 2.8.2 yesterday, but that didn’t solve the issue.
The message regarding DAGs being locked is gone, but the actual issue of runs not being queued persists.
Yeah I get that! S’alright. We might be in touch if we need help reproducing this.
@ashb sorry, my bad. From my perspective it wasn’t figured out but you’re right it was in fact someone in our internal Astronomer support team who closed the ticket. Sorry about that.
As some people reported above (I recommend reading the whole thread), you can try Airflow 2.8.2 - it solved the problem for some users. You can then check whether you run into a similar (but different) problem with a hugely increased database pool size, which apparently caused issues for another user. Then, ideally, @renanxx1, you should report your findings here, noting whether any of those things worked for you. That would give us more confidence that similar problems might be solved by others applying similar solutions, and it would give you, @renanxx1, a chance to contribute back by helping diagnose and fix the issue.
And if none of that works for you, providing more details - explaining your specific configuration and the circumstances - is the second-best thing you can do, not only to contribute back but also possibly to speed up diagnosis and a solution.
Note that this software is developed by volunteers who spend their own time so that you can use it absolutely free of charge, and helping to diagnose problems by providing your findings is the least you can do to give back and thank the people who decided to spend their time so that people like you and companies like yours can use the software for free.
Note as well that the software you get is free and comes with no warranties and no support promises, so any diagnosis and analysis people do here to help users with their problems is done because they voluntarily decided to spend their personal, free time (even though they could spend it with their families or for pleasure). So providing your findings and trying out the different things above is the absolute least you can do to thank them for that.
Can we count on your help, @renanxx1, rather than just a demand for a solution? That would be very useful and a great thing for the community.
Hmmm - that is an interesting note. It might lead to a hypothesis about why this happens, @ephraimbuddy @kaxil, and might help with reproduction.
Hi, we are heavily affected by this. We are on 2.7.2.
Switching off the SQLAlchemy pool fixed this problem.
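For reference, the pool is controlled by the sql_alchemy_pool_enabled option in the [database] section; a minimal sketch to check the running value, assuming a standard installation where airflow.configuration is importable:

```python
# Minimal sketch: check whether the SQLAlchemy connection pool is enabled in
# the running Airflow configuration. Disabling it (sql_alchemy_pool_enabled =
# False under [database], or AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_ENABLED=False)
# makes Airflow fall back to a NullPool, which is what this workaround amounts to.
from airflow.configuration import conf

print(conf.getboolean("database", "sql_alchemy_pool_enabled", fallback=True))
```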
We ran into this issue today and after following these steps it started working again.
I hope this can help someone!
Done https://github.com/apache/airflow/pull/37596
Agree
I’d suggest trying again. If they could not figure it out with access to the system, then I am afraid it’s not going to be any easier here, as people here cannot do any more diagnosis on your system, and everyone is trying to help in their free time, when they feel like it, so there is little chance someone will easily reproduce it just by following the description. There, at least, you have an easily reproducible environment that Astronomer has full control over - an ideal situation for running a diagnosis.
I think you should insist there. They have a lot of expertise, and a strong signal that people don’t upgrade because of this issue, combined with the fact that it happens in Astronomer’s controlled environment AND is easily reproducible there, makes it far more feasible to diagnose the problem. I might ping a few people at Astronomer to take a closer look if you help with the reproducibility case there, @ruarfff.