airflow: Scheduler fails to schedule DagRuns due to persistent DAG record lock

Apache Airflow version

Other Airflow 2 version (please specify below)

If “Other Airflow 2 version” selected, which one?

2.7.3

What happened?

We are encountering an issue in our Apache Airflow setup where, after a few successful DagRuns, the scheduler stops scheduling new runs. The scheduler logs indicate:

{scheduler_job_runner.py:1426} INFO - DAG dag-test scheduling was skipped, probably because the DAG record was locked.

This problem persists despite running a single scheduler pod. Notably, reverting the changes from PR #31414 resolves this issue. A similar issue has been discussed on Stack Overflow: Airflow Kubernetes Executor Scheduling Skipped Because Dag Record Was Locked.
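
For context, our understanding (based on the log message, not on the exact scheduler code) is that the scheduler takes a row-level lock on each DAG record and skips any row it cannot lock, which is where the "scheduling was skipped" message comes from. In plain SQLAlchemy against Postgres that pattern looks roughly like this (connection string is a placeholder):

```python
# Illustration only: the SELECT ... FOR UPDATE SKIP LOCKED pattern the log
# message suggests; this is not the actual Airflow scheduler code path.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://airflow:***@db/airflow")  # placeholder DSN

with engine.begin() as conn:
    # Rows already locked by another transaction are silently skipped, so a
    # lock that is never released means that DAG is never scheduled again.
    rows = conn.execute(
        text(
            "SELECT dag_id FROM dag "
            "WHERE is_paused = false "
            "FOR UPDATE SKIP LOCKED"
        )
    ).fetchall()
    print([row.dag_id for row in rows])
```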

What you think should happen instead?

The scheduler should consistently schedule new DagRuns as per DAG configurations, without interruption due to DAG record locks.

How to reproduce

Run Airflow 2.7.3 on Kubernetes; HA is not required. Trigger multiple DagRuns (we have about 10 DAGs that run every minute; a sketch is below). Observe the scheduler behavior and logs after a few successful runs: the error shows up after a few minutes.
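
An illustrative DAG of the kind involved (not our production code; roughly ten DAGs like this, each scheduled every minute, reproduce the behaviour for us):

```python
# Illustrative reproduction DAG: many small DAGs running every minute.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dag-test",
    start_date=datetime(2023, 1, 1),
    schedule="* * * * *",  # run every minute
    catchup=False,
    max_active_runs=1,
):
    BashOperator(task_id="sleep", bash_command="sleep 10")
```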

Operating System

centos7

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==8.10.0
apache-airflow-providers-apache-hive==6.2.0
apache-airflow-providers-apache-livy==3.6.0
apache-airflow-providers-cncf-kubernetes==7.8.0
apache-airflow-providers-common-sql==1.8.0
apache-airflow-providers-ftp==3.6.0
apache-airflow-providers-google==10.11.0
apache-airflow-providers-http==4.6.0
apache-airflow-providers-imap==3.4.0
apache-airflow-providers-papermill==3.4.0
apache-airflow-providers-postgres==5.7.1
apache-airflow-providers-presto==5.2.1
apache-airflow-providers-salesforce==5.5.0
apache-airflow-providers-snowflake==5.1.0
apache-airflow-providers-sqlite==3.5.0
apache-airflow-providers-trino==5.4.0

Deployment

Other

Deployment details

We have wrappers around the official airflow helm chart and docker images.

Environment:

Airflow Version: 2.7.3
Kubernetes Version: 1.24
Executor: KubernetesExecutor
Database: PostgreSQL (metadata database)
Environment/Infrastructure: Kubernetes cluster running Airflow in Docker containers

Anything else?

Actual Behavior: The scheduler stops scheduling new runs after a few DagRuns, with log messages about the DAG record being locked.

Workaround: Restarting the scheduler pod releases the lock and allows normal scheduling to resume, but this is not viable in production. Reverting the changes in PR #31414 also resolves the issue.

Questions/Request for Information:

  1. Under what scenarios is the lock on a DAG record typically not released?
  2. Are there known issues in Airflow 2.7.3, or specific configurations, that might cause the DAG record to remain locked, thereby preventing new run scheduling?
  3. Could the changes made in PR #31414 be related to this issue?

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

About this issue

  • Original URL
  • State: open
  • Created 5 months ago
  • Reactions: 10
  • Comments: 50 (29 by maintainers)

Most upvoted comments

We’ve had these issues lately and upgraded to 2.8.2 yesterday, but that didn’t solve the issue.

The message regarding DAGs being locked is gone, but the actual issue of runs not being queued persists.

Yeah I get that! S’alright. We might be in touch if we need help reproducing this.

@ashb sorry, my bad. From my perspective it wasn’t figured out but you’re right it was in fact someone in our internal Astronomer support team who closed the ticket. Sorry about that.

Is there any solution for this issue?

As some people reported above (I recommend reading the whole thread), you can try Airflow 2.8.2 - it solved the problem for some users. You can also check whether you have a similar but different problem with a hugely increased database pool size, which apparently caused issues for another user. Then, ideally, @renanxx1, you should report your findings here and say whether any of those things worked for you. That would give us more confidence that similar problems can be solved by applying similar solutions, and it would give you, @renanxx1, a chance to contribute back to diagnosing and fixing the issue.

And if none of that works for you, providing more details - explaining your specific configuration and the circumstances - is the second best thing you can do, not only to contribute back but also possibly to speed up diagnosis and a solution.

Note that this software is developed by volunteers who spend their own time so that you can use it for absolutely free. Helping to diagnose problems by providing your findings is the least you can do to give back and thank the people who decided to spend their time so that people like you, and companies like yours, can use the software for free.

Note as well that the software you get is free and comes with no warranties and no support promise, so any diagnosis and analysis people do here to help users with their problems is done because they voluntarily decided to help and to spend their personal, free time (even though they could spend it with their families or for pleasure). So providing your findings and trying out the different things above is the absolute least you can do to thank them for that.

Can we count on your help, @renanxx1, rather than just demanding a solution? That would be a very useful and great thing for the community.

Switching off SQL Alchemy Pool fixed this problem

Hmmm - that is an interesting note and might lead to a hypothesis about why it happens @ephraimbuddy @kaxil, and might help with reproduction

Hi, we are heavily affected by this. We are on 2.7.2.

Switching off SQL Alchemy Pool fixed this problem
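
For anyone trying the same thing: in Airflow this is the sql_alchemy_pool_enabled option (in the [database] section on recent 2.x versions, or the AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_ENABLED environment variable). At the SQLAlchemy level it roughly corresponds to using a NullPool instead of the default QueuePool; a minimal sketch of that difference, with a placeholder connection string:

```python
# Rough sketch of what disabling the metadata-DB connection pool means:
# NullPool opens and closes a connection per checkout instead of keeping
# pooled connections (and whatever state they hold) alive between uses.
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

DSN = "postgresql+psycopg2://airflow:***@db/airflow"  # placeholder DSN

pooled_engine = create_engine(DSN)  # default QueuePool behaviour
unpooled_engine = create_engine(DSN, poolclass=NullPool)  # pool disabled
```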

We ran into this issue today and after following these steps it started working again.

  1. Pause all DAGs
  2. Go to Browse -> Task Instances and delete all task instances that did not have one of the following states: upstream_failed, failed, success, removed (if I remember correctly, we only deleted task instances in the scheduled and skipped states; I am not sure how relevant deleting the skipped ones is)
  3. Go to Browse -> Dag Runs and delete all Dag Runs with a state other than failed or success (if I remember correctly, we removed runs in the running and queued states).
  4. Start DAGs again

I hope this can help someone!
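
For reference, the UI steps above roughly correspond to cleaning up the task_instance and dag_run tables in the metadata database. A sketch of that equivalent is below; the table and column names are the standard Airflow metadata schema, but treat it as illustrative only, pause the DAGs first, and back up the database before deleting anything:

```python
# Illustrative only: a rough SQL equivalent of the UI clean-up steps above.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://airflow:***@db/airflow")  # placeholder DSN

with engine.begin() as conn:
    # Step 2: remove task instances stuck in non-terminal states.
    conn.execute(text("DELETE FROM task_instance WHERE state IN ('scheduled', 'skipped')"))
    # Step 3: remove DAG runs that are neither failed nor successful.
    conn.execute(text("DELETE FROM dag_run WHERE state IN ('running', 'queued')"))
```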

I have spent some time trying to reproduce this but I haven’t been able to do so. I would like to suggest that we revert #31414, I looked at the issue it was trying to solve and I think it’s not a very serious one. cc @potiuk . If that’s Ok, I will revert and include the change in 2.8.2

Agree

Done https://github.com/apache/airflow/pull/37596

I’d suggest trying again. If they could not figure it out with access to the system, then I am afraid it’s not going to be any easier here: people here cannot do any more diagnosis on your system, and everyone here is helping in their free time, when they feel like it, so there is little chance someone will easily reproduce it just from the description. There, at least, you have an easily reproducible environment that Astronomer has full control over - an ideal situation for running a diagnosis.

I think you should insist there. They have a lot of expertise, and if they get a strong signal that people are not upgrading because of this issue - AND the fact that it happens in Astronomer’s controlled environment AND is easily reproducible there - it becomes far more feasible to diagnose the problem. I might ping a few people at Astronomer to take a closer look if you help with the reproducibility case there, @ruarfff.