airflow: Scheduler crashes with psycopg2.errors.DeadlockDetected exception
Apache Airflow version
2.2.5 (latest released)
What happened
A customer has a DAG that generates around 2,500 tasks dynamically using a task group. While running the DAG, a subset of the tasks (~1,000) run successfully with no issue, while the remaining tasks (~1,500) get “skipped” and the DAG fails. The same DAG runs successfully in Airflow v2.1.3 with the same Airflow configuration.
While investigating the Airflow processes, we found that both schedulers were restarted with the error below during the DAG execution.
[2022-04-27 20:42:44,347] {scheduler_job.py:742} ERROR - Exception when executing SchedulerJob._run_scheduler_loop
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1256, in _execute_context
self.dialect.do_executemany(
File "/usr/local/lib/python3.9/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 912, in do_executemany
cursor.executemany(statement, parameters)
psycopg2.errors.DeadlockDetected: deadlock detected
DETAIL: Process 1646244 waits for ShareLock on transaction 3915993452; blocked by process 1640692.
Process 1640692 waits for ShareLock on transaction 3915992745; blocked by process 1646244.
HINT: See server log for query details.
CONTEXT: while updating tuple (189873,4) in relation "task_instance"
This issue seems to be related to #19957
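For readers less familiar with this failure mode: the DETAIL lines above describe the classic case of two transactions taking row locks on the same rows in opposite order. Below is a minimal, self-contained sketch (a hypothetical scratch database, DSN and table, not the actual Airflow queries) that reproduces a psycopg2 DeadlockDetected error on Postgres the same way.
# Minimal sketch of an out-of-order row-locking deadlock on Postgres.
# Assumptions (not from the issue): a scratch database reachable via DSN and a
# throwaway table "t" with two rows. Airflow's real queries differ; only the
# lock ordering matters here.
import threading
import psycopg2

DSN = "dbname=scratch user=postgres password=postgres host=localhost"  # hypothetical

def setup():
    conn = psycopg2.connect(DSN)
    try:
        with conn, conn.cursor() as cur:  # "with conn" commits on success
            cur.execute("DROP TABLE IF EXISTS t")
            cur.execute("CREATE TABLE t (id int PRIMARY KEY, state text)")
            cur.execute("INSERT INTO t VALUES (1, 'a'), (2, 'a')")
    finally:
        conn.close()

barrier = threading.Barrier(2)

def worker(first_id, second_id):
    conn = psycopg2.connect(DSN)
    try:
        with conn, conn.cursor() as cur:
            # Lock the first row...
            cur.execute("UPDATE t SET state = 'b' WHERE id = %s", (first_id,))
            barrier.wait()  # ensure both transactions hold their first lock
            # ...then try to lock the row the other transaction already holds.
            # One of the two sessions will fail with DeadlockDetected.
            cur.execute("UPDATE t SET state = 'c' WHERE id = %s", (second_id,))
    except psycopg2.errors.DeadlockDetected as exc:
        print(f"worker({first_id}, {second_id}) hit a deadlock: {exc}")
    finally:
        conn.close()

if __name__ == "__main__":
    setup()
    t1 = threading.Thread(target=worker, args=(1, 2))
    t2 = threading.Thread(target=worker, args=(2, 1))
    t1.start(); t2.start(); t1.join(); t2.join()
In this issue the two sides are the scheduler loop and the task-run code path, which (per the discussion further down) take the TaskInstance and DagRun locks in opposite orders.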
What you think should happen instead
This issue was observed while running a huge number of concurrent tasks created dynamically by a DAG. Some of the tasks get skipped because the scheduler restarts with the deadlock exception; the DAG should instead run all tasks to completion, as it does on v2.1.3.
How to reproduce
DAG file:
from propmix_listings_details import BUCKET, ZIPS_FOLDER, CITIES_ZIP_COL_NAME, DETAILS_DEV_LIMIT, DETAILS_RETRY, DETAILS_CONCURRENCY, get_api_token, get_values, process_listing_ids_based_zip
from airflow.utils.task_group import TaskGroup
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
}

date = '{{ execution_date }}'
email_to = ['example@airflow.com']
# Using a DAG context manager, you don't have to specify the dag property of each task
state = 'Maha'

with DAG('listings_details_generator_{0}'.format(state),
         start_date=datetime(2021, 11, 18),
         schedule_interval=None,
         max_active_runs=1,
         concurrency=DETAILS_CONCURRENCY,
         dagrun_timeout=timedelta(minutes=10),
         catchup=False,  # enable if you don't want historical dag runs to run
         ) as dag:

    t0 = DummyOperator(task_id='start')

    with TaskGroup(group_id='group_1') as tg1:
        token = get_api_token()
        zip_list = get_values(BUCKET, ZIPS_FOLDER + state, CITIES_ZIP_COL_NAME)
        for zip in zip_list[0:DETAILS_DEV_LIMIT]:
            details_operator = PythonOperator(
                task_id='details_{0}_{1}'.format(state, zip),  # task id is generated dynamically
                pool='pm_details_pool',
                python_callable=process_listing_ids_based_zip,
                task_concurrency=40,
                retries=3,
                retry_delay=timedelta(seconds=10),
                op_kwargs={'zip': zip, 'date': date, 'token': token, 'state': state},
            )

    t0 >> tg1
Operating System
kubernetes cluster running on GCP linux (amd64)
Versions of Apache Airflow Providers
pip freeze | grep apache-airflow-providers
apache-airflow-providers-amazon==1!3.2.0
apache-airflow-providers-cncf-kubernetes==1!3.0.0
apache-airflow-providers-elasticsearch==1!2.2.0
apache-airflow-providers-ftp==1!2.1.2
apache-airflow-providers-google==1!6.7.0
apache-airflow-providers-http==1!2.1.2
apache-airflow-providers-imap==1!2.2.3
apache-airflow-providers-microsoft-azure==1!3.7.2
apache-airflow-providers-mysql==1!2.2.3
apache-airflow-providers-postgres==1!4.1.0
apache-airflow-providers-redis==1!2.0.4
apache-airflow-providers-slack==1!4.2.3
apache-airflow-providers-snowflake==2.6.0
apache-airflow-providers-sqlite==1!2.1.3
apache-airflow-providers-ssh==1!2.4.3
Deployment
Astronomer
Deployment details
Airflow v2.2.5-2
Scheduler count: 2
Scheduler resources: 20AU (2 CPU and 7.5 GB)
Executor used: Celery
Worker count: 2
Worker resources: 24AU (2.4 CPU and 9 GB)
Termination grace period: 2 minutes
Anything else
This issue happens in every DAG run: some of the tasks get skipped, some succeed, and the scheduler fails with the deadlock exception.
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project’s Code of Conduct
Thanks for the REALLY detailed investigation @dstaple.
I finally had some time to take a look at this and I think your assessment was very correct.
However, the solution you proposed is not good, because I think we DO want to run “SELECT FOR UPDATE” on the DagRun table. The whole scheduling is based on the fact that the DagRun row gets locked and no changes happen to the DagRun or any TaskInstances of that DagRun while the scheduler processes those task instances. And since local_task_run potentially changes the state of the task instance it runs (that’s why it locks it for update), if that DagRun is currently “being processed” by any of the schedulers, we should hold off running the task until the scheduler finishes processing this particular DagRun and releases the lock.
In this case local_task_run actually locks the DagRun table too (though I am not entirely sure why; this is the one thing I do not understand completely, see below). So it does what it should, but with one very little caveat: it locks the TaskInstance and DagRun in REVERSE ORDER compared to what the scheduler does. This is the root cause of ALL the deadlocks (at least in Postgres; MySQL has its own fair share of other kinds of deadlocks): inconsistent ordering. A deadlock appears when two threads want two (or more) resources and take locks on them in reverse order. This is the only reason for this kind of deadlock, and your investigation showed really nicely what’s going on.
The solution to that is simple: since we are going to get the DagRun lock in a moment anyway in refresh_from_db, we should simply get the lock on the DagRun table FIRST. This should fix the problem, as we will then grab locks in the same sequence in the scheduler and in task_run: first DagRun, then TaskInstance. This is what my proposed #25266 does.
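To make that ordering concrete, here is a minimal SQLAlchemy sketch of the idea: lock the DagRun row first and only then the TaskInstance row, so every code path grabs locks in the same sequence. This is an illustration only, not the actual change in #25266; the helper name and session handling are made up.
# Sketch of "lock parent first, then child" with SQLAlchemy, illustrating the
# ordering principle described above. DagRun / TaskInstance are the Airflow
# models; `session` is assumed to be an open SQLAlchemy session.
from airflow.models import DagRun, TaskInstance

def lock_ti_consistently(session, dag_id, run_id, task_id):
    # 1. Lock the DagRun row first, matching the order used by the scheduler.
    dag_run = (
        session.query(DagRun)
        .filter(DagRun.dag_id == dag_id, DagRun.run_id == run_id)
        .with_for_update()
        .one()
    )
    # 2. Only then lock the TaskInstance row. Because every code path now
    #    acquires DagRun before TaskInstance, the reverse-order deadlock
    #    described above cannot occur.
    ti = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.dag_id == dag_id,
            TaskInstance.run_id == run_id,
            TaskInstance.task_id == task_id,
        )
        .with_for_update()
        .one()
    )
    return dag_run, ti
Within one transaction, re-acquiring FOR UPDATE on a row the transaction already holds does not block, so any implicit join back to dag_run in the second query (discussed below) should be harmless once the DagRun lock is already held.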
The only thing I do not know is WHY TaskInstance.refresh_from_db actually does the JOIN query. The original query in the code looks like this:
And there is no obvious reason why the last line joins the dag_run table.
I hope someone else in this thread might shed some light on it. I have a suspicion that SQLAlchemy will add the join when there is a ForeignKey with ON DELETE CASCADE involving the dag_id (which we have), but I could not find any reference or documentation that would point to such behaviour.
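For what it’s worth, one mechanism that does produce exactly this kind of implicit JOIN is an eager-loaded relationship (lazy="joined") rather than the ForeignKey itself. Whether that is what the Airflow mapping actually does here is an assumption on my part, but the behaviour itself is easy to demonstrate with a standalone sketch (hypothetical minimal models, in-memory SQLite):
# Standalone illustration (hypothetical models, in-memory SQLite) of how an
# eager-loaded relationship makes SQLAlchemy join the parent table even for a
# simple filter_by() query against the child model.
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class DagRun(Base):
    __tablename__ = "dag_run"
    id = Column(Integer, primary_key=True)
    dag_id = Column(String)

class TaskInstance(Base):
    __tablename__ = "task_instance"
    id = Column(Integer, primary_key=True)
    dag_run_id = Column(Integer, ForeignKey("dag_run.id", ondelete="CASCADE"))
    # The eager load is what pulls dag_run into every TaskInstance query.
    dag_run = relationship("DagRun", lazy="joined", innerjoin=True)

engine = create_engine("sqlite://", echo=True)  # echo=True prints the emitted SQL
Base.metadata.create_all(engine)

with Session(engine) as session:
    # Even though we only query TaskInstance, the emitted SQL joins dag_run.
    session.query(TaskInstance).filter_by(id=1).one_or_none()
With echo=True the emitted statement shows a JOIN onto dag_run even though only TaskInstance was queried; on Postgres, adding FOR UPDATE to such a joined query locks rows in both tables.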
@RNHTTR - since you mentioned you can reproduce the issue, maybe you could apply my fix and see if it solves the problem (there is a bit of a leap of faith with this change).
@ldacey @whitleykeith @argibbs @eitanme -> I spoke with some enlightened people 😃 (yeah, talking about you @ashb and @bbovenzi) and after the talk I have a hypothesis that this is the new Grid view doing auto-refresh for a long-running DAG.
There was a fix by @ashb, https://github.com/apache/airflow/pull/24284, that is going to be released in 2.3.3 and that significantly decreases the number of queries generated by the Grid view refresh. It’s a huge improvement and might impact both the load on the DB and possibly the memory usage of the webserver, especially if there are almost continuously running DAGs and a number of people leave the browser open with “auto-refresh” on in the Grid view.
Is there a way some of you could test this hypothesis and see if there might be a correlation (it requires a bit of coordination on what your users do)?
(BTW, if that’s it, then Ash’s fix is coming in 2.3.3.)
I would advise against doing this while running multiple schedulers – if you do then it is entirely possible that Airflow will not correctly respect configured concurrency limits for DAGs/Tasks/Pools. Edit: oh, or it will crash
I’m on it!
@dstaple - it would be great if you could check. I think we can merge it regardless (it’s super easy to revert), so there is no problem with some later checking. I was also not sure if the DELETE issue is the same. It could be (I have the scenario in my head):
A DELETE of a DagRun with CASCADE on TaskInstance first takes a lock on the DagRun and only THEN on the TaskInstance rows, very similarly to what the scheduler does.
And in this case the fix above should also help, so @RNHTTR I’d appreciate checking it 😃
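A rough sketch of that scenario (hypothetical DSN and schema standing in for dag_run / task_instance, not Airflow’s real tables), showing why a cascading delete takes its locks in the same parent-then-child order as the scheduler:
# Rough sketch of the cascading-delete scenario: the DELETE takes a row lock on
# the parent (dag_run-like) row first, and the ON DELETE CASCADE action then
# deletes -- and therefore locks -- the child (task_instance-like) rows, i.e.
# the same parent-then-child order the scheduler uses. Hypothetical DSN and
# schema; only the lock ordering is the point.
import psycopg2

DSN = "dbname=scratch user=postgres password=postgres host=localhost"  # hypothetical

conn = psycopg2.connect(DSN)
try:
    with conn, conn.cursor() as cur:
        cur.execute("""
            DROP TABLE IF EXISTS child;
            DROP TABLE IF EXISTS parent;
            CREATE TABLE parent (id int PRIMARY KEY);
            CREATE TABLE child (
                id int PRIMARY KEY,
                parent_id int REFERENCES parent (id) ON DELETE CASCADE
            );
            INSERT INTO parent VALUES (1);
            INSERT INTO child VALUES (10, 1), (11, 1);
        """)
    with conn, conn.cursor() as cur:
        # 1. The DELETE locks the parent row first.
        # 2. The cascade then deletes (and locks) the matching child rows.
        cur.execute("DELETE FROM parent WHERE id = %s", (1,))
finally:
    conn.close()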
Ahhhhh. That would indeed explain it. I think then that my solution is actually the right approach 😃
Very useful, thanks. I will take a look at it shortly.
I am also encountering this issue. I collected some details about both queries involved in the deadlock, hopefully this is helpful.
Deployment details:
In the deadlocks there is an UPDATE statement deadlocking with a SELECT … FOR UPDATE.
Based on stack traces visible in the scheduler logs, the UPDATE originates from the main scheduler loop here: https://github.com/apache/airflow/blob/2.2.5/airflow/models/dagrun.py#L901-L910
Based on the database logs, the SELECT statement has the form:
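(The captured statement itself is not reproduced here; the snippet below is only a hedged reconstruction of its general shape, inferred from the refresh_from_db code referenced next rather than copied from the logs. The column list is abbreviated and the bind parameters are placeholders.)
# Hedged reconstruction of the general shape of the SELECT involved, based on
# TaskInstance.refresh_from_db() referenced below -- NOT the literal statement
# captured in the Postgres logs.
RECONSTRUCTED_SELECT = """
    SELECT task_instance.*        -- the real query lists columns explicitly
    FROM task_instance
    JOIN dag_run
      ON dag_run.dag_id = task_instance.dag_id
     AND dag_run.run_id = task_instance.run_id
    WHERE task_instance.dag_id = %(dag_id)s
      AND task_instance.task_id = %(task_id)s
      AND task_instance.run_id = %(run_id)s
    LIMIT 1
    FOR UPDATE
"""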
Searching the Airflow source code, the query that looks most similar to the SELECT from the database error is in TaskInstance.refresh_from_db(): https://github.com/apache/airflow/blob/2.2.5/airflow/models/taskinstance.py#L714-L736
Example scheduler logs showing the origins of the UPDATE statement:
Example Postgres logs showing a complete SELECT … FOR UPDATE statement:
Unfortunately we are not able to reproduce this on a test instance, so I have not been able to try newer Airflow versions, but based on the discussion in this thread it sounds like the issue is still present at least as of 2.3.2.
Indeed, it is! My bad then for having set this param, thinking that Postgres would allow it. Thanks for the help, and sorry if I wasted your time 🙏