airflow: Airflow 2.1.0 with Schedulers HA Failing

Discussed in https://github.com/apache/airflow/discussions/17126

<div type='discussions-op-text'>

Originally posted by sorabhgit on July 21, 2021: Hello guys, I am also struggling with an issue while setting up scheduler HA with Airflow 2.1.0.

I’ve installed the Airflow scheduler on 2 separate nodes, both pointing to the same MySQL 8 database, but one of the scheduler logs shows the error below:

Steps to reproduce:

  1. Install Airflow 2.1.0 on 2 nodes using MySQL 8.0.25.
  2. Set use_row_level_locking = True (in airflow.cfg on both nodes).
  3. Start the scheduler, webserver, and Celery worker on node1, and just the scheduler on node2.
  4. Execute any example DAG; one of the schedulers will exit with the error below.
    [2021-07-01 08:15:04,342] {scheduler_job.py:1302} ERROR - Exception when executing SchedulerJob._run_scheduler_loop
    Traceback (most recent call last):
      File "/usr/local/lib64/python3.6/site-packages/mysql/connector/connection_cext.py", line 337, in get_rows
        else self._cmysql.fetch_row()
    _mysql_connector.MySQLInterfaceError: Statement aborted because lock(s) could not be acquired immediately and NOWAIT is set.

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1277, in _execute_context
        cursor, statement, parameters, context
      File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
        cursor.execute(statement, parameters)
      File "/usr/local/lib64/python3.6/site-packages/mysql/connector/cursor_cext.py", line 277, in execute
        self._handle_result(result)
      File "/usr/local/lib64/python3.6/site-packages/mysql/connector/cursor_cext.py", line 172, in _handle_result
        self._handle_resultset()
      File "/usr/local/lib64/python3.6/site-packages/mysql/connector/cursor_cext.py", line 671, in _handle_resultset
        self._rows = self._cnx.get_rows()[0]
      File "/usr/local/lib64/python3.6/site-packages/mysql/connector/connection_cext.py", line 368, in get_rows
        sqlstate=exc.sqlstate)
    mysql.connector.errors.DatabaseError: 3572 (HY000): Statement aborted because lock(s) could not be acquired immediately and NOWAIT is set.

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/site-packages/airflow/jobs/scheduler_job.py", line 1284, in _execute
        num_queued_tis = self._do_scheduling(session)
      File "/usr/local/lib/python3.6/site-packages/airflow/jobs/scheduler_job.py", line 1546, in _do_scheduling
        num_queued_tis = self._critical_section_execute_task_instances(session=session)
      File "/usr/local/lib/python3.6/site-packages/airflow/jobs/scheduler_job.py", line 1142, in _critical_section_execute_task_instances
        return func(*args, **kwargs)
      File "/usr/local/lib/python3.6/site-packages/airflow/jobs/scheduler_job.py", line 900, in _executable_task_instances_to_queued
        pools = models.Pool.slots_stats(lock_rows=True, session=session)
</div>
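For context on the failing frame: `Pool.slots_stats(lock_rows=True, ...)` acquires row locks with NOWAIT, which is exactly what MySQL error 3572 complains about when another session already holds the lock. Below is a minimal SQLAlchemy sketch of how such a query is built; the table and columns here are illustrative stand-ins, not Airflow's actual schema.

```python
from sqlalchemy import Column, Integer, MetaData, String, Table, select
from sqlalchemy.dialects import mysql

metadata = MetaData()
# Illustrative pool table -- a simplified guess, not Airflow's real schema.
pool = Table(
    "slot_pool",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("pool", String(256)),
    Column("slots", Integer),
)

# NOWAIT asks the database to fail immediately (error 3572 on MySQL 8)
# instead of queueing behind a lock held by another session.
stmt = select(pool.c.pool, pool.c.slots).with_for_update(nowait=True)
compiled = str(stmt.compile(dialect=mysql.dialect()))
print(compiled)  # SELECT ... FROM slot_pool FOR UPDATE NOWAIT
```

On a healthy single-database setup the scheduler simply retries its loop when this lock is contended; persistent failures suggest the lock is never being released, or is held on a different physical server.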

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 17 (7 by maintainers)

Most upvoted comments

We support Scheduler HA (running more than one scheduler): https://airflow.apache.org/docs/apache-airflow/stable/concepts/scheduler.html?highlight=scheduler+ha#running-more-than-one-scheduler. Our scheduler runs in Active/Active mode (both schedulers parse DAGs at the same time). This is supported on MySQL 8+ and should work; of course there might be some edge cases, but generally we have tested it and it works.

This is of course very different from Database HA, which is outside the realm of Airflow and is handled by your deployment. From the very beginning we developed Airflow 2 under the assumption that the database runs at most in Active/Passive mode. The comment in #14788 indicated that someone had a similar problem when running the DB in Active/Active mode behind Airflow (and there, switching to talk directly to only one physical DB helped). So my assumption was that one of the reasons is that you have a similar setup.

Also, we’ve seen similar problems with various proxies that provide a kind of poor man’s DB HA, where the proxy has several physical DB clusters behind it. We base our scheduler HA heavily on database locking, and locking is a hard problem to solve in an Active/Active setup.

That leads to the suggestion that this might be a similar case for you. If it is not, and you are 100% sure that you have a single physical DB behind the connection, then the problem needs deeper investigation and will take quite some time to resolve, possibly with some iterations here to find the reason (because we have not seen it in our tests).
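One quick way to check the "single physical DB" assumption is to sample MySQL's `@@server_uuid` over many fresh connections: a proxy routing to more than one backend will return more than one UUID. The helper below is a sketch; the fake proxy is simulated so the snippet runs without a database, and in real use `fetch_uuid` would open a new connection (e.g. via SQLAlchemy, disposing the engine between samples) and run `SELECT @@server_uuid`.

```python
from itertools import cycle


def distinct_server_uuids(fetch_uuid, samples=20):
    # fetch_uuid should open a *fresh* connection on each call and run
    # "SELECT @@server_uuid", so the proxy gets a chance to pick a backend.
    # More than one distinct value means multiple physical servers.
    return {fetch_uuid() for _ in range(samples)}


# Simulated proxy alternating between two physical servers:
fake_proxy = cycle(["uuid-a", "uuid-b"])
found = distinct_server_uuids(lambda: next(fake_proxy))
print(found)  # two UUIDs -> the endpoint is backed by more than one server
```

If this ever reports more than one UUID against your Airflow connection string, the locking errors above are expected, because the two schedulers are not contending on the same lock table.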

So if you are 100% sure you do not have multiple DBs being accessed at the same time (even through a single proxy), then my advice would be to switch to Postgres, as it might take quite a lot of time to find the cause (we’ve seen it in the past; sometimes people used customized versions of the databases with some functionality disabled, for example). Postgres is much more stable and less configurable (MySQL, for example, can have multiple engines with different capabilities), and there may be many other reasons why MySQL, especially a custom-configured one, creates problems.

Unfortunately, we have no capacity in the community to investigate these cases deeply for individual users, so unless you have the time and capacity to investigate it and provide more information, I am afraid it might take quite some time even to reproduce this kind of problem.

Going with Postgres is the much more “certain” route, and if timing matters to you, I’d heartily recommend it.

@ashb something we need to discuss when you return: it seems (needs confirmation) that some people connect Airflow HA schedulers to a DB in Active/Active mode and it causes the locking problem (MySQL in this case, as usual).

I think we might want to either be more explicit about this in Airflow, or detect it and inform the user (better), or possibly implement support for Active/Active mode (the best option, but it might not be possible/easy). Happy to discuss it when you are back from holidays 😉

  1. I think for now you can use a single scheduler until the problem is diagnosed and fixed. I am not sure we will be able to diagnose it and find the root cause quickly, and almost certainly the fix will require a change in Airflow that will likely take weeks or months to release. So there is little chance you will be unblocked quickly without working around it.

  2. I will keep repeating it: MySQL has a LOT of problems compared to Postgres. Locking, encoding, stability, you name it. If you still can, switch to Postgres.
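For the single-scheduler workaround in point 1, row-level locking can also be switched off entirely; this is only safe when exactly one scheduler is running, since the locks are what keep multiple schedulers from stepping on each other. A sketch of the relevant airflow.cfg fragment (option name as documented for Airflow 2.1):

```
[scheduler]
# Only safe with a SINGLE scheduler instance!
use_row_level_locking = False
```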