airflow: Random "duplicate key value violates unique constraint" errors when initializing the postgres database

Apache Airflow version

2.3.0 (latest released)

What happened

while testing airflow 2.3.0 locally (using postgresql 12.4), the webserver container shows random errors:

webserver_1  | + airflow db init
...
webserver_1  | + exec airflow webserver
...
webserver_1  | [2022-05-04 18:58:46,011] {{manager.py:568}} INFO - Added Permission menu access on Permissions to role Admin
postgres_1   | 2022-05-04 18:58:46.013 UTC [41] ERROR:  duplicate key value violates unique constraint "ab_permission_view_role_permission_view_id_role_id_key"
postgres_1   | 2022-05-04 18:58:46.013 UTC [41] DETAIL:  Key (permission_view_id, role_id)=(204, 1) already exists.
postgres_1   | 2022-05-04 18:58:46.013 UTC [41] STATEMENT:  INSERT INTO ab_permission_view_role (id, permission_view_id, role_id) VALUES (nextval('ab_permission_view_role_id_seq'), 204, 1) RETURNING ab_permission_view_role.id
webserver_1  | [2022-05-04 18:58:46,015] {{manager.py:570}} ERROR - Add Permission to Role Error: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "ab_permission_view_role_permission_view_id_role_id_key"
webserver_1  | DETAIL:  Key (permission_view_id, role_id)=(204, 1) already exists.
webserver_1  |
webserver_1  | [SQL: INSERT INTO ab_permission_view_role (id, permission_view_id, role_id) VALUES (nextval('ab_permission_view_role_id_seq'), %(permission_view_id)s, %(role_id)s) RETURNING ab_permission_view_role.id]
webserver_1  | [parameters: {'permission_view_id': 204, 'role_id': 1}]

notes:

  1. when the db is first initialized, i have ~40 errors like this (with ~40 different permission_view_id but always the same 'role_id': 1)
  2. when it’s not the first time initializing db, i always have 1 error like this but it shows different permission_view_id each time
  3. all these errors don’t seem to have any real negative effects, the webserver is still running and airflow is still running and scheduling tasks
  4. “occasionally” i do get real exceptions which render the webserver workers all dead:
postgres_1   | 2022-05-05 20:03:30.580 UTC [44] ERROR:  duplicate key value violates unique constraint "ab_permission_view_role_permission_view_id_role_id_key"
postgres_1   | 2022-05-05 20:03:30.580 UTC [44] DETAIL:  Key (permission_view_id, role_id)=(214, 1) already exists.
postgres_1   | 2022-05-05 20:03:30.580 UTC [44] STATEMENT:  INSERT INTO ab_permission_view_role (id, permission_view_id, role_id) VALUES (nextval('ab_permission_view_role_id_seq'), 214, 1) RETURNING ab_permission_view_role.id
webserver_1  | [2022-05-05 20:03:30 +0000] [121] [ERROR] Exception in worker process
webserver_1  | Traceback (most recent call last):
webserver_1  |   File "/usr/local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1705, in _execute_context
webserver_1  |     self.dialect.do_execute(
webserver_1  |   File "/usr/local/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 716, in do_execute
webserver_1  |     cursor.execute(statement, parameters)
webserver_1  | psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint "ab_permission_view_role_permission_view_id_role_id_key"
webserver_1  | DETAIL:  Key (permission_view_id, role_id)=(214, 1) already exists.
webserver_1  |
webserver_1  |
webserver_1  | The above exception was the direct cause of the following exception:
webserver_1  |
webserver_1  | Traceback (most recent call last):
webserver_1  |   File "/usr/local/lib/python3.8/site-packages/gunicorn/arbiter.py", line 589, in spawn_worker
webserver_1  |     worker.init_process()
webserver_1  |   File "/usr/local/lib/python3.8/site-packages/gunicorn/workers/base.py", line 134, in init_process
webserver_1  |     self.load_wsgi()
webserver_1  |   File "/usr/local/lib/python3.8/site-packages/gunicorn/workers/base.py", line 146, in load_wsgi
webserver_1  |     self.wsgi = self.app.wsgi()
webserver_1  |   File "/usr/local/lib/python3.8/site-packages/gunicorn/app/base.py", line 67, in wsgi
webserver_1  |     self.callable = self.load()
webserver_1  |   File "/usr/local/lib/python3.8/site-packages/gunicorn/app/wsgiapp.py", line 58, in load
webserver_1  |     return self.load_wsgiapp()
webserver_1  |   File "/usr/local/lib/python3.8/site-packages/gunicorn/app/wsgiapp.py", line 48, in load_wsgiapp
webserver_1  |     return util.import_app(self.app_uri)
webserver_1  |   File "/usr/local/lib/python3.8/site-packages/gunicorn/util.py", line 412, in import_app
webserver_1  |     app = app(*args, **kwargs)
webserver_1  |   File "/usr/local/lib/python3.8/site-packages/airflow/www/app.py", line 158, in cached_app
webserver_1  |     app = create_app(config=config, testing=testing)
webserver_1  |   File "/usr/local/lib/python3.8/site-packages/airflow/www/app.py", line 146, in create_app
webserver_1  |     sync_appbuilder_roles(flask_app)
webserver_1  |   File "/usr/local/lib/python3.8/site-packages/airflow/www/app.py", line 68, in sync_appbuilder_roles
webserver_1  |     flask_app.appbuilder.sm.sync_roles()
webserver_1  |   File "/usr/local/lib/python3.8/site-packages/airflow/www/security.py", line 580, in sync_roles
webserver_1  |     self.update_admin_permission()
webserver_1  |   File "/usr/local/lib/python3.8/site-packages/airflow/www/security.py", line 562, in update_admin_permission
webserver_1  |     self.get_session.commit()
webserver_1  |   File "<string>", line 2, in commit
webserver_1  |   File "/usr/local/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 1423, in commit
webserver_1  |     self._transaction.commit(_to_root=self.future)
webserver_1  |   File "/usr/local/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 829, in commit
webserver_1  |     self._prepare_impl()
webserver_1  |   File "/usr/local/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 808, in _prepare_impl
webserver_1  |     self.session.flush()
webserver_1  |   File "/usr/local/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 3255, in flush
webserver_1  |     self._flush(objects)
webserver_1  |   File "/usr/local/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 3395, in _flush
webserver_1  |     transaction.rollback(_capture_exception=True)
webserver_1  |   File "/usr/local/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
webserver_1  |     compat.raise_(
webserver_1  |   File "/usr/local/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
webserver_1  |     raise exception
webserver_1  |   File "/usr/local/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 3355, in _flush
webserver_1  |     flush_context.execute()
webserver_1  |   File "/usr/local/lib/python3.8/site-packages/sqlalchemy/orm/unitofwork.py", line 453, in execute
webserver_1  |     rec.execute(self)
webserver_1  |   File "/usr/local/lib/python3.8/site-packages/sqlalchemy/orm/unitofwork.py", line 576, in execute
webserver_1  |     self.dependency_processor.process_saves(uow, states)
webserver_1  |   File "/usr/local/lib/python3.8/site-packages/sqlalchemy/orm/dependency.py", line 1182, in process_saves
webserver_1  |     self._run_crud(
webserver_1  |   File "/usr/local/lib/python3.8/site-packages/sqlalchemy/orm/dependency.py", line 1245, in _run_crud
webserver_1  |     connection.execute(statement, secondary_insert)
webserver_1  |   File "/usr/local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1200, in execute
webserver_1  |     return meth(self, multiparams, params, _EMPTY_EXECUTION_OPTS)
webserver_1  |   File "/usr/local/lib/python3.8/site-packages/sqlalchemy/sql/elements.py", line 313, in _execute_on_connection
webserver_1  |     return connection._execute_clauseelement(
webserver_1  |   File "/usr/local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1389, in _execute_clauseelement
webserver_1  |     ret = self._execute_context(
webserver_1  |   File "/usr/local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1748, in _execute_context
webserver_1  |     self._handle_dbapi_exception(
webserver_1  |   File "/usr/local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1929, in _handle_dbapi_exception
webserver_1  |     util.raise_(
webserver_1  |   File "/usr/local/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
webserver_1  |     raise exception
webserver_1  |   File "/usr/local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1705, in _execute_context
webserver_1  |     self.dialect.do_execute(
webserver_1  |   File "/usr/local/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 716, in do_execute
webserver_1  |     cursor.execute(statement, parameters)
webserver_1  | sqlalchemy.exc.IntegrityError: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "ab_permission_view_role_permission_view_id_role_id_key"
webserver_1  | DETAIL:  Key (permission_view_id, role_id)=(214, 1) already exists.
webserver_1  |
webserver_1  | [SQL: INSERT INTO ab_permission_view_role (id, permission_view_id, role_id) VALUES (nextval('ab_permission_view_role_id_seq'), %(permission_view_id)s, %(role_id)s) RETURNING ab_permission_view_role.id]
webserver_1  | [parameters: {'permission_view_id': 214, 'role_id': 1}]
webserver_1  | (Background on this error at: http://sqlalche.me/e/14/gkpj)
webserver_1  | [2022-05-05 20:03:30 +0000] [121] [INFO] Worker exiting (pid: 121)
flower_1     | + exec airflow celery flower
scheduler_1  | + exec airflow scheduler
webserver_1  | [2022-05-05 20:03:31 +0000] [118] [INFO] Worker exiting (pid: 118)
webserver_1  | [2022-05-05 20:03:31 +0000] [119] [INFO] Worker exiting (pid: 119)
webserver_1  | [2022-05-05 20:03:31 +0000] [120] [INFO] Worker exiting (pid: 120)
worker_1     | + exec airflow celery worker

However such exceptions are rare and pure random, i can’t find a way to reproduce them consistently.

What you think should happen instead

prior to 2.3.0 there were no such errors

How to reproduce

No response

Operating System

Linux Mint 20.3

Versions of Apache Airflow Providers

No response

Deployment

Docker-Compose

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 16
  • Comments: 70 (55 by maintainers)

Commits related to this issue

Most upvoted comments

I don’t think downgrade library will work? Seems like we have to wait for v2.3.1.

I am also facing exact thing for v2.3.0 using docker-compose installation

Not sure if this helps speed up the troubleshooting process but I’m seeing the same duplicate key value violates unique constraint error when running airflow in standalone mode with one gunicorn worker.

interestingly, I only started getting this error message when I switched AUTH_TYPE from AUTH_DB to AUTH_OAUTH

image

How do you get past the error? Every time our webserver pod spins up it fails on this, kills the gunicorn works, and then restarts.

if the preload fix is indeed the fix, you can suppress this for now by setting the GUNICORN_CMD_ARGS environment variable to --preload

No confusion. There were two similar issues - one of them (for some users) fixed in 2.4.2 and another similar issue for some other users that will be released in 2.4.3. The milestone is always “current” thinking and if it moves, it moves.

i checked out 2.4.2 and did

wget -qO - https://github.com/apache/airflow/pull/27297.patch | git apply -v -3

then built my own airflow

breeze release-management prepare-airflow-package --package-format=wheel --verbose

then installed it

pip install apache_airflow-2.4.2-py3-none-any.whl[...] --constraint ...

no more “duplicate key value violates unique constraint” errors 🚀 🚀 🚀 🚀 🚀 🚀

Just to repeat it @jhtimmins - I think you are the only one who somewhat knows what the problem is about and what the solution might be (and you attempted to fix it in #24065). Do you have a clue what it could be? I think I would be willing to help with that and maybe even be able to provide some proposals to improve, but I have no clue where to start and what the reason might be (and seems that this one is still plaguing some users).

Any please where we could start looking at/ideas/hypotheses what could be wrong?

both https://github.com/apache/airflow/blob/ee100a592e07761f3b32294f3ad82c4a6c7cf74d/airflow/www/app.py#L136 and https://github.com/apache/airflow/blob/ee100a592e07761f3b32294f3ad82c4a6c7cf74d/airflow/www/app.py#L146 are doing clashing reads/writes to the database once this line https://github.com/apache/airflow/blob/ee100a592e07761f3b32294f3ad82c4a6c7cf74d/airflow/cli/commands/webserver_command.py#L475 spins up 4 (by default) gunicorn workers. so i guess if i do this

            os.environ['SKIP_DAGS_PARSING'] = 'True'
            app = create_app(None)
            os.environ.pop('SKIP_DAGS_PARSING')

            # set webserver.update_fab_perms to "False" here 
            # would cause the workers not to do the clashing reads/writes

            with subprocess.Popen(run_args, close_fds=True) as gunicorn_master_proc:
                monitor_gunicorn(gunicorn_master_proc.pid)

i know it’s weird to alter it on the fly. an alternative is to set an env var such as

os.environ['UPDATE_FAB_PERMS_DONE'] = 'True'

and then check it along side with webserver.update_fab_perms in https://github.com/apache/airflow/blob/ee100a592e07761f3b32294f3ad82c4a6c7cf74d/airflow/www/app.py#L67 and https://github.com/apache/airflow/blob/ee100a592e07761f3b32294f3ad82c4a6c7cf74d/airflow/www/extensions/init_appbuilder.py#L643

i tried the env var solution, it worked. but it’s janky. it requires an extra env var 🤔