airflow: Tasks being marked as failed even after running successfully
Apache Airflow version
2.7.1
What happened
When running a dag, a task’s logs will show that it ran successfully and completed without error, but the task is marked as failed.
See the (slightly redacted, sorry!) image for an example:
As you can see, the log ends with the messages:
[2023-09-13, 12:26:30 BST] {subprocess.py:97} INFO - Command exited with return code 0
[2023-09-13, 12:26:31 BST] {taskinstance.py:1398} INFO - Marking task as SUCCESS. dag_id=STOXX600_Index_1.ANC_110.AEE_110.QTR1.Realtime, task_id=Shared-Data.split-adjustments-combined, execution_date=20230913T090500, start_date=20230913T112612, end_date=20230913T112631
[2023-09-13, 12:26:31 BST] {local_task_job_runner.py:228} INFO - Task exited with return code 0
And yet the task has been marked as failed. 😱
Even more “interesting” (aka worrying), it’s been marked as failed on its first attempt, and yet the task is configured with retries=2. The retry attempts setting has been ignored.
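For context, a minimal sketch of how such a task is configured (names and the command are illustrative placeholders, not our actual pipeline; the relevant part is retries=2 on a shell-style task, which is what produces the subprocess.py log lines above):

```python
# Illustrative sketch only - not our real DAG. The point is that the task
# declares retries=2, so a genuine failure should trigger up to two more
# attempts before Airflow marks the task as failed.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_retries_dag",            # hypothetical dag_id
    start_date=datetime(2023, 9, 1),
    schedule="5 9 * * *",
    catchup=False,
    default_args={
        "retries": 2,                         # the setting that appears to be ignored
        "retry_delay": timedelta(minutes=5),
    },
) as dag:
    BashOperator(
        task_id="split-adjustments-combined",
        bash_command="echo 'run split adjustments here'",  # placeholder command
    )
```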
My version history has been:
- 2.3.3
- 2.7.0
- 2.7.1
(I was naughty and did not keep up with releases.) This issue is very definitely present on 2.7.0 and 2.7.1; I never saw it on 2.3.3.
What you think should happen instead
I mean, stating the obvious, but:
- The task should not have been marked as failed.
- Even if it had been seen as failed, it should have retried.
How to reproduce
I have no idea (yet) how to reproduce this problem outside of our existing environments.
Things I am looking at / have ruled out:
- The problem exists in both our dev and uat envs. I have not upgraded prod, for obvious reasons.
- Memory - the workers and schedulers are given access to the full memory of the box; it’s shared with other processes, but even the smallest box has 5 GB free at all times (including during the failure window).
- We currently run 3 schedulers; I am trying to run with a single scheduler to see if that helps (since I am wondering if the schedulers are somehow conflicting with each other).
Operating System
CentOS 7, running a Docker image of Airflow 2.7.1.
Versions of Apache Airflow Providers
These should be consistent with the constraints published for 2.7.1:
apache-airflow-providers-celery==3.3.3
apache-airflow-providers-common-sql==1.7.1
apache-airflow-providers-docker==3.7.4
apache-airflow-providers-ftp==3.5.1
apache-airflow-providers-http==4.5.1
apache-airflow-providers-imap==3.3.1
apache-airflow-providers-microsoft-mssql==3.4.2
apache-airflow-providers-odbc==4.0.0
apache-airflow-providers-postgres==5.6.0
apache-airflow-providers-redis==3.3.1
apache-airflow-providers-slack==8.0.0
apache-airflow-providers-sqlite==3.4.3
Deployment
Docker-Compose
Deployment details
We build the Airflow image ourselves (because we’ve been building it since 1.10, and it’s less hassle to bump the version number than to switch to the official image).
It uses Python 3.10 as a base.
Our deployment runs Postgres 13.1 + bundled Celery + Redis 6.0.9 + workers on-prem. We have historically run 3 HA schedulers, but as noted I’m trying to run with a single scheduler to see if that helps with this issue.
The other (possibly related) change is that in 2.3.3 we had max_tis_per_query = 512, and in 2.7.x we are using max_tis_per_query = 16, as per the 2.7.0 release notes.
We also have the following config overrides, which are unchanged since 2.3.3 (a quick sketch for confirming the effective values follows this list):
- AIRFLOW__CORE__PARALLELISM=1000
- AIRFLOW__SCHEDULER__PARSING_PROCESSES=3
- AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL=120
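For completeness, here is roughly how I confirm the effective values inside the scheduler/worker containers (a throwaway diagnostic sketch, not part of our deployment):

```python
# Throwaway diagnostic sketch: run inside a scheduler or worker container
# to confirm that the env-var overrides above are what Airflow actually sees.
from airflow.configuration import conf

for section, key in [
    ("core", "parallelism"),
    ("scheduler", "max_tis_per_query"),
    ("scheduler", "parsing_processes"),
    ("scheduler", "min_file_process_interval"),
]:
    print(f"{section}.{key} = {conf.getint(section, key)}")
```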
Anything else
Our install has ~200 dags. Most dags are small, with only a handful of dags with >10 tasks per run.
The problem seems to happen fairly frequently (we run ~6 similar-looking dags 4 times a day), and I see failures in multiple dags.
The behaviour sounds quite similar to #33155. I have also noticed tasks which don’t appear to have any logs, but I can trace the task in Flower (Celery) and see that the worker ran the task successfully; it’s even marked as success in Flower. There are just no logs in the Airflow GUI, and the task has been marked as failed. For example, here are some logs from the worker for a task that is marked as failed in the GUI, with no log file, but which clearly ran OK:
[2023-09-13 09:40:12,903: INFO/MainProcess] Task airflow.providers.celery.executors.celery_executor_utils.execute_command[723d73ae-323f-421b-8350-44a0c08c82d3] received
[2023-09-13 09:40:12,981: INFO/ForkPoolWorker-7] [723d73ae-323f-421b-8350-44a0c08c82d3] Executing command in Celery: ['airflow', 'tasks', 'run', '__redacted_dag_name__', 'Notebooks.notebook', 'scheduled__2023-09-13T07:05:00+00:00', '--local', '--subdir', 'DAGS_FOLDER/pipeline_estimates.py']
[2023-09-13 09:41:45,673: INFO/ForkPoolWorker-7] Task airflow.providers.celery.executors.celery_executor_utils.execute_command[723d73ae-323f-421b-8350-44a0c08c82d3] succeeded in 92.75607865909114s: None
The upgrade to 2.7.1 was performed in isolation. Our dev and UAT envs are failing consistently; our prod env (which is still on 2.3.3) runs almost exactly the same set of dags and has no errors.
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project’s Code of Conduct
About this issue
- State: closed
- Created 10 months ago
- Comments: 28 (28 by maintainers)
I was just thinking: is it possible that the combination of Airflow + Prefect + Celery could occasionally break something in multiprocessing if they run in the same process?
My initial point was not about starting a “Holy War”; let’s keep that for Reddit.