airflow: DagProcessor Performance Regression
Apache Airflow version
2.5.3
What happened
Upgrading from 2.4.3 to 2.5.3 caused a significant increase in DAG processing time on the standalone DAG processor (from ~1-2s to ~60s):
/opt/airflow/dags/ecco_airflow/dags/image_processing/product_image_load.py 0 -1 56.68s 2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/known_consumers/known_consumers.py 0 -1 56.64s 2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/monitoring/row_counts.py 0 -1 56.67s 2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/omnichannel/base.py 0 -1 56.66s 2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/omnichannel/oc_data.py 0 -1 56.67s 2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/omnichannel/oc_stream.py 0 -1 56.52s 2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/reporting/reporting_data_foundation.py 0 -1 56.63s 2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/retail_analysis/retail_analysis_dbt.py 0 -1 56.66s 2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/rfm_segments/rfm_segments.py 0 -1 56.02s 2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/utils/airflow.py 0 -1 56.65s 2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/bronze/aad_users_listing.py 1 0 55.51s 2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/bronze/funnel_io.py 1 0 56.13s 2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/bronze/iar_param.py 1 0 56.50s 2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/bronze/sfmc_copy.py 1 0 56.59s 2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/bronze/us_legacy_datawarehouse.py 1 0 55.15s 2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/cdp/ecco_cdp_auditing.py 1 0 56.54s 2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/cdp/ecco_cdp_budget_daily_phasing.py 1 0 56.63s 2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/cdp/ecco_cdp_gold_rm_tests.py 1 0 55.00s 2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/consumer_entity_matching/graph_entity_matching.py 1 0 56.67s 2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/data_backup/data_backup.py 1 0 56.69s 2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/hive/adhoc_entity_publish.py 1 0 55.33s 2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/image_regression/train.py 1 0 56.63s 2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/maintenance/db_maintenance.py 1 0 56.58s 2023-04-26T12:56:15
We are also seeing messages like these:
[2023-04-26T12:56:15.322+0000] {manager.py:979} DEBUG - Processor for /opt/airflow/dags/ecco_airflow/dags/bronze/us_legacy_datawarehouse.py finished
[2023-04-26T12:56:15.323+0000] {processor.py:296} DEBUG - Waiting for <ForkProcess name='DagFileProcessor68-Process' pid=116 parent=7 stopped exitcode=0>
[2023-04-26T12:56:15.323+0000] {manager.py:979} DEBUG - Processor for /opt/airflow/dags/ecco_airflow/dags/cdp/ecco_cdp_gold_rm_tests.py finished
[2023-04-26T12:56:15.323+0000] {processor.py:296} DEBUG - Waiting for <ForkProcess name='DagFileProcessor69-Process' pid=122 parent=7 stopped exitcode=0>
[2023-04-26T12:56:15.324+0000] {manager.py:979} DEBUG - Processor for /opt/airflow/dags/ecco_airflow/dags/bronze/streaming/sap_inventory_feed.py finished
[2023-04-26T12:56:15.324+0000] {processor.py:314} DEBUG - Waiting for <ForkProcess name='DagFileProcessor70-Process' pid=128 parent=7 stopped exitcode=-SIGKILL>
[2023-04-26T12:56:15.324+0000] {manager.py:986} ERROR - Processor for /opt/airflow/dags/ecco_airflow/dags/bronze/streaming/sap_inventory_feed.py exited with return code -9.
In 2.4.3:
/opt/airflow/dags/ecco_airflow/dags/image_regression/train.py 1 0 1.34s 2023-04-26T14:19:08
/opt/airflow/dags/ecco_airflow/dags/known_consumers/known_consumers.py 1 0 1.12s 2023-04-26T14:19:00
/opt/airflow/dags/ecco_airflow/dags/maintenance/db_maintenance.py 1 0 0.63s 2023-04-26T14:18:27
/opt/airflow/dags/ecco_airflow/dags/monitoring/row_counts.py 1 0 3.74s 2023-04-26T14:18:45
/opt/airflow/dags/ecco_airflow/dags/omnichannel/oc_data.py 1 0 1.21s 2023-04-26T14:18:47
/opt/airflow/dags/ecco_airflow/dags/omnichannel/oc_stream.py 1 0 1.22s 2023-04-26T14:18:30
/opt/airflow/dags/ecco_airflow/dags/reporting/reporting_data_foundation.py 1 0 1.39s 2023-04-26T14:19:08
/opt/airflow/dags/ecco_airflow/dags/retail_analysis/retail_analysis_dbt.py 1 0 1.32s 2023-04-26T14:18:51
/opt/airflow/dags/ecco_airflow/dags/rfm_segments/rfm_segments.py 1 0 1.20s 2023-04-26T14:18:34
What you think should happen instead
DAG processing time should remain unchanged.
How to reproduce
Provision Airflow with the following settings:
Airflow 2.5.3
- K8s 1.25.6
- Kubernetes executor
- Postgres backend (Postgres 11.0)
- Deploy using Airflow Helm v1.9.0 with image 2.5.3-python3.9
- pgbouncer enabled
- standalone DAG processor with 3500m CPU / 4000Mi memory, single replica
- dags and logs mounted from RWM volume (Azure files)
Airflow 2.4.3
- K8s 1.25.6
- Kubernetes executor
- Postgres backend (Postgres 11.0)
- Deploy using Airflow Helm v1.7.0 with image 2.4.3-python3.9
- pgbouncer enabled
- standalone DAG processor with 2500m CPU / 2000Mi memory, single replica
- dags and logs mounted from RWM volume (Azure files)
Image modifications
We use an image built from apache/airflow:2.4.3-python3.9, with some dependencies added or reinstalled at different versions.
Poetry dependency spec:
For 2.5.3:
[tool.poetry.dependencies]
python = ">=3.9,<3.11"
authlib = "~1.0.1"
adapta = { version = "==2.2.3", extras = ["azure", "storage"] }
numpy = "==1.23.3"
db-dtypes = "~1.0.4"
gevent = "^21.12.0"
sqlalchemy = ">=1.4,<2.0"
snowflake-sqlalchemy = ">=1.4,<2.0"
esd-services-api-client = "~0.6.0"
apache-airflow-providers-common-sql = "~1.3.1"
apache-airflow-providers-databricks = "~3.1.0"
apache-airflow-providers-google = "==8.4.0"
apache-airflow-providers-microsoft-azure = "~5.2.1"
apache-airflow-providers-datadog = "~3.0.0"
apache-airflow-providers-snowflake = "~3.3.0"
apache-airflow = "==2.5.3"
dataclasses-json = ">=0.5.7,<0.6"
For 2.4.3:
[tool.poetry.dependencies]
python = ">=3.9,<3.11"
authlib = "~1.0.1"
adapta = { version = "==2.2.3", extras = ["azure", "storage"] }
numpy = "==1.23.3"
db-dtypes = "~1.0.4"
gevent = "^21.12.0"
sqlalchemy = ">=1.4,<2.0"
snowflake-sqlalchemy = ">=1.4,<2.0"
esd-services-api-client = "~0.6.0"
apache-airflow-providers-common-sql = "~1.3.1"
apache-airflow-providers-databricks = "~3.1.0"
apache-airflow-providers-google = "==8.4.0"
apache-airflow-providers-microsoft-azure = "~5.2.1"
apache-airflow-providers-datadog = "~3.0.0"
apache-airflow-providers-snowflake = "~3.3.0"
apache-airflow = "==2.4.3"
dataclasses-json = ">=0.5.7,<0.6"
Operating System
Container OS: Debian GNU/Linux 11 (bullseye)
Versions of Apache Airflow Providers
apache-airflow-providers-amazon==6.0.0
apache-airflow-providers-celery==3.0.0
apache-airflow-providers-cncf-kubernetes==4.4.0
apache-airflow-providers-common-sql==1.3.4
apache-airflow-providers-databricks==3.1.0
apache-airflow-providers-datadog==3.0.0
apache-airflow-providers-docker==3.2.0
apache-airflow-providers-elasticsearch==4.2.1
apache-airflow-providers-ftp==3.3.1
apache-airflow-providers-google==8.4.0
apache-airflow-providers-grpc==3.0.0
apache-airflow-providers-hashicorp==3.1.0
apache-airflow-providers-http==4.3.0
apache-airflow-providers-imap==3.1.1
apache-airflow-providers-microsoft-azure==5.2.1
apache-airflow-providers-mysql==3.2.1
apache-airflow-providers-odbc==3.1.2
apache-airflow-providers-postgres==5.2.2
apache-airflow-providers-redis==3.0.0
apache-airflow-providers-sendgrid==3.0.0
apache-airflow-providers-sftp==4.1.0
apache-airflow-providers-slack==6.0.0
apache-airflow-providers-snowflake==3.3.0
apache-airflow-providers-sqlite==3.3.2
apache-airflow-providers-ssh==3.2.0
Deployment
Official Apache Airflow Helm Chart
Deployment details
See the "How to reproduce" section above.
Anything else
Occurs when upgrading the Helm chart from 1.7.0 (Airflow 2.4.3) to 1.9.0 (Airflow 2.5.3).
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project’s Code of Conduct
About this issue
- State: closed
- Created a year ago
- Comments: 17 (10 by maintainers)
Commits related to this issue
- Fix dag file processor hertbeat to run only if necessary The standalone file processor as of #30278 introduced accidentally an artifficial delay between dag processing by adding heartbeat but missing... — committed to potiuk/airflow by potiuk a year ago
- Fix dag file processor hertbeat to run only if necessary (#30899) The standalone file processor as of #30278 introduced accidentally an artifficial delay between dag processing by adding heartbeat ... — committed to apache/airflow by potiuk a year ago
- Fix dag file processor hertbeat to run only if necessary (#30899) The standalone file processor as of #30278 introduced accidentally an artifficial delay between dag processing by adding heartbeat bu... — committed to apache/airflow by potiuk a year ago
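The commits above describe the root cause: the standalone processor started emitting a heartbeat between every file parse, which introduced an artificial delay. As an editorial illustration (not the actual Airflow code - the names and structure below are made up), the fix amounts to rate-limiting the heartbeat so it only runs when the heartbeat interval has actually elapsed:

```python
import time

# Conceptual sketch only. The idea from the commits above: heartbeat at most
# once per interval instead of unconditionally after every parsed DAG file.

HEARTBEAT_INTERVAL = 5.0  # seconds; stands in for the configured job heartbeat
_last_heartbeat = 0.0


def heartbeat_if_due(emit_heartbeat) -> None:
    """Call emit_heartbeat() only if HEARTBEAT_INTERVAL has elapsed."""
    global _last_heartbeat
    now = time.monotonic()
    if now - _last_heartbeat >= HEARTBEAT_INTERVAL:
        emit_heartbeat()
        _last_heartbeat = now


# In the file-processing loop, each finished file calls heartbeat_if_due()
# rather than an unconditional heartbeat that can block for the full interval.
```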
Combining answers here: I'll try with the 2.6.0 RC, or maybe the released version if it is out before I get an hour to run a test.
I can pinpoint a few things:
We always try to import only what we need in DAG files, but sometimes there are imports used only for type hints.
No, never.
The infamous Variable.get(..) is the one thing against Airflow best practices that we use in all DAGs. However, our Airflow is configured to parse only every 30s instead of every 5s, and we only read a single variable (a top-level JSON configuration for the pipeline). In some DAGs we also read 1-2 connections stored in the Airflow DB. That's it.
See notes above.
Memory usage on the DAG file processor pod starts at around 180 MB, grows to around 1.6 GB during the initial parse, then stabilizes around 700 MB; CPU usage drops to ~2-3% from an initial ~50%.
No, we also have SLA and email/error callbacks fully disabled.
Please find below a full example of a DAG that consists of only one task:
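(The reporter's actual DAG is not reproduced in this extract. Below is a minimal, hypothetical sketch matching the shape described above - a single task and one top-level Variable.get() reading a JSON configuration; the variable name, dag_id, and schedule are made up.)

```python
import json

import pendulum
from airflow import DAG
from airflow.models import Variable
from airflow.operators.python import PythonOperator

# Top-level variable read -- the pattern discussed in this thread; it runs on
# every parse of this file by the DAG file processor.
pipeline_config = Variable.get("pipeline_config", default_var={}, deserialize_json=True)

with DAG(
    dag_id="example_single_task",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule="@daily",
    catchup=False,
) as dag:

    def run(**context):
        # Placeholder body; the real task would use the config and connections.
        print(json.dumps(pipeline_config))

    PythonOperator(task_id="run", python_callable=run)
```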
Cool. We will get RC3 soon.
I think this is actually a very good point. It makes me think that we should have DIFFERENT heartbeat expectations for local tasks (and generally for each job type we have; as mentioned above, historically they have been the same).
I created an issue for that and will implement it soon as part of our AIP-44 work: https://github.com/apache/airflow/issues/30908
One more (really kind) request - not now, but in case we do not find anything before you have a chance to experiment: it almost looks like something prevents Variable.get() from obtaining a connection to retrieve anything from the database, and that this single Variable.get() is blocking everything. That gives me some ideas of where to look (and of why it may not have been experienced by others who have no Variables in top-level code). If we do not get there, simply removing the Variable.get() and hardcoding its output might be a quick way to test that hypothesis.
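A minimal sketch of that quick test, with made-up names: replace the top-level Variable.get() with a hardcoded value so parsing never touches the metadata database; if parse times drop back to ~1-2s, the top-level DB access is the likely culprit.

```python
# Before: top-level DB access on every parse (hypothetical variable name).
# pipeline_config = Variable.get("pipeline_config", deserialize_json=True)

# After: no DB access at parse time; placeholder values stand in for the
# variable's real content.
pipeline_config = {"source": "placeholder", "target": "placeholder"}
```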
This is the Dockerfile content for our current image - also the one we tried to upgrade from:
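(The actual Dockerfile is not reproduced in this extract. A minimal sketch of what such an extension of the official image could look like, assuming the Poetry spec above has been exported to a requirements.txt beforehand:)

```dockerfile
# Hypothetical reconstruction, not the reporter's Dockerfile.
# Assumes dependencies were exported first, e.g.:
#   poetry export --without-hashes -o requirements.txt
FROM apache/airflow:2.4.3-python3.9

COPY requirements.txt /requirements.txt
RUN pip install --no-cache-dir -r /requirements.txt
```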
I don't have the SHA, unfortunately, and we do not rebuild images on a regular basis, so it might not be exactly the one available on Docker Hub right now.
The update to 2.5.3 is simple, and the Poetry dependency spec is provided in the issue description above.
If you want to build the exact same image, you can use this GH workflow:
I can provide the lock file if you want to go this way.
Oh whoa! That's what I call "deep" data (not big, but "deep") 😃. Thanks for that - it's super helpful. Quick question though: what version did you have before 2.5.3 / roll back to?
And just to explain why: I will (in the meantime) try to reproduce it locally and see if we can observe similar behaviour on our installation. Just knowing that this is the cause could allow us to remediate it by removing that commit in the 2.6.0 RC while we can, even if we do not know exactly what mechanism is behind the slow-down.
Adding more info:
Let me know if you want the Helm values we pass when deploying.
@george-zubrienko - I have some extra questions that might help find the root cause.
Does it happen all the time for all DAGs parsed, or some of the time for some of the DAGs? Can you please describe the pattern you see?
Could you please tell us more about (and possibly share some anonymised examples of) the top-level code of the DAGs that experience this behaviour, and whether you use any of the following:
Generally speaking, I would like to know whether your top-level code violates any of the best practices described in https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#top-level-python-code
Do you see any excessive memory use and swapping while parsing happens for the DAG file processor?
Do you use callbacks in your DAGs that experience the problem?
Having those answers might help in investigating the root cause - the problem you describe is not easily reproducible, so we need to know what is so special about your DAGs or environment that triggers this behaviour. With the related #305903 it seemed that the reason might be MySQL, but knowing that you have Postgres suggests that either the problem is different, or it is something completely different from what we suspected before.