airflow: Log files are still being cached, causing ever-growing memory usage while the scheduler is running

Apache Airflow version

2.4.1

What happened

My Airflow scheduler's memory usage started to grow after I turned on the dag_processor_manager log by setting

export CONFIG_PROCESSOR_MANAGER_LOGGER=True

see the red arrow in the screenshot below

[screenshot: scheduler memory usage graph, 2022-10-11]

By looking closely at the memory usage, as described in https://github.com/apache/airflow/issues/16737#issuecomment-917677177, I discovered that it was the cache memory that kept growing:

[screenshot: cache memory growth, 2022-10-12]

Then I turned off the dag_processor_manager log and memory usage returned to normal (no longer growing, steady at ~400 MB).

This issue is similar to #14924 and #16737; this time the culprit is the rotating log files under ~/logs/dag_processor_manager/dag_processor_manager.log*.

What you think should happen instead

Cache memory shouldn’t keep growing like this

How to reproduce

Turn on the dag_processor_manager log by setting

export CONFIG_PROCESSOR_MANAGER_LOGGER=True

in entrypoint.sh, then monitor the scheduler's memory usage (a minimal monitoring sketch is below).
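To see whether the growth is page cache rather than RSS, here is a minimal sketch that can be run inside the scheduler container. It assumes cgroup v1 paths; on cgroup v2 the file is /sys/fs/cgroup/memory.stat and the field names differ (e.g. "file" instead of "cache").

```python
# Minimal sketch: watch RSS vs. page cache of the scheduler's cgroup (cgroup v1 assumed).
import time


def cgroup_memory_stat(path="/sys/fs/cgroup/memory/memory.stat"):
    stats = {}
    with open(path) as f:
        for line in f:
            key, value = line.split()
            stats[key] = int(value)
    return stats


while True:
    stats = cgroup_memory_stat()
    print(f"rss={stats.get('rss', 0) / 2**20:.1f} MiB  "
          f"cache={stats.get('cache', 0) / 2**20:.1f} MiB")
    time.sleep(60)
```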

Operating System

Debian GNU/Linux 10 (buster)

Versions of Apache Airflow Providers

No response

Deployment

Other Docker-based deployment

Deployment details

k8s

Anything else

I’m not sure why the previous fix https://github.com/apache/airflow/pull/18054 has stopped working 🤔
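As far as I understand, the earlier fix advised the kernel not to keep log file pages in the page cache via posix_fadvise. A rough, hypothetical sketch of that general approach applied to a rotating handler (this is not the actual Airflow code, just an illustration of the technique):

```python
# Hypothetical sketch of the posix_fadvise approach: after opening the log file,
# tell the kernel it does not need to keep the file's cached pages around.
import logging.handlers
import os


class NonCachingRotatingFileHandler(logging.handlers.RotatingFileHandler):
    def _open(self):
        stream = super()._open()
        try:
            # POSIX_FADV_DONT_NEED drops the file's currently cached pages.
            os.posix_fadvise(stream.fileno(), 0, 0, os.POSIX_FADV_DONT_NEED)
        except (AttributeError, OSError):
            pass  # non-POSIX platform or unsupported filesystem
        return stream
```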

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 25 (25 by maintainers)

Most upvoted comments

Worse than a red herring, this is a mirage 😆 I can't change the kernel's behavior, so I'll just change my own config:

LOGGING_CONFIG["handlers"]["processor_manager"].update(
    {
        "maxBytes": 10485760,  # 10 MB per log file
        "backupCount": 3,      # keep only 3 rotated files
    }
)

this caps the cache memory usage at roughly 40-50 MB
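For reference, a minimal sketch of where such an override could live, assuming a custom logging config module referenced via AIRFLOW__LOGGING__LOGGING_CONFIG_CLASS (the module name is illustrative, and the processor_manager handler is only present when CONFIG_PROCESSOR_MANAGER_LOGGER=True):

```python
# config/custom_log_config.py (hypothetical module name)
# Point Airflow at it with:
#   AIRFLOW__LOGGING__LOGGING_CONFIG_CLASS=custom_log_config.LOGGING_CONFIG
from copy import deepcopy

from airflow.config_templates.airflow_local_settings import DEFAULT_LOGGING_CONFIG

LOGGING_CONFIG = deepcopy(DEFAULT_LOGGING_CONFIG)

# The processor_manager handler only exists when CONFIG_PROCESSOR_MANAGER_LOGGER=True.
if "processor_manager" in LOGGING_CONFIG["handlers"]:
    LOGGING_CONFIG["handlers"]["processor_manager"].update(
        {
            "maxBytes": 10485760,  # 10 MB per log file
            "backupCount": 3,      # keep only 3 rotated files
        }
    )
```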

BTW, I've heard VERY bad things about EFS when it is used to share DAGs. It has a profound impact on the stability and performance of Airflow if you have a large number of DAGs, unless you pay big bucks for IOPS. I've heard that from many people. This is the moment when I usually STRONGLY recommend GitSync instead: https://medium.com/apache-airflow/shared-volumes-in-airflow-the-good-the-bad-and-the-ugly-22e9f681afca

It always depends on configuration and monitoring. I personally hit this issue, possibly on Airflow 2.1.x, and I do not know whether it was actually related to Airflow itself or to something else. Working with EFS definitely takes more effort than GitSync.

For anyone who finds this thread in the future while dealing with EFS performance degradation, this might help:

Disable saving Python bytecode inside the NFS (AWS EFS) mount:

  • Mount it read-only
  • Disable Python bytecode generation by setting PYTHONDONTWRITEBYTECODE=x
  • Or set the bytecode cache location by setting PYTHONPYCACHEPREFIX, for example to /tmp/pycaches (a quick sanity check is sketched below)
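If it helps, a quick check that the interpreter actually picked these settings up; run it with the same environment as the scheduler / DAG processor:

```python
# Sanity check for the bytecode-related environment variables.
import sys

# True when PYTHONDONTWRITEBYTECODE is set (or python is started with -B).
print("dont_write_bytecode:", sys.dont_write_bytecode)

# Mirrors PYTHONPYCACHEPREFIX (Python 3.8+), e.g. /tmp/pycaches; None means
# .pyc files land in __pycache__ next to the source, i.e. on the EFS mount.
print("pycache_prefix:", sys.pycache_prefix)
```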

Bursting throughput mode looks like a miracle at first, but once the burst capacity drops to zero it can turn your life into hell. Each newly created EFS share starts with about 2.1 TB of BurstCreditBalance.

What could be done here:

  • Switch to Provisioned Throughput mode permanently, which might cost a lot: something like 6 USD per MiB/s per month without VAT, so 100 MiB/s would cost more than 600 USD per month.
  • Switch to Provisioned Throughput mode only when BurstCreditBalance drops below some threshold, say 0.5 TB, and switch back once it climbs back toward the 2.1 TB limit. Unfortunately there is no autoscaling for this, so it has to be done manually or with a combination of CloudWatch alarms + AWS Lambda (a rough sketch follows).
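A rough sketch of what such a Lambda could look like, using boto3; the file system ID, thresholds, and provisioned rate are placeholders:

```python
# Rough Lambda sketch (boto3): switch an EFS file system between bursting and
# provisioned throughput based on its BurstCreditBalance CloudWatch metric.
from datetime import datetime, timedelta, timezone

import boto3

FILE_SYSTEM_ID = "fs-0123456789abcdef0"  # placeholder
LOW_CREDITS = 0.5 * 2**40                # ~0.5 TB in bytes
HIGH_CREDITS = 2.0 * 2**40               # ~2 TB in bytes

efs = boto3.client("efs")
cloudwatch = boto3.client("cloudwatch")


def current_burst_credit_balance() -> float:
    """Latest BurstCreditBalance (bytes) reported for the file system."""
    now = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EFS",
        MetricName="BurstCreditBalance",
        Dimensions=[{"Name": "FileSystemId", "Value": FILE_SYSTEM_ID}],
        StartTime=now - timedelta(minutes=15),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    datapoints = sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])
    return datapoints[-1]["Average"] if datapoints else HIGH_CREDITS


def lambda_handler(event, context):
    fs = efs.describe_file_systems(FileSystemId=FILE_SYSTEM_ID)["FileSystems"][0]
    mode = fs["ThroughputMode"]
    credits = current_burst_credit_balance()
    if mode == "bursting" and credits < LOW_CREDITS:
        # Note: EFS only allows changing the throughput mode once every 24 hours.
        efs.update_file_system(
            FileSystemId=FILE_SYSTEM_ID,
            ThroughputMode="provisioned",
            ProvisionedThroughputInMibps=100,  # placeholder rate
        )
    elif mode == "provisioned" and credits > HIGH_CREDITS:
        efs.update_file_system(FileSystemId=FILE_SYSTEM_ID, ThroughputMode="bursting")
    return {"mode": mode, "burst_credit_balance": credits}
```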
