airflow: Scheduler Memory Leak in Airflow 2.0.1
Apache Airflow version: 2.0.1
Kubernetes version (if you are using kubernetes) (use kubectl version): v1.17.4
Environment: Dev
- OS (e.g. from /etc/os-release): RHEL7
What happened:
After running fine for some time, my Airflow tasks got stuck in the scheduled state with the below error in Task Instance Details: “All dependencies are met but the task instance is not running. In most cases this just means that the task will probably be scheduled soon unless: - The scheduler is down or under heavy load If this task instance does not start soon please contact your Airflow administrator for assistance.”
What you expected to happen:
I restarted the scheduler and then it started working fine. When I checked my metrics I realized the scheduler has a memory leak; over the past 4 days it has reached up to 6 GB of memory utilization.
In versions 2.0+ we no longer have the run_duration config option to restart the scheduler periodically as a workaround until this issue is resolved.
How to reproduce it: I saw this issue in multiple dev instances of mine, all running Airflow 2.0.1 on Kubernetes with the KubernetesExecutor. Below are the configs that I changed from the defaults (shown as an airflow.cfg sketch after this paragraph): max_active_dag_runs_per_dag=32, parallelism=64, dag_concurrency=32, sql_alchemy_pool_size=50, sql_alchemy_max_overflow=30
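For context, a minimal airflow.cfg sketch of those overrides as they would look on Airflow 2.0.x. Option names are normalized to lowercase, and max_active_dag_runs_per_dag is assumed to mean the actual option max_active_runs_per_dag; on 2.0.x all of these sat under [core] (the SQLAlchemy pool settings moved to a separate section in later releases):

```ini
[core]
# Non-default values from the report above, normalized to the option names
# Airflow 2.0.x expects; adjust if your config layout differs.
parallelism = 64
dag_concurrency = 32
max_active_runs_per_dag = 32
sql_alchemy_pool_size = 50
sql_alchemy_max_overflow = 30
```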
Anything else we need to know:
The scheduler memory leak occurs consistently in all instances I have been running; the scheduler's memory utilization keeps growing.
About this issue
- State: closed
- Created 3 years ago
- Reactions: 11
- Comments: 67 (46 by maintainers)
Commits related to this issue
- Fixes memory leak in scheduler The FileProcessorHandler is instantiated once and assigned a FileHandler to write logs to files named based on context. Apparently there are cases where single FileHan... — committed to potiuk/airflow by potiuk 3 years ago
- Advises the kernel to not cache log files generated by Airflow Extends the standard python logging.FileHandler with advise to the Kernel to not cache the file in PageCache when it is written. While t... — committed to potiuk/airflow by potiuk 3 years ago
- Advises the kernel to not cache log files generated by Airflow (#18054) * Advises the kernel to not cache log files generated by Airflow Extends the standard python logging.FileHandler with advise... — committed to apache/airflow by potiuk 3 years ago
- Advises the kernel to not cache log files generated by Airflow (#18054) * Advises the kernel to not cache log files generated by Airflow Extends the standard python logging.FileHandler with advise t... — committed to apache/airflow by potiuk 3 years ago
@potiuk the last fix works as it should
@potiuk I use Helm 1.6.0 and Airflow 2.2.5. Why does memory continuously increase? It happens for both the scheduler and the triggerer, but not the webserver.
Please keep this thread on topic with the scheduler memory issue. For usage questions, please open threads in Discussions instead.
I have the same problem. I looked through all the channels and methods but did not solve it! Thanks for your question, and let me know what the problem is. I have to use a crontab script to restart the scheduler process regularly. But this is stupid. If it can't be solved, I can only fall back to 1.10.
What kind of memory is it? See the whole thread. There are different kinds of memory, and you might be observing cache memory growth for whatever reason.
Depending on the type of memory it might or might not be a problem. But you need to investigate it in detail. No one is able to diagnose it without you investigating based on the thread.
The thread has all the relevant information. You need to see what process is leaking - whether it is Airflow, the system, or some other process.
BTW, I suggest you open a new discussion with all the details. There is little value in commenting on a closed issue. Remember also that this is a free forum where people help when they can, and their help is much more effective if you give all the information and show that you've done your part. There are also companies offering paid Airflow support, and they can likely do the investigation for you.
@potiuk Ok, we will deploy to the test stand today
So I think that the file handler will be closed by the GC finalizer (since Python uses both ref counting and a periodic GC to detect loops), so assigning self.handler to something else should GC the old handler and the old FileHandler. Should. But it is entirely possible that the logging framework has another reference to it hanging around somewhere.
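To illustrate that caveat, a minimal Linux-only sketch (not Airflow code, just the standard logging module) showing that rebinding a name does not close the old FileHandler while the logger still references it; only an explicit close releases the file:

```python
import gc
import logging
import os


def open_paths() -> set:
    """Files this process currently has open, via /proc (Linux only)."""
    fd_dir = f"/proc/{os.getpid()}/fd"
    paths = set()
    for fd in os.listdir(fd_dir):
        try:
            paths.add(os.readlink(os.path.join(fd_dir, fd)))
        except OSError:
            pass  # fd vanished between listdir and readlink
    return paths


logger = logging.getLogger("demo")

# FileHandler opens its log file immediately (delay=False by default).
handler = logging.FileHandler("/tmp/old_task.log")
logger.addHandler(handler)

# Rebinding the local name (the equivalent of assigning self.handler to
# something else) does not free the old handler: logger.handlers still
# references it, so neither refcounting nor gc.collect() closes the file.
handler = logging.FileHandler("/tmp/new_task.log")
gc.collect()
print("/tmp/old_task.log" in open_paths())  # True - still open

# Detaching and closing the old handler is what actually releases it.
old = logger.handlers[0]
logger.removeHandler(old)
old.close()
print("/tmp/old_task.log" in open_paths())  # False
```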
@potiuk I was just planning to test a similar fix on our installation, since by examining the code I could not find where the file is being closed.
As far as I know, no optimizations were made, and the problem is not with the logs of the dag processor manager; the problem is with the logs of DagFileProcessorProcess, where the default airflow.utils.log.file_processor_handler.FileProcessorHandler is used without any modifications or custom configuration.
At the moment I'm not sure that the problem is in the “dirty” memory, but I will check tomorrow.
Sorry, that was wrong; fixed.
When I was digging into a similar issue I couldn't see the memory attributed to any particular process – only the whole container via working_set_bytes. I was testing/looking in ps and all of the counters I could see in /proc/<pid>/, but didn't see any memory growth reflected in any of those. So that led me to believe the problem was not a traditional memory leak from the Python code, but something OS related.
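For anyone reproducing that kind of check, a rough sketch (Linux, assuming cgroup v1 with the memory controller mounted at the usual path inside the container; cgroup v2 uses different files and keys) that compares per-process RSS from /proc with the container-level page-cache counter that feeds metrics like working_set_bytes:

```python
import os


def rss_kib(pid: int) -> int:
    """Resident set size of one process in kB, from /proc/<pid>/status (Linux)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0


def cgroup_memory_stat(path: str = "/sys/fs/cgroup/memory/memory.stat") -> dict:
    """Container-level counters (cgroup v1); 'cache' is page cache, not heap."""
    stats = {}
    with open(path) as f:
        for line in f:
            key, value = line.split()
            stats[key] = int(value)
    return stats


if __name__ == "__main__":
    # If the process's own RSS stays flat while the cgroup 'cache' counter
    # keeps growing, the growth is page cache (e.g. written log files),
    # not a leak inside the Python process itself.
    print("this process RSS (kB):", rss_kib(os.getpid()))
    stats = cgroup_memory_stat()
    print("cgroup cache (bytes):", stats.get("cache"))
    print("cgroup rss (bytes):", stats.get("rss"))
```

Run it inside the scheduler container (or point rss_kib at the scheduler PID) and sample it over time.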
I did nothing but a rm and it dropped almost immediately (sorry, the memory graph comes from Prometheus and there is a delay, but what I can tell you is that it dropped within 15s after I did the rm). I just did it on my dev instance, in fact, with the same result. The container is the scheduler; I run a separate container for each service, and this is the one with the command “airflow scheduler -n -1”. I don't think the scheduler restarted by itself; if that had been the case then Kubernetes would have shown a failed container and restarted it, and that's not the case. I can tell you that the type of memory that is shown to grow is the one from the Prometheus metric called container_memory_cache.
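That observation lines up with the committed fix above (“advises the kernel to not cache log files generated by Airflow”): the growing number is page cache held for written log files, not process heap. A minimal sketch of that idea (not the actual Airflow handler, just the underlying technique, assuming Linux and Python 3.3+):

```python
import logging
import os


class NonCachingFileHandler(logging.FileHandler):
    """FileHandler variant that hints the kernel to drop written log data
    from the page cache. Sketch of the approach, not Airflow's implementation."""

    def emit(self, record: logging.LogRecord) -> None:
        super().emit(record)
        try:
            # Flush Python's buffer so the pages can actually be dropped,
            # then advise the kernel that we will not re-read this file.
            self.flush()
            os.posix_fadvise(self.stream.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
        except (OSError, ValueError, AttributeError):
            # Best effort: stream may be closed/None, or fadvise unavailable.
            pass
```

POSIX_FADV_DONTNEED is only a hint and only affects pages that have already been written back, so it reduces page-cache growth rather than eliminating it.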
In my production env:
Linux release: Debian; Airflow version: 2.0.1; executor = CeleryExecutor; max_active_dag_runs_per_dag=32; parallelism=32; dag_concurrency=16; sql_alchemy_pool_size=16; sql_alchemy_max_overflow=16
About 3 workers, 40 DAGs, 1000 tasks. Many tasks sometimes keep the scheduled status and cannot start running. When I call a cron script to restart the process every hour the problem is solved, but not completely. I use CeleryExecutor in the project and also have this problem: after the scheduler runs, VSZ and RSS keep growing.