airflow: Airflow web UI is slow
Apache Airflow version:
1.10.10
Kubernetes version (if you are using kubernetes) (use kubectl version):
1.13.12
Environment:
- Cloud provider or hardware configuration: Azure
- OS (e.g. from /etc/os-release):
- Kernel (e.g. uname -a):
- Install tools:
- Others:
What happened:
Every HTTP request to the UI takes at least 5s, even for static content. The /admin/metrics/ and /health endpoints and the 404 page have the same problem.
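For anyone trying to reproduce this, here is a minimal sketch for timing individual requests from the command line, so the delay can be measured without a browser. The base URL in the usage note is an assumption, as is the helper name `time_request`; nothing here is part of Airflow.

```python
import time
import urllib.error
import urllib.request


def time_request(url, timeout=30.0):
    """Return (status_code, elapsed_seconds) for a single GET request."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
            status = resp.status
    except urllib.error.HTTPError as err:
        # Error pages (e.g. 404) are timed too, since the report says they are slow as well.
        err.read()
        status = err.code
    return status, time.perf_counter() - start


def report(base_url, paths):
    """Print one timing line per path (hypothetical helper for illustration)."""
    for path in paths:
        status, elapsed = time_request(base_url + path)
        print(f"{path}: HTTP {status} in {elapsed:.2f}s")
```

Usage would be something like `report("http://localhost:8080", ["/health", "/admin/metrics/", "/does-not-exist"])`, with the address replaced by wherever your webserver is reachable. If every path reports roughly the same time, the delay is probably not in the request handling itself.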
Here is a graph showing the CPU usage of all Airflow components:
Each container has a 1s limit (left Y axis) so none of them is currently CPU bound.
What you expected to happen:
How to reproduce it:
Anything else we need to know:
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 26 (6 by maintainers)
I fixed this by changing the default gunicorn worker class from sync to an asynchronous class, namely gevent. Please see the AWS thread below about the weird behaviour between gunicorn and ELBs.
https://forums.aws.amazon.com/thread.jspa?messageID=419138
So simply set AIRFLOW__WEBSERVER__WORKER_CLASS: “gevent” in your config and things should get better.
I had a similar issue with Airflow running on a Kubernetes cluster that I was, fortunately, able to solve.
Regular HTTP connections were taking a minimum of 5 seconds to complete. Even using curl to fetch static content, like a CSS file, was taking 5 seconds. The logs of the Airflow web process didn’t show anything unusual, although following them in real time showed each GET appearing in the logs with the same 5 second delay that the curl command was seeing.
Running curl directly from the container, and thus bypassing all the Kubernetes networking, still showed the delays.
It took me a few days to realize what was going on with my setup.
As it turned out, it was due to the ‘type: LoadBalancer’ service I was using to expose the Airflow webserver outside the cluster. The load balancer was an external network load balancer that connected to the service via a NodePort on each virtual machine in the cluster. For whatever reason, this meant that a large number of connections were kept open to the webserver at all times.
In a 20 node cluster this meant 20 connections.
So when I killed the LoadBalancer service and started using nginx-ingress instead, the problem instantly resolved itself. No more delays. The admin web UI went back to normal.
I am not exactly sure what was going on here, but I suspect that having a large number of connections always open was causing the gunicorn process to delay routing new connections to the pool of webserver worker processes. I was only using 4 processes at the time.
So if you are seeing these strange 5 second delays, use netstat or a similar tool to count the number of “ESTABLISHED” connections to the webserver process. If you have a lot of connections and you are using a ‘type: LoadBalancer’ service, try switching to an ingress controller. Increasing the number of worker processes to exceed the number of established connections will probably work too.
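The netstat check above could be scripted roughly like this. It is only a sketch: it assumes a Linux `netstat` from net-tools whose `-tn` output has the usual six columns, and that the webserver listens on port 8080. The function names are made up for illustration.

```python
import subprocess


def count_established(netstat_output, port):
    """Count ESTABLISHED TCP connections whose local address ends in :port."""
    suffix = f":{port}"
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed columns: proto, recv-q, send-q, local address, foreign address, state
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local_addr, state = fields[3], fields[5]
        if local_addr.endswith(suffix) and state == "ESTABLISHED":
            count += 1
    return count


def established_on_port(port=8080):
    """Run `netstat -tn` inside the webserver container and count connections."""
    result = subprocess.run(
        ["netstat", "-tn"], capture_output=True, text=True, check=True
    )
    return count_established(result.stdout, port)
```

If `established_on_port()` returns a number at or above your gunicorn worker count while the UI is idle, that would be consistent with the load balancer holding connections open.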
Hope that helps.
It’s always DNS.
Even when you think it’s not DNS and you check for DNS problems, it’s always DNS problems.
@isabo-lululemon you may need to install gevent if you haven’t already:
pip install gevent
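A quick way to sanity-check this from inside the webserver container might look like the sketch below. It only inspects the environment variable mentioned earlier in the thread and falls back to "sync" when it is unset; a value set in airflow.cfg would not be seen by it. The function names are hypothetical.

```python
import importlib.util
import os


def webserver_worker_class(env=None):
    """Return the worker class from the environment, or "sync" if unset."""
    env = os.environ if env is None else env
    return env.get("AIRFLOW__WEBSERVER__WORKER_CLASS", "sync")


def gevent_available():
    """True if the gevent package can be imported in this Python environment."""
    return importlib.util.find_spec("gevent") is not None


def check_gevent_setup(env=None):
    """Return a list of human-readable problems; empty means nothing was found."""
    problems = []
    if webserver_worker_class(env) != "gevent":
        problems.append("worker class is not set to gevent in the environment")
    if not gevent_available():
        problems.append("gevent is not installed (pip install gevent)")
    return problems
```

Calling `check_gevent_setup()` and printing the result before restarting the webserver shows whether either piece is missing.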
I am also hitting the exact same issue. Do we have an update on this? @ashb @sylr
FWIW: I recently encountered a similar issue, which could be traced to slow HTTP responses when requesting the /dag_stats and /task_stats resources. The culprit is the database query, which under certain circumstances (namely a large number of rows and a large number of active DAGs) shows a) long response times and b) little room for optimization. This situation is hard to fix, though: DB queries are done through the ORM, which in turn does a poor job of writing SQL under the circumstances above.
I think in my case it was also a DNS issue, since when I use the IP instead of the URL the UI becomes very fast.
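To check whether name resolution accounts for the difference between hostname and IP, one could time the lookup on its own. A minimal sketch, where the hostname in the usage note is a placeholder and `time_lookup` is a made-up helper:

```python
import socket
import time


def time_lookup(hostname, port=80, resolver=socket.getaddrinfo):
    """Return (sorted unique addresses, elapsed_seconds) for one name lookup."""
    start = time.perf_counter()
    results = resolver(hostname, port)
    elapsed = time.perf_counter() - start
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    addresses = sorted({info[4][0] for info in results})
    return addresses, elapsed
```

For example, `time_lookup("airflow.example.internal", 8080)` with your real hostname. If that call alone takes several seconds, the delay is happening before any request reaches the webserver.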
Honestly, it looks like some kind of connectivity/configuration problem with your browser or system (and it’s very similar to the original report from @sylr). If you look at @sylr’s screenshot, not only does the static content take ~5s to load, it takes just as long when it is already “cached”. Also, the times are all very similar to each other, so to me it looks like an issue with either the browser or something on the way, but not Airflow itself.
I’d say you should look at some of your deployment configuration.
To verify this, just take a look at the webserver logs. By default they should show all the requests the webserver receives and how long each took to process.
The 5/10 second delays do indeed look suspiciously like a DNS timeout. One thing that COULD happen is a DNS timeout when the server or your proxy tries to run a reverse-DNS query on your IP address. I’ve seen this happen when the server was trying to put names against the IP addresses in the logs but the reverse DNS call hung until it timed out. Disabling the IP -> name mapping wherever it is being run should help.
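One way to test the reverse-DNS theory is to time a reverse lookup of the client IP from the machine that runs the webserver or proxy. This is just a sketch using the standard library resolver; `time_reverse_lookup` is a hypothetical helper, and the IP you pass in is whatever address your client appears as.

```python
import socket
import time


def time_reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return (hostname or None, elapsed_seconds) for one reverse-DNS lookup."""
    start = time.perf_counter()
    try:
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist)
        name = resolver(ip)[0]
    except (socket.herror, socket.gaierror):
        # No PTR record or resolver failure; the elapsed time is still informative.
        name = None
    return name, time.perf_counter() - start
```

If the call takes about 5 seconds and then returns no name, a resolver timeout on the reverse lookup is a plausible source of the delay.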
Quoting @ashb : “It’s DNS. It’s always DNS.”