ingress-nginx: Nginx OOM
Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/.): yes
What keywords did you search in NGINX Ingress controller issues before filing this one? (If you have found any duplicates, you should instead reply there.): memory, oom
Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT
NGINX Ingress controller version: 0.20.0 and 0.19.0
Kubernetes version (use kubectl version): v1.11.4
Environment:
- Cloud provider or hardware configuration: 128 GB RAM; 20 CPUs
- OS (e.g. from /etc/os-release): RHEL 7.5
- Kernel (e.g. uname -a): 3.10.0
- Install tools:
- Others:
What happened: Nginx's memory usage grows slowly at first and then suddenly climbs rapidly until the Pod gets OOM-killed by Kubernetes. Inside the Pod I can see that the memory is used by nginx itself.
Here is the memory graph of one Pod over the last 24h:
All Pods over the last 24h:
In the log of one of the crashed Pods I see this:
2018/10/28 23:44:13 [alert] 63#63: worker process 96931 exited on signal 9
2018/10/28 23:44:14 [alert] 63#63: worker process 96964 exited on signal 9
2018/10/28 23:44:14 [alert] 63#63: worker process 96997 exited on signal 9
2018/10/28 23:44:14 [alert] 63#63: worker process 96998 exited on signal 9
2018/10/28 23:44:14 [alert] 63#63: worker process 97064 exited on signal 9
2018/10/28 23:44:15 [error] 97065#97065: *754462 lua entry thread aborted: memory allocation error: not enough memory
stack traceback:
coroutine 0:
[C]: in function 'ffi_str'
/usr/local/lib/lua/resty/core/shdict.lua:242: in function 'get_backends_data'
/etc/nginx/lua/balancer.lua:100: in function </etc/nginx/lua/balancer.lua:99>, context: ngx.timer
2018/10/28 23:44:15 [alert] 63#63: worker process 97131 exited on signal 9
2018/10/28 23:44:15 [alert] 63#63: worker process 97065 exited on signal 9
2018/10/28 23:44:16 [alert] 63#63: worker process 97229 exited on signal 9
2018/10/28 23:44:16 [alert] 63#63: worker process 97164 exited on signal 9
2018/10/28 23:44:16 [alert] 63#63: worker process 97295 exited on signal 9
2018/10/28 23:44:16 [alert] 63#63: worker process 97130 exited on signal 9
2018/10/28 23:44:17 [alert] 63#63: worker process 97262 exited on signal 9
With version 0.17.1 and dynamic configuration in the same cluster we had no such problems. 0.19.0 showed the same OOM behavior; I didn't try 0.18.
What you expected to happen: No OOMs
How to reproduce it (as minimally and precisely as possible): I don't know what causes it; it doesn't happen on other, smaller clusters in our environment. On the affected cluster we have 1121 Ingresses, 1384 Services and 1808 Pods.
Anything else we need to know: Nginx Config Map:
upstream-max-fails: "3"
upstream-fail-timeout: "5"
proxy-read-timeout: "300"
proxy-send-timeout: "300"
proxy-connect-timeout: "10"
use-gzip: "false"
client-body-buffer-size: "64k"
server-tokens: "false"
proxy-body-size: "20m"
client_max_body_size: "20m"
worker-shutdown-timeout: "300s"
worker-processes: "4"
Controller flags:
containers:
- args:
- /nginx-ingress-controller
- --default-ssl-certificate=$(POD_NAMESPACE)/ingress-cert
- --configmap=$(POD_NAMESPACE)/nginx-ingress-conf
- --enable-ssl-chain-completion=false
- --update-status=false
- --publish-service=$(POD_NAMESPACE)/nginx-ingress
How can I debug this? Can I somehow see how much memory the lua module uses?
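A rough way to at least see where the memory is going is to dump the per-process RSS inside the controller pod and the Lua shared dictionary sizes from the rendered nginx config. This is a minimal sketch, not an official procedure; it assumes the pod has a POSIX shell, and the ingress-nginx namespace and <controller-pod> name are placeholders:
# RSS of every process in the pod (nginx workers vs. the Go controller process)
kubectl -n ingress-nginx exec <controller-pod> -- sh -c 'for d in /proc/[0-9]*; do echo "$(cat $d/comm) $(grep VmRSS $d/status)"; done'
# Sizes of the Lua shared dictionaries (lua_shared_dict) as rendered into the nginx config
kubectl -n ingress-nginx exec <controller-pod> -- grep lua_shared_dict /etc/nginx/nginx.conf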
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 5
- Comments: 39 (21 by maintainers)
Please update to 0.32.0. You are using a version released on Feb 27, 2019. There are a lot of fixes related to reloads and multiple NGINX updates since then.
After about 5 minutes of constant load (~9000 req/sec), 0.20.0 starts leaking memory at a rate of about 1-1.5 GB/hour.
After a day of such load the nginx-ingress-controller process consumes all remaining RAM and gets killed and restarted by Kubernetes. My Nginx ingress, upstream server and load runner are all 4-CPU, 32 GB machines.
Setting --enable-dynamic-configuration=false completely heals the leak (RAM is stable at about 200 MB), but the ingress metrics are gone as well (see #3053). If you are happy to sacrifice the metrics, don't forget to raise error-log-level to error so the logs are not polluted with warnings from the Lua monitor script.
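For reference, a minimal sketch of that workaround against the manifests shown above (the nginx-ingress-conf ConfigMap name and the controller args are taken from this issue; adjust them to your own deployment):
Controller container args (add the flag):
- --enable-dynamic-configuration=false
ConfigMap nginx-ingress-conf (add the key):
error-log-level: "error"
Keep in mind that with dynamic configuration disabled, backend and endpoint changes go back to triggering full nginx reloads, so weigh that cost against the leak.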