ingress-nginx: Nginx OOM

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/.): yes

What keywords did you search in NGINX Ingress controller issues before filing this one? (If you have found any duplicates, you should instead reply there.): memory, oom


Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT

NGINX Ingress controller version: 0.20.0 and 0.19.0

Kubernetes version (use kubectl version): v1.11.4

Environment:

  • Cloud provider or hardware configuration: 128 GB RAM; 20 CPUs
  • OS (e.g. from /etc/os-release): RHEL 7.5
  • Kernel (e.g. uname -a): 3.10.0
  • Install tools:
  • Others:

What happened: NGINX's memory usage grows slowly at first and then suddenly climbs rapidly until the Pod gets OOM-killed by Kubernetes. Inside the Pod I can see that the memory is used by nginx itself.

Here is the memory graph of one Pod over the last 24h: [memory graph image]

All Pods over the last 24h: [memory graph image]
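
For reference, roughly how one can check which processes hold the memory inside the Pod (pod name and namespace are placeholders; kubectl top needs a metrics pipeline such as Heapster or metrics-server):

  # per-Pod memory as seen by Kubernetes
  kubectl -n ingress-nginx top pod

  # inside a controller Pod: resident set size of each process
  kubectl -n ingress-nginx exec <controller-pod> -- sh -c \
    'for p in /proc/[0-9]*; do echo "$(cat $p/comm): $(grep VmRSS $p/status 2>/dev/null)"; done'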

In the log of one of the crashed Pods I see this:

2018/10/28 23:44:13 [alert] 63#63: worker process 96931 exited on signal 9
2018/10/28 23:44:14 [alert] 63#63: worker process 96964 exited on signal 9
2018/10/28 23:44:14 [alert] 63#63: worker process 96997 exited on signal 9
2018/10/28 23:44:14 [alert] 63#63: worker process 96998 exited on signal 9
2018/10/28 23:44:14 [alert] 63#63: worker process 97064 exited on signal 9
2018/10/28 23:44:15 [error] 97065#97065: *754462 lua entry thread aborted: memory allocation error: not enough memory
stack traceback:
coroutine 0:
        [C]: in function 'ffi_str'
        /usr/local/lib/lua/resty/core/shdict.lua:242: in function 'get_backends_data'
        /etc/nginx/lua/balancer.lua:100: in function </etc/nginx/lua/balancer.lua:99>, context: ngx.timer
2018/10/28 23:44:15 [alert] 63#63: worker process 97131 exited on signal 9
2018/10/28 23:44:15 [alert] 63#63: worker process 97065 exited on signal 9
2018/10/28 23:44:16 [alert] 63#63: worker process 97229 exited on signal 9
2018/10/28 23:44:16 [alert] 63#63: worker process 97164 exited on signal 9
2018/10/28 23:44:16 [alert] 63#63: worker process 97295 exited on signal 9
2018/10/28 23:44:16 [alert] 63#63: worker process 97130 exited on signal 9
2018/10/28 23:44:17 [alert] 63#63: worker process 97262 exited on signal 9

With version 0.17.1 and dynamic configuration in the same cluster we had no such problems. 0.19.0 showed the same OOM behavior; I didn't try 0.18.

What you expected to happen: No OOMs

How to reproduce it (as minimally and precisely as possible): I don't know what triggers it; it doesn't happen on other, smaller clusters in our environment. The affected cluster has 1121 Ingresses, 1384 Services and 1808 Pods.

Anything else we need to know: NGINX ConfigMap:

  upstream-max-fails: "3"
  upstream-fail-timeout: "5"
  proxy-read-timeout: "300"
  proxy-send-timeout: "300"
  proxy-connect-timeout: "10"
  use-gzip: "false"
  client-body-buffer-size: "64k"
  server-tokens: "false"
  proxy-body-size: "20m"
  client_max_body_size: "20m"
  worker-shutdown-timeout: "300s"
  worker-processes: "4"

Controller flags:

      containers:
      - args:
        - /nginx-ingress-controller
        - --default-ssl-certificate=$(POD_NAMESPACE)/ingress-cert
        - --configmap=$(POD_NAMESPACE)/nginx-ingress-conf
        - --enable-ssl-chain-completion=false
        - --update-status=false
        - --publish-service=$(POD_NAMESPACE)/nginx-ingress

How can I debug this? Can I somehow see how much memory the Lua module uses?
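
One idea (just a sketch, untested): expose a tiny debug endpoint through the http-snippet ConfigMap key, assuming the shared dict that balancer.lua reads backends from is called configuration_data and that this build's shdict API has free_space()/capacity():

  http-snippet: |
    server {
      listen 18080;
      location /debug/lua-mem {
        content_by_lua_block {
          -- memory currently held by this worker's Lua VM, in KB
          ngx.say("lua VM KB: ", collectgarbage("count"))
          -- fill level of the shared dict behind get_backends_data()
          local dict = ngx.shared.configuration_data
          if dict and dict.free_space then
            ngx.say("configuration_data free bytes: ", dict:free_space())
            ngx.say("configuration_data capacity bytes: ", dict:capacity())
          end
        }
      }
    }

Each worker has its own Lua VM, so the first number depends on which worker answers the request.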

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Reactions: 5
  • Comments: 39 (21 by maintainers)

Most upvoted comments

We use nginx-ingress-controller 0.23.4

Please update to 0.32.0. You are using a version released on Feb 27, 2019. There have been a lot of fixes related to reloads since then, as well as multiple NGINX updates.
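
(For reference, the running build can be confirmed from inside a controller Pod; the pod name is a placeholder:)

  kubectl -n ingress-nginx exec <controller-pod> -- /nginx-ingress-controller --version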

After about 5 minutes of constant load (~9000 req/sec), 0.20.0 starts leaking memory at a rate of about 1-1.5 GB/hour.

After a day of such load the nginx-ingress-controller process successfully consumes all remaining RAM and gets killed and restarted by Kubernetes. My NGINX ingress, upstream server, and load runner are all 4-CPU, 32 GB machines.

--enable-dynamic-configuration=false completely heals the leak (RAM is stable at about 200 MB), but ingress metrics are gone as well (see #3053).

If you are happy to sacrifice the metrics, don't forget to raise the error log level to error so the logs aren't polluted with warnings from the Lua monitoring script.
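
For reference, a minimal sketch of how that workaround maps onto the deployment and ConfigMap shown above (the ConfigMap key for the log level is error-log-level; the --enable-dynamic-configuration flag only exists on older releases such as 0.20.x, if I remember correctly newer releases removed it):

      containers:
      - args:
        - /nginx-ingress-controller
        - --configmap=$(POD_NAMESPACE)/nginx-ingress-conf
        - --enable-dynamic-configuration=false   # disables the Lua-based dynamic backends (and their metrics)

and in the ConfigMap:

  error-log-level: "error"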