ingress-nginx: Sudden high memory usage.
NGINX Ingress controller version: Chart Revision: v4.0.3 | Chart App Version: 1.0.2 | Nginx version: 1.19.9
Kubernetes version: 1.20.2-do.0
Environment:
- Cloud provider or hardware configuration: Digital Ocean Managed Kubernetes
- OS (e.g. from /etc/os-release): Pod runs Alpine Linux v3.14.2
- Kernel (e.g. uname -a): 4.19.0-11-amd64
- Install tools: N/A
- Other: N/A
- How was the ingress-nginx-controller installed:
  ingress-nginx | ingress-controller | 2 | 2021-10-01 09:12:17.6727727 +0100 BST | deployed | ingress-nginx-4.0.3 | 1.0.2
- values.yaml:
controller:
  admissionWebhooks:
    enabled: false
  config:
    compute-full-forwarded-for: true
    forwarded-for-header: CF-Connecting-IP
    proxy-real-ip-cidr: <cf-cidrs>
    use-forwarded-headers: true
    use-proxy-protocol: true
  extraArgs:
    default-ssl-certificate: <default_cert_location>
  extraInitContainers:
    - command:
        - sh
        - -c
        - sysctl -w net.core.somaxconn=32768; sysctl -w net.ipv4.ip_local_port_range='1024 65000'
      image: alpine:3.13
      name: sysctl
      securityContext:
        privileged: true
  hostPort:
    enabled: true
  kind: DaemonSet
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
  service:
    type: ClusterIP
  watchIngressWithoutClass: true
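For completeness, values like these would typically be applied with a helm upgrade along the following lines; the release name and namespace come from the helm list line above, while the repo alias and values file path are assumptions:

# Assumed apply step; the repo alias and values.yaml path are illustrative
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm upgrade ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-controller \
  --version 4.0.3 \
  -f values.yaml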
- Current State of the controller: Perfectly fine until the issue arises.
What happened: Nginx ingress pods suddenly go from ~500MB of RAM usage on average (sustained for multiple days at a time) to slowly climbing to 1.4GB over the span of a few hours, until the pod is SystemOOM-killed due to a lack of available memory on the node it was assigned to.
Our connections and other metrics do not spike out of the ordinary during this time; it is simply memory usage suddenly climbing and then dropping. There’s no pattern to the crashing: we can go days without the issue and then it happens all of a sudden, typically after anywhere from 2 to 9 days. Once the issue arises, we can expect the other controllers to follow suit, typically an hour or two after one another.
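As an aside, and not something confirmed in this report: the chart exposes controller.resources, so a memory limit would let the kubelet OOM-kill and restart just the controller container rather than the whole node hitting SystemOOM. A minimal sketch with illustrative numbers:

controller:
  resources:
    requests:
      memory: 512Mi
    limits:
      memory: 1536Mi  # illustrative cap near the observed ~1.4GB peak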
Our pod logs are spammed with request logs, so we are unable to see whether nginx itself is logging any errors. When we view our pod logs we are usually only able to see the last minute, sometimes only the last 40 seconds even when we time the saving of logs perfectly. This makes it hard to know whether nginx throws errors during the spike and the crash. Any recommendations for making this easier would be highly appreciated.
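One rough sketch for cutting the access-log noise so error-level output stands out, assuming the standard ingress-nginx ConfigMap option disable-access-log:

controller:
  config:
    # temporarily silence per-request access logs; error.log output is unaffected
    disable-access-log: "true"

If the container is OOM-killed and restarted in place (rather than the pod being rescheduled), kubectl logs --previous <pod> should also show the previous container's output leading up to the kill.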
Just in case this was a core dump issue (https://github.com/kubernetes/ingress-nginx/issues/6896), we upgraded to v1.0.2 (https://github.com/kubernetes/ingress-nginx/releases/tag/controller-v1.0.2). The issue then showed itself within 32 hours of the deploy, when we were at 15,000 concurrent open WebSockets across 3 controller pods. We typically do not see the issue on weekends, when our peak is 15,000; it usually appears around the 10,000 mark, which is a typical mid-week peak.
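One way to narrow down whether the growth sits in the nginx worker processes or in the controller process itself (a sketch; assumes metrics-server is available for kubectl top, that the image's busybox ps supports -o, and that <controller-pod> is a placeholder):

# per-pod usage as reported by metrics-server
kubectl -n ingress-controller top pod -l app.kubernetes.io/name=ingress-nginx

# per-process RSS inside one controller pod (busybox ps)
kubectl -n ingress-controller exec <controller-pod> -- ps -o pid,rss,vsz,args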
Controller 1 that died:

Controllers 2 & 3 that started following the same patterns:

Note: This has been discussed with @strongjz in the Kubernetes #ingress-nginx-users channel on Slack: https://kubernetes.slack.com/archives/CANQGM8BA/p1632951733103200
What you expected to happen: Memory not to randomly spike and crash, regardless of whether we have 6,000, 10,000, or 25,000 connections.
How to reproduce it: I have no idea how we are even triggering the issue, so sadly I am unable to give exact reproduction steps.
/kind bug
About this issue
- State: closed
- Created 3 years ago
- Comments: 15 (4 by maintainers)
I’m going to save the triage robot some time and close this issue now, as I haven’t come across this issue since we updated to 1.0.4 as mentioned here. If anyone has issues in the future, please open your own issue and link back to this one if you feel it’s the same problem.
Upgrade finished just now; will let you know the results, hopefully within the next 7 days. 🤞