ingress-nginx: Sudden high memory usage.

NGINX Ingress controller version: Chart Revision v4.0.3 | Chart App Version 1.0.2 | Nginx version 1.19.9

Kubernetes version: 1.20.2-do.0

Environment:

  • Cloud provider or hardware configuration: Digital Ocean Managed Kubernetes

  • OS (e.g. from /etc/os-release): Pod runs Alpine Linux v3.14.2

  • Kernel (e.g. uname -a): 4.19.0-11-amd64

  • Install tools: N/A

  • Other: N/A

  • How was the ingress-nginx-controller installed:

    • ingress-nginx | ingress-controller | 2 | 2021-10-01 09:12:17.6727727 +0100 BST | deployed ingress-nginx-4.0.3 | 1.0.2
    • values.yaml (a rough sketch of the install command follows this environment section):
controller:
  admissionWebhooks:
    enabled: false
  config:
    compute-full-forwarded-for: true
    forwarded-for-header: CF-Connecting-IP
    proxy-real-ip-cidr: <cf-cidrs>
    use-forwarded-headers: true
    use-proxy-protocol: true
  extraArgs:
    default-ssl-certificate: <default_cert_location>
  extraInitContainers:
  - command:
    - sh
    - -c
    - sysctl -w net.core.somaxconn=32768; sysctl -w net.ipv4.ip_local_port_range='1024 65000'
    image: alpine:3.13
    name: sysctl
    securityContext:
      privileged: true
  hostPort:
    enabled: true
  kind: DaemonSet
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
  service:
    type: ClusterIP
  watchIngressWithoutClass: true
  • Current State of the controller: Perfectly fine until the issue arises.
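
For completeness, the chart above gets applied with a plain Helm upgrade. The command below is only a sketch: the release name, namespace, and values file path are illustrative placeholders, not copied from our pipeline.

# Illustrative only: release name, namespace and values file path are placeholders
helm upgrade --install ingress-controller ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --version 4.0.3 \
  -f values.yaml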

What happened: Nginx ingress pods suddenly go from ~500MB of RAM usage on average (stable for multiple days at a time) to slowly climbing to 1.4GB over the span of a few hours, until they are SystemOOM-killed due to a lack of available memory on the node they were assigned to.

Our connections and other metrics do not spike out of the ordinary during this time; it is simply memory usage climbing and then dropping. There is no pattern to the crashes: we can go days without the issue and then it happens all of a sudden, typically after anywhere from 2 to 9 days. Once the issue arises, we can expect the other controllers to follow suit, usually an hour or two after one another.

Our pod logs are spammed with request logs, so we are unable to tell whether nginx itself is logging any errors. When we view our pod logs we usually only see the last minute or so, sometimes only the last 40 seconds if we manage to save the logs at exactly the right moment. This makes it hard to know whether nginx throws errors during the spike and the crash. Any recommendations for making this easier would be highly appreciated.
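
One option we are considering to make this easier (only a sketch using the documented controller ConfigMap keys; the exact values are our guess, not something we have rolled out) is to quiet the access logs so that error-level messages are actually visible:

controller:
  config:
    disable-access-log: true   # assumption: drop per-request access logging entirely
    error-log-level: warn      # assumption: keep nginx error logging, drop notice-level noise

On top of that, kubectl logs --previous can recover output from the previous container instance after a restart, which may help capture whatever nginx printed right before the kill.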

Just in case this was a core dump issue (https://github.com/kubernetes/ingress-nginx/issues/6896), we upgraded to v1.0.2 (https://github.com/kubernetes/ingress-nginx/releases/tag/controller-v1.0.2). The issue then showed itself within 32 hours of the deploy, when we were at 15,000 concurrent open websockets across 3 controller pods. We typically do not see the issue on a weekend, when our peak is 15,000; it usually appears around the 10,000 mark, which is a typical mid-week peak.
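
In case it helps whoever picks this up, these are the sort of read-only checks we can run while a spike is in progress. The namespace and pod name below are placeholders, not our real ones, and the second command assumes the usual busybox tooling in the Alpine-based controller image.

# Per-pod memory as reported by metrics-server
kubectl top pod -n ingress-nginx

# One-shot process snapshot inside the affected controller pod (placeholder pod name)
kubectl exec -n ingress-nginx ingress-controller-xxxxx -- top -b -n1 | head -n 15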

Controller 1 that died: [memory usage screenshot]

Controller 2 & 3 that started following the same pattern: [memory usage screenshots]

Note: This has been discussed with @strongjz in the Kubernetes #ingress-nginx-users channel on Slack: https://kubernetes.slack.com/archives/CANQGM8BA/p1632951733103200

What you expected to happen: Memory to not randomly spike and crash, regardless of whether we have 6,000, 10,000, or 25,000 connections.

How to reproduce it: I have no idea what triggers the issue, so unfortunately I am unable to give exact reproduction steps.

/kind bug

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 15 (4 by maintainers)

Most upvoted comments

I’m going to save the triage robot some time and close this issue now, as I haven’t come across this issue since we updated to 1.0.4 as mentioned here. If anyone has issues in the future, please open your own issue and link back to this one if you feel it’s the same problem.

Upgrade finished just now; will let you know the results, hopefully within the next 7 days. 🤞