ingress-nginx: Regular crash of nginx ingress controller pods after upgrade to 0.20.0 image

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/.):

What keywords did you search in NGINX Ingress controller issues before filing this one? (If you have found any duplicates, you should instead reply there.):


Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT

NGINX Ingress controller version: 0.20.0

Kubernetes version (use kubectl version): 1.9.6

Environment: Production

  • Cloud provider or hardware configuration: Azure
  • OS (e.g. from /etc/os-release): Ubuntu 16.04
  • Kernel (e.g. uname -a): Linux k8s-master-81594228-1 4.13.0-1012-azure #15-Ubuntu SMP Thu Mar 8 10:47:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: acs engine, ansible, terraform
  • Others:

What happened: After upgrading the nginx ingress controller from 0.15.0 to 0.20.0, the controller pods crash regularly after several timeouts on the liveness probe. The nginx ingress controller pods run on separate VMs from all other pods. We need version 0.20.0 because we want to set use-forwarded-headers: "false" in the nginx ConfigMap to close the security hole where a user forges the forwarded headers to bypass the nginx whitelist.
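For reference, this is the ConfigMap change we want to apply. A minimal sketch, assuming the standard deploy manifests where the ConfigMap is named nginx-configuration in the ingress-nginx namespace (adjust both to your installation):

apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-configuration   # assumed name; must match the controller's --configmap flag
  namespace: ingress-nginx    # assumed namespace
data:
  use-forwarded-headers: "false"   # do not trust client-supplied X-Forwarded-* headers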

What you expected to happen: Stable behavior of nginx ingress controller pods as in version 0.15.0.

How to reproduce it (as minimally and precisely as possible): Update the image quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.15.0 to quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.20.0 on an existing nginx ingress controller deployment.
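Concretely, the only change is the image tag on the existing Deployment. A minimal sketch, assuming the Deployment and container are named nginx-ingress-controller as in the standard manifests:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-ingress-controller   # assumed name
  namespace: ingress-nginx         # assumed namespace
spec:
  template:
    spec:
      containers:
        - name: nginx-ingress-controller
          # bumping this tag from 0.15.0 to 0.20.0 is enough to trigger the crashes
          image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.20.0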

Anything else we need to know: Logs from the events:

2018-11-21 17:24:25 +0100 CET   2018-11-21 17:23:05 +0100 CET   6   nginx-ingress-controller-7d47db4569-9bxtz.1569303cf3aebbba   Pod   spec.containers{nginx-ingress-controller}   Warning   Unhealthy   kubelet, k8s-dmz-81594228-0   Liveness probe failed: Get http://xx.xx.xx.xx:10254/healthz:: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2018-11-21 17:24:26 +0100 CET   2018-11-21 17:24:26 +0100 CET   1   nginx-ingress-controller-7d47db4569-9bxtz.1569304fae92655c   Pod   spec.containers{nginx-ingress-controller}   Normal   Killing   kubelet, k8s-dmz-81594228-0   Killing container with id docker://nginx-ingress-controller:Container failed liveness probe… Container will be killed and recreated.

We have tried increasing timeoutSeconds on the liveness probe to 4s and also adding - --enable-dynamic-configuration=false to the nginx deployment. With this configuration the number of timeouts decreased, but once the apps on the platform put some load on the controller, the timeouts become frequent again.
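For reference, the two workarounds look roughly like this in the controller Deployment. This is a sketch assuming the default container name, probe path, and port from the standard manifests, not our exact manifest:

containers:
  - name: nginx-ingress-controller
    args:
      - /nginx-ingress-controller
      - --configmap=$(POD_NAMESPACE)/nginx-configuration
      - --enable-dynamic-configuration=false   # workaround: fall back to static nginx.conf reloads
    livenessProbe:
      httpGet:
        path: /healthz
        port: 10254
        scheme: HTTP
      initialDelaySeconds: 10
      timeoutSeconds: 4   # raised from the Kubernetes default of 1s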

Logs from the nginx pods in debug mode with a 3s liveness probe timeout:

{"log":"E1121 13:30:34.808413 5 controller.go:232] Error getting ConfigMap \"kube-system/udp-services\": no object matching key \"kube-system/udp-services\" in local store\n","stream":"stderr","time":"2018-11-21T13:30:34.818557076Z"}
{"log":"I1121 13:30:37.500168 5 main.go:158] Received SIGTERM, shutting down\n","stream":"stderr","time":"2018-11-21T13:30:37.501123038Z"}
{"log":"I1121 13:30:37.500229 5 nginx.go:340] Shutting down controller queues\n","stream":"stderr","time":"2018-11-21T13:30:37.501167238Z"}
{"log":"I1121 13:30:37.500276 5 nginx.go:348] Stopping NGINX process\n","stream":"stderr","time":"2018-11-21T13:30:37.501203538Z"}

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 3
  • Comments: 28 (22 by maintainers)

Most upvoted comments

To those affected by this issue:

Please help us test a fix for this with https://github.com/kubernetes/ingress-nginx/pull/3684, using the image quay.io/kubernetes-ingress-controller/nginx-ingress-controller:dev.

The mentioned PR refactors the internal nginx server used for the health check and Lua configuration, replacing the TCP port with a unix socket.

@aledbf Thanks, I will start investigating now. On Friday we could resolve the issue by increasing the number of replicas, but it has started happening again. I will test this image.

@Globegitter maybe you can help us test this by running some scripts 😃

  1. Get information about the running pods:
kubectl get pod -o wide -n ingress-nginx
  2. SSH to a node in the cluster.
  3. Build a list with the pod IP addresses and run the following script:

POD_IPS="XX.XXX.XXX XX.XXX.XXX" # using the IPs from kubectl
while true; do
  sleep 1
  for x in $POD_IPS; do
    echo "[$(date)] http://$x:10254/healthz http status:" $(curl -I -o /dev/null -s -w "%{http_code}::%{time_namelookup}::%{time_connect}::%{time_starttransfer}::%{time_total}\n" http://$x:10254/healthz)
  done
done

This will print something like


[Fri Jan 18 15:47:33 UTC 2019] http://100.96.9.248:10254/healthz http status: 200::0.000029::0.000112::0.001546::0.001582
[Fri Jan 18 15:47:33 UTC 2019] http://100.96.3.200:10254/healthz http status: 200::0.000023::0.000817::0.002746::0.002782

This can help us test what @ElvinEfendi said in the previous comment.

Edit: maybe I should put this in a k8s Job to help debug this issue.
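A minimal sketch of such a Job, assuming the ingress-nginx namespace and an image that ships sh and curl; the Job name, image, and deadline are placeholders, and the pod IPs still need to be filled in by hand:

apiVersion: batch/v1
kind: Job
metadata:
  name: healthz-probe            # hypothetical name
  namespace: ingress-nginx       # assumed namespace
spec:
  backoffLimit: 0
  activeDeadlineSeconds: 600     # stop probing after 10 minutes so the Job terminates
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: healthz-probe
          image: curlimages/curl:latest   # any small image with sh and curl works
          env:
            - name: POD_IPS
              value: "XX.XXX.XXX XX.XXX.XXX"   # the controller pod IPs from kubectl
          command: ["/bin/sh", "-c"]
          args:
            - |
              while true; do
                sleep 1
                for x in $POD_IPS; do
                  echo "[$(date)] http://$x:10254/healthz http status:" \
                    $(curl -I -o /dev/null -s -w "%{http_code}::%{time_namelookup}::%{time_connect}::%{time_starttransfer}::%{time_total}\n" "http://$x:10254/healthz")
                done
              done

The output can then be followed with kubectl logs -f job/healthz-probe.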