ingress-nginx: Terminating pod causes network timeouts with GCP L4 LB and externalTrafficPolicy: Local
NGINX Ingress controller version: 0.34.1
Kubernetes version (use kubectl version): v1.16.13-gke.401
Environment:
- Cloud provider or hardware configuration: GKE
- Load Balancer: GCP TCP Load Balancer
- Extra Options:
  - `service.type: LoadBalancer`
  - `service.externalTrafficPolicy: Local`
  - `lifecycle.preStop.exec.command: ["sh", "-c", "sleep 60 && /wait-shutdown"]`
  - `kind: Deployment`
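For concreteness, a minimal sketch of how these options map onto the controller's Service and Deployment, assuming the stock install names (`ingress-nginx` namespace, `ingress-nginx-controller` resources); adjust to your setup:

```sh
# L4 LB in front of the controller, with externalTrafficPolicy: Local so the
# client IP is preserved at the node:
kubectl -n ingress-nginx patch svc ingress-nginx-controller --type merge -p \
  '{"spec":{"type":"LoadBalancer","externalTrafficPolicy":"Local"}}'

# Delay shutdown so the LB (in theory) has time to react before NGINX exits:
kubectl -n ingress-nginx patch deploy ingress-nginx-controller --type json -p \
  '[{"op":"add","path":"/spec/template/spec/containers/0/lifecycle","value":{"preStop":{"exec":{"command":["sh","-c","sleep 60 && /wait-shutdown"]}}}}]'
```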
What happened:
Incoming connections to the HTTP/HTTPS ports from the Load Balancer start timing out as soon as the termination process begins, if the pod is the last replica on a Node. As a result, downscaling via the HPA periodically causes almost 30 seconds of service disruption, because the Load Balancer keeps sending traffic to the Node until failing health checks remove it from the backend pool.
What you expected to happen:
Regardless of whether externalTrafficPolicy is Local or Cluster, the preStop delay should be honored, with the NodePort staying open for its duration, so that the Load Balancer has time to remove the empty Node from its backend pool.
This appears to be a result of the same issue causing https://github.com/kubernetes/kubernetes/issues/85643. Inside the Node, the HTTP(S) NodePort continues to work correctly during the termination process until the app actually stops, at which point the pod has already been safely removed from the Service's Endpoints. However, the moment the termination process begins, that port is closed to external traffic, meaning there is no grace period for the Load Balancer to remove the Node from its backend pool, and any traffic sent to the Node is silently lost.
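As far as I can tell, the mechanism is visible from kube-proxy's per-service health endpoint: with `externalTrafficPolicy: Local` the Service gets a `healthCheckNodePort`, and that is what the GCP health check polls. It returns 503 the moment the Node has no ready local endpoint, but the LB only drops the Node after several failed checks, and that window is where traffic is lost. A sketch for watching it (namespace and service name are assumptions; run it from a VM in the same VPC or from the Node itself):

```sh
# Watch kube-proxy's health endpoint for the service while the last local
# controller pod on the node terminates.
NS=ingress-nginx
SVC=ingress-nginx-controller
HC_PORT=$(kubectl -n "$NS" get svc "$SVC" -o jsonpath='{.spec.healthCheckNodePort}')
NODE_IP=$(kubectl get nodes -o \
  jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')

# 200 while a ready ingress-nginx pod is local to the node; 503 as soon as the
# pod enters Terminating, even though NGINX itself is still serving.
watch -n1 "curl -s -o /dev/null -w '%{http_code}\n' http://${NODE_IP}:${HC_PORT}/"
```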
How to reproduce it:
- Deploy `ingress-nginx` in GKE using `externalTrafficPolicy: Local` and a delaying preStop hook (a command-line sketch of these steps follows the list).
- Run no more than one NGINX pod per Node.
- Send traffic directly to the HTTP NodePort on each Node and observe that it reaches the Default Backend.
- Remove one pod from the Deployment.
- Observe that, immediately, the HTTP NodePort is closed to outside traffic despite the pod itself continuing to run NGINX.
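A rough command-line version of the steps above (a sketch; the namespace, service name, pod name and node IP are placeholders to fill in, and reaching the NodePort from outside may require a firewall rule for the NodePort range):

```sh
NS=ingress-nginx
SVC=ingress-nginx-controller

# 1. Find the HTTP NodePort and the nodes' external IPs.
HTTP_PORT=$(kubectl -n "$NS" get svc "$SVC" \
  -o jsonpath='{.spec.ports[?(@.name=="http")].nodePort}')
kubectl get nodes -o wide   # note the EXTERNAL-IP column

# 2. Hitting the NodePort directly returns the default backend (a 404).
curl -s -o /dev/null -w '%{http_code}\n' http://<node-external-ip>:${HTTP_PORT}/

# 3. Delete the only controller pod on that node...
kubectl -n "$NS" delete pod <controller-pod-on-that-node> --wait=false

# 4. ...and repeat the request: it now times out immediately, even though the
# pod is still Terminating and NGINX is still running inside it.
curl -s -m 5 -o /dev/null -w '%{http_code}\n' http://<node-external-ip>:${HTTP_PORT}/
```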
Anything else we need to know: I’m trying to find any workaround for this problem (however hacky) that preserves the remote IP and, ideally, lets me keep using ingress-nginx.
Already attempted:
- Using `service.type: Both` to ensure at least one NGINX pod on every Node through the DaemonSet – this helps a bit, but, in addition to overprovisioning the service, it has its own problems around the Node itself being terminated.
- Disabling the HPA with a higher baseline replica count – this eliminates the risk of scaling, and thus prevents the issue from occurring, but leaves us at risk of both overprovisioning and of not being able to handle an unexpected spike.
- Running with `externalTrafficPolicy: Cluster` – I could find no configuration in GKE where this preserved the remote IP, which is a hard requirement.
- Switching to a different Ingress controller – so far, the only option that would avoid this problem appears to be the GCE Ingress with an L7 GCP LB. However, I currently rely on NGINX features that the GCE Ingress does not support.
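For reference, one quick way to tell whether a given configuration still preserves the client IP (a sketch; the hostname and resource names are assumptions): send a request through the LB and read the controller's access log.

```sh
# The first field of each access-log line should be the real client IP with
# Local; with Cluster it is typically another node's internal IP after the hop.
curl -s -o /dev/null https://<hostname-behind-the-lb>/
kubectl -n ingress-nginx logs deploy/ingress-nginx-controller --tail=3
```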
/kind bug
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 4
- Comments: 17 (7 by maintainers)
This is easy to replicate with the custom preStop hook removed. In the following example, `hello.sample` is a DNS record pointing at my GCP LB, backed by a simple hello-world service. There are 2 ingress-nginx pods running (on different nodes), and this is the only traffic going to that LB. I deleted one of the pods at 16:39:21. The moment the termination signal arrives, requests start timing out because the LB is still sending traffic to a node that has closed its NodePort. Exactly 24 seconds later, that node is finally removed from the LB and the service is healthy again.
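(A probe loop like the following is enough to see the window; `hello.sample` as above, and "000" in the output marks a timed-out request.)

```sh
# Probe the LB once per second with a short timeout and a timestamp.
while true; do
  printf '%s ' "$(date +%T)"
  curl -s -o /dev/null -m 2 -w '%{http_code} %{time_total}s\n' https://hello.sample/
  sleep 1
done
```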
I don’t believe this is quite describing the same problem – the scenario they describe would be solved by sleeping after receiving SIGTERM so that the endpoint can be removed. Instead, this problem arises because the external LB doesn’t behave the same way as kube-proxy; we could only solve it either on the LB side or by somehow delaying the close of the NodePort.
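One LB-side experiment (not a fix, and possibly reverted by the GKE service controller; the health-check name and values below are assumptions to adapt) is to tighten the health check GKE created for the service, so the Node is dropped sooner after the NodePort closes:

```sh
# Find the legacy HTTP health check backing the network LB for this service,
# then shorten the detection window. GKE may reconcile these values back.
gcloud compute http-health-checks list
gcloud compute http-health-checks update <health-check-for-this-service> \
  --check-interval=2s --timeout=1s --unhealthy-threshold=2
```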