aws-load-balancer-controller: Getting 502/504 with Pod Readiness Gates during rolling updates

I’m using the Pod Readiness Gate on Kubernetes Deployments running Golang-based APIs, with the goal of achieving fully zero-downtime deployments.

During a rolling update of the Kubernetes Deployment, I’m getting 502/504 responses from these APIs. This did not happen when setting target-type: instance.
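For reference, this is roughly how the setup looks (a minimal sketch, assuming aws-load-balancer-controller v2.x; names and hosts are placeholders). Readiness gate injection is enabled via the namespace label, and the Ingress uses IP targets:

apiVersion: v1
kind: Namespace
metadata:
  name: my-apis                                   # placeholder namespace
  labels:
    # the controller injects target-health readiness gates into pods created here
    elbv2.k8s.aws/pod-readiness-gate-inject: enabled
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-api                                    # placeholder
  namespace: my-apis
  annotations:
    alb.ingress.kubernetes.io/target-type: ip     # pods are registered directly as targets
spec:
  ingressClassName: alb
  rules:
    - host: example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-api
                port:
                  number: 80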

I believe the problem is that AWS does not drain the pod’s target from the load balancer before Kubernetes terminates the pod.

Timeline of events:

  1. Perform a rolling update on the deployment (1 replica)
  2. A second pod is created in the deployment
  3. AWS registers a second target in the Load Balancing Target Group
  4. Both pods begin receiving traffic
  5. I’m not sure which happens first at this point: (a) AWS begins de-registering/draining the target, or (b) Kubernetes begins terminating the pod
  6. Traffic sent to the deployment begins receiving 502 and 504 errors
  7. The old pod is deleted
  8. Traffic returns to normal (200)
  9. The target is de-registered/drained (depending on delay)

This is tested with a looping curl command:

while true; do
  curl --write-out '%{url_effective} - %{http_code} -' --silent --output /dev/null -L https://example.com | pv -N "$(date +"%T")" -t
  sleep 1
done

Results:

https://example.com - 200 - 13:04:16: 0:00:00
https://example.com - 502 - 13:04:17: 0:00:01
https://example.com - 200 - 13:04:20: 0:00:00
https://example.com - 504 - 13:04:31: 0:00:10
https://example.com - 200 - 13:04:32: 0:00:00
https://example.com - 200 - 13:04:33: 0:00:00
https://example.com - 200 - 13:04:34: 0:00:00
https://example.com - 200 - 13:04:35: 0:00:00
https://example.com - 200 - 13:04:36: 0:00:00

About this issue

  • State: open
  • Created 4 years ago
  • Reactions: 47
  • Comments: 28 (7 by maintainers)

Most upvoted comments

What’s the protocol for getting this prioritized? We’ve hit it as well. This is a serious issue and while I understand there’s a workaround (hack), it’s certainly reducing my confidence in running production workloads on this thing.

How about abusing (?) a ValidatingAdmissionWebhook for delaying pod deletion? Here’s a sketch of the idea:

  1. The ValidatingAdmissionWebhook intercepts pod deletion. At first, it won’t allow deletion of the pod if the pod is still reachable from the ALB (target-type: ip Ingress).
  2. However, it patches the pod: it removes its labels and ownerReferences, so the pod is removed from its ReplicaSet and from the Endpoints. The ELB also starts draining the target, since it has been removed from the Endpoints.
  3. After some time passes and the ELB finishes its draining, the pod is deleted by aws-load-balancer-controller.

edit: I’ve implemented this idea as a chart here: https://github.com/foriequal0/pod-graceful-drain
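For anyone curious, a rough sketch of the kind of webhook registration this idea relies on (this is not the linked chart itself; the service reference and names are placeholders):

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: delay-pod-deletion                  # placeholder
webhooks:
  - name: delay-pod-deletion.example.com    # placeholder
    admissionReviewVersions: ["v1"]
    sideEffects: NoneOnDryRun               # the webhook patches/deletes pods out of band
    failurePolicy: Ignore                   # never block deletions if the webhook is unavailable
    clientConfig:
      service:
        namespace: kube-system              # placeholder
        name: delay-pod-deletion
        path: /validate-pod-deletion
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["DELETE"]
        resources: ["pods"]

The interception itself only covers step 1; the patching and delayed deletion in steps 2–3 would be done by the webhook’s backing controller.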

This is still a serious issue; any update on it? We currently use the solution from @foriequal0, which has been doing a great job so far. I wish this were officially handled by the controller project itself.

Bumping this issue. Adding sleep() does not sound professional; it’s a workaround and only a workaround 😕

I thought it might be useful to share this KubeCon talk, “The Gotchas of Zero-Downtime Traffic /w Kubernetes”, where the speaker goes into the strategies required for zero-downtime rolling updates with Kubernetes deployments (at least as of 2022):

https://www.youtube.com/watch?v=0o5C12kzEDI

It can be a bit hard to conceptualise the limitations of the asynchronous nature of Ingress/Endpoints objects and Pod termination, so I found the above talk (and live demo) helped a lot.

Hopefully it’s useful for others.

@AirbornePorcine In my own testing, the sum of the controller processing time (from pod kill to target deregistration) and the ELB API propagation time (from the deregister API call to the targets actually being removed from the ELB data plane) is less than 10 seconds.

And the preStop hook sleep only needs to be the controller processing time + ELB API propagation time + HTTP request/response RTT.

I’ve just asked the ELB team whether they have p90/p99 metrics available for the ELB API propagation time. If so, we can recommend a safe preStop sleep.
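For illustration, this is where such a sleep goes in the pod template (a sketch only; the 15s figure is just an example based on the numbers above, and the exec form assumes the image ships a sleep binary, which distroless Go images may not):

spec:
  terminationGracePeriodSeconds: 60       # must exceed the preStop sleep
  containers:
    - name: api                           # placeholder
      image: my-api:latest                # placeholder
      lifecycle:
        preStop:
          exec:
            # ~ controller processing time + ELB API propagation time + request RTT
            command: ["sleep", "15"]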

Ok, so, we just did some additional testing on that sleep timing.

The only way we’ve been able to get zero 502s during a rolling deploy is to set our preStop sleep to the target group’s deregistration delay + at least 5s. It seems almost like there’s no way to guarantee that AWS isn’t still sending you new requests until the target is fully removed from the target group, not just marked “draining”.

Looking back in my emails, I realized this is exactly what AWS support had previously told us to do: don’t stop the target from processing requests until at least the target group’s deregistration delay has elapsed (we added the 5s to account for the controller processing and propagation time you mentioned).

Next week we’ll try tweaking our deregistration delay and see if the same holds true (it’s currently 60s, but we really don’t want to sleep that long if we can avoid it).

Something you might want to try, though, @calvinbui!
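For anyone wanting to try the same thing, a sketch with placeholder values: pin the target group’s deregistration delay on the Ingress and size the preStop sleep (as in the snippet further up) to that delay plus a small buffer.

# Ingress: pin the deregistration delay for this Ingress's target groups (60s assumed here)
metadata:
  annotations:
    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=60
---
# Deployment pod template: keep serving until the target is fully removed
spec:
  terminationGracePeriodSeconds: 90       # must exceed the preStop sleep
  containers:
    - name: api                           # placeholder
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "65"]      # 60s deregistration delay + 5s buffer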

For clusters using Traefik proxy as the ingress, it might also be worth looking into the entrypoint lifecycle feature to control graceful shutdowns: https://doc.traefik.io/traefik/routing/entrypoints/#lifecycle. At least in this case it avoids the need for the sleep workaround 😃
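For reference, that option lives in Traefik’s static entrypoint configuration; a minimal sketch with example timings (values are placeholders):

entryPoints:
  websecure:
    address: ":443"
    transport:
      lifeCycle:
        requestAcceptGraceTimeout: 15s    # keep accepting new requests for a while after SIGTERM
        graceTimeOut: 10s                 # then give in-flight requests this long to finish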

We’ve been having the same issue. We confirmed with AWS that there is some propagation time between when a target is marked draining in a target group and when that target actually stops receiving new connections. So, at the suggestion of other issues I’ve seen in the old project for this, we added a 20s sleep in a preStop script. This hasn’t entirely eliminated them, though; they still happen on deployment, just not with as much volume. Following this to see if anyone else has any good ideas, as troubleshooting these 502s has been infuriatingly difficult.