ingress-gce: Unable to avoid unhealthy backend / 502s on rolling deployments

I have a GCE ingress in front of an HPA-managed deployment (at this time, with a single replica).

On a rolling deployment, I sometimes run into the backend being marked unhealthy, resulting in 502 errors, usually for about 15-20 seconds.

According to the pod events, the neg-readiness-reflector appears to set the cloud.google.com/load-balancer-neg-ready condition to True before the pod is actually ready:

Normal   LoadBalancerNegNotReady            18m                neg-readiness-reflector                Waiting for pod to become healthy in at least one of the NEG(s): [k8s1-600f13cf-default-my-svc-8080-f82bf741]
Normal   LoadBalancerNegWithoutHealthCheck  16m                neg-readiness-reflector                Pod is in NEG "Key{\"k8s1-600f13cf-default-my-svc-8080-f82bf741\", zone: \"europe-west1-c\"}". NEG is not attached to any Backend Service with health checking. Marking condition "cloud.google.com/load-balancer-neg-ready" to True.
Warning  Unhealthy                          16m                kubelet                                Readiness probe failed: Get "http://10.129.128.130:8080/healthz": dial tcp 10.129.128.130:8080: connect: connection refused

While in this state, the previous pod terminates, but the load balancer does not route requests to the new pod, resulting in 502s.

I do have a deployment strategy set that should prevent this, but I suspect the NEG readiness gate being marked Ready prematurely is subverting it:

  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0

My deployment also defines a readiness probe, as can be seen in the events above.

I also have a health check configured for the backend:

apiVersion: v1
kind: Service
metadata:
  name: my-svc
  labels:
    app.kubernetes.io/name: mysvc
  annotations:
    cloud.google.com/backend-config: '{"ports": {"8080":"my-backendconfig"}}'
    cloud.google.com/neg: '{"ingress": true}'
spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/name: mysvc
  ports:
    - port: 8080
      protocol: TCP
      targetPort: 8080
---
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: my-backendconfig
spec:
  timeoutSec: 45
  connectionDraining:
    drainingTimeoutSec: 0
  healthCheck:
    checkIntervalSec: 5
    timeoutSec: 5
    healthyThreshold: 1
    unhealthyThreshold: 2
    type: HTTP
    requestPath: /healthz
    port: 8080

I found this Stack Overflow question in which the user works around the issue by delaying pod shutdown with a sleep in lifecycle.preStop, but that seems more like a hack than a proper solution to this issue: https://stackoverflow.com/questions/71127572/neg-is-not-attached-to-any-backendservice-with-health-checking.
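For reference, that workaround amounts to a preStop sleep on the serving container, roughly like this (a sketch only; the container name and the 60-second value are placeholders, and the sleep has to be long enough to cover the NEG detach / load balancer programming latency):

spec:
  template:
    spec:
      containers:
        - name: app   # placeholder container name
          lifecycle:
            preStop:
              exec:
                # Keep the old pod serving while its endpoint is detached from the NEG.
                command: ["/bin/sh", "-c", "sleep 60"]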

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 11
  • Comments: 35 (6 by maintainers)

Most upvoted comments

This can potentially be of relevance:

https://cloud.google.com/kubernetes-engine/docs/troubleshooting/troubleshoot-load-balancing#500-series-errors

It explains what might be happening and how to tackle it.

While it’s a great summary and shows visually what is happening, there is nothing new in it that was not already discussed above in this issue – even my original post mentions the preStop sleep workaround/hack.

Also see comment https://github.com/kubernetes/ingress-gce/issues/1718#issuecomment-1140097570 and the following comments with a solution for the second problem that can cause 500s related to zones. I think this information should also be added to that article.

On a managed platform, I don’t think users should, in general, have to configure application-level resources with things like preStop hooks whose sole purpose is to work around how the platform operates.

We have minReadySeconds set and are using container-native load balancing, and we still see 502s every now and then.
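For anyone else trying this, minReadySeconds sits at the Deployment spec level, next to the rolling update strategy from the original post (the value here is just an example):

spec:
  minReadySeconds: 30   # example value; a new pod must stay Ready this long before it counts as available
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0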

You may find this GCP documentation helpful, as it describes the problems discussed here and some possible solutions: https://cloud.google.com/kubernetes-engine/docs/how-to/container-native-load-balancing#traffic_does_not_reach_endpoints

I’m also getting 502 errors during availability zone changes. I added the cloud.google.com/neg: '{"ingress":false}' annotation to my service so that the load balancer forwards to all instance groups in the cluster, and I think it solves the problem relatively cleanly.

Are there drawbacks to this workaround? Any reason it wasn’t suggested in this thread before?
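For reference, that change applied to the my-svc Service from the original post would look roughly like this (note that without NEGs, GKE Ingress routes through instance groups, so the Service generally needs to be of type NodePort and traffic takes an extra hop through kube-proxy):

apiVersion: v1
kind: Service
metadata:
  name: my-svc
  annotations:
    cloud.google.com/backend-config: '{"ports": {"8080":"my-backendconfig"}}'
    # Disable container-native load balancing; the load balancer targets
    # instance groups instead of pod NEGs.
    cloud.google.com/neg: '{"ingress": false}'
spec:
  type: NodePort
  selector:
    app.kubernetes.io/name: mysvc
  ports:
    - port: 8080
      protocol: TCP
      targetPort: 8080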

So I believe the following is what is happening:

  1. The load balancer is hitting the old pod even though it is being shut down (using lifecycle.preStop does seem to work around this issue). This is consistent and easily reproduced.

The NEG controller does respond to endpoint changes immediately. However, there can be latency due to the time it takes for a detach operation to complete, and until it completes, the load balancer may still route traffic to the terminating pod. There are a few options to mitigate this:

  1. The lifecycle.preStop hook, as you are already using.
  2. Use terminationGracePeriodSeconds so that your application continues to accept traffic for a little longer (see the sketch below).
  3. Adjust health checks to be sensitive enough to recognize that the endpoint is gone. This option will only reduce the frequency of 502s but won’t necessarily eliminate them.

Since you are already doing (1), that is probably the easiest approach.
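To make the relationship between (1) and (2) concrete: the preStop sleep counts against the pod’s grace period, so terminationGracePeriodSeconds has to cover both the sleep and the application’s own shutdown time (the values below are illustrative):

spec:
  template:
    spec:
      # Must be >= the preStop sleep (e.g. 60s) plus the time the application needs to shut down.
      terminationGracePeriodSeconds: 90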

  2. The load balancer is hitting the new pod even though it is not ready yet due to the “NEG is not attached to any Backend Service with health checking.” error. This is intermittent and more difficult to reproduce, but not uncommon.

This is a race between the Ingress controller and a workload being scheduled/started on a new node. The NEG controller sees the update before the Ingress controller and adds the endpoint. However, if this is on a node in a new zone, the NEG controller creates a new NEG in that zone, and the Ingress controller then needs to attach that NEG to the backend service. If the Ingress controller doesn’t finish that before the workloads are scheduled on the node, those new pods will have their readiness gates switched to Ready immediately, since the NEG is not yet in any BackendService.
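For anyone digging into this, the readiness gate involved in that race looks roughly like this on an affected pod (illustrative, trimmed; the gate is injected by GKE, and the condition is the one the neg-readiness-reflector flips in the events above):

spec:
  readinessGates:
    - conditionType: cloud.google.com/load-balancer-neg-ready
status:
  conditions:
    - type: cloud.google.com/load-balancer-neg-ready
      status: "True"   # set prematurely when the new NEG is not yet attached to any backend service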

For non-Autopilot clusters, our recommendation is to reduce the number of zone changes and to try to run workloads in every zone the cluster is in. Since this is an Autopilot cluster, your options may be limited in this regard.

At this time we are still looking into how to make the experience better for both of these cases.

Indeed, but this is not a stop problem, it is a start problem: Kubernetes starts deleting the old pod while the GCLB has not yet attached the new pod to the NEG. It takes a random amount of time to fix itself, so 60 seconds is not always sufficient.

There is obviously a problem in the ingress management. I’m praying each time I deploy; Kubernetes just brings me downtime 💢.

Can’t Google estimate the rate of occurrences of LoadBalancerNegWithoutHealthCheck across GKE?

@swetharepakula I understand but I’d rather have slightly worse performance with the double hop than downtime with NEG unless I’m missing some other considerations.

The workaround I suggested with terminationGracePeriodSeconds is not the documented way to use the field. Typically it gives the pod more time to shut down so it can exit gracefully, as you describe. To use the workaround I suggested, you would have to modify the application’s shutdown logic so that it does not start shutting down immediately, but instead keeps accepting requests while failing the health check for a period of time. So your terminationGracePeriodSeconds would be timeToShutdown + GCE programming latency, and your application would have to wait out that GCE programming latency before beginning shutdown operations. In comparison, lifecycle.preStop is probably easier to configure.

For now we will keep this issue open to communicate updates on this front. There isn’t another issue open to track this work.