aws-load-balancer-controller: Target does not get registered if using readinessGate

I’ve upgraded to v1.1.6 to make use of the pod readiness gate feature and reduce 502/504 errors during HPA scale events, and then updated my deployment per this document.
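
For context, a minimal sketch of what that deployment change looks like, assuming placeholder names (my-ingress, my-service, service port 80); the conditionType has to follow the target-health.alb.ingress.k8s.aws/<ingress>_<service>_<service-port> pattern that also shows up in the pod status below:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      # Readiness gate keyed to the ALB target group health for this ingress/service/port.
      readinessGates:
      - conditionType: target-health.alb.ingress.k8s.aws/my-ingress_my-service_80
      containers:
      - name: my-app
        image: my-app:1.0.0        # placeholder image
        ports:
        - containerPort: 80
        # readinessProbe shown for completeness (placeholder path)
        readinessProbe:
          httpGet:
            path: /healthz
            port: 80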

After I updated my deployment to have the readiness gate, the first pod with the readinessGate spec got registered by the controller just fine, with the following status:

status:
  conditions:
  - lastProbeTime: "2020-03-25T07:17:09Z"
    lastTransitionTime: "2020-03-25T07:13:48Z"
    status: "True"
    type: target-health.alb.ingress.k8s.aws/<ingress>_<service>_<service-port>

The next pod of the same deployment had its status stuck like this:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-03-25T07:10:29Z"
    message: corresponding condition of pod readiness gate "target-health.alb.ingress.k8s.aws/<ingress>_<service>_<service-port>"
      does not exist.
    reason: ReadinessGatesNotReady
    status: "False"
    type: Ready

I waited a good five minutes for that pod to be registered in the target group, but it never was. I can reproduce this by performing multiple rolling updates of the same deployment. However, if I delete the controller pod and let it restart, it picks up these pods and registers them in the target group. After that initial round of registrations, it again stops registering new pods.

Is there anything wrong with the approach I took, or is this an issue on the controller’s end?

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 7
  • Comments: 34 (4 by maintainers)

Most upvoted comments

@casret We plan to do a release this week. We will try to include https://github.com/kubernetes-sigs/aws-alb-ingress-controller/pull/1211, and will release without it if we cannot merge it before Friday.

Can we get a point release for this bugfix? It affects us as well.

This should be fixed on master now.

@nirnanaaa and I found the underlying issue.

When all containers in a pod have started, the pod’s IP appears in the NotReadyAddresses field of the corresponding Endpoints object. That update of the Endpoints object triggers a reconciliation loop, but we ignore it if not all containers in the pod are ready at that point in time (which can be the case if a readinessProbe is defined). Later, even when all containers become ready, the pod still does not count as Ready, because the readiness gate is not fulfilled yet. Its IP is therefore never moved from NotReadyAddresses to Addresses, the Endpoints object never gets updated, no further reconcile is triggered for it, and new pods can just sit there until some other update happens to that Endpoints object (for whatever reason).
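
To make that concrete, here is a sketch of the Endpoints object in the stuck state, with placeholder names and IPs: the new pod’s IP sits in notReadyAddresses and, because its readiness gate condition is never set, it never moves into addresses, so the object stops changing and no further reconcile fires:

apiVersion: v1
kind: Endpoints
metadata:
  name: my-service                  # placeholder; matches the Service name
subsets:
- addresses:                        # fully Ready pods; targets are registered from here
  - ip: 10.0.1.23                   # the first pod, already registered
    targetRef:
      kind: Pod
      name: my-app-7d4f9c-abcde
  notReadyAddresses:                # containers started, but pod not Ready yet
  - ip: 10.0.1.42                   # the stuck pod; stays here because its readiness
    targetRef:                      # gate condition is never set by the controller
      kind: Pod
      name: my-app-7d4f9c-fghij
  ports:
  - port: 80
    protocol: TCP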

The solution to this would be to add a pod reconciler that triggers endpoint reconciliations for any associated Endpoints objects. However, this will likely have other implications for the controller (higher CPU and memory requirements, since it would keep all pod objects in memory). @M00nF1sh any thoughts?

@devkid We came to the same conclusion (as seen in #1214). I can confirm we’re back to fast rollouts with that change.

AFAIK we already have informers implicitly hooked on pod objects (see for instance the failure messages in #1209).