aws-load-balancer-controller: Target does not get registered if using readinessGate

I’ve upgraded to v1.1.6 to make use of the pod readiness gate feature and reduce 502/504 errors during HPA scale events, and then updated my deployment per this document.
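
For context, a minimal sketch of what that deployment change looks like, assuming placeholder names (my-ingress, my-service, service port 80); the conditionType has to follow the target-health.alb.ingress.k8s.aws/<ingress>_<service>_<service-port> pattern that also shows up in the pod status below:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      # Readiness gate keyed to the ALB target group health for this ingress/service/port.
      readinessGates:
      - conditionType: target-health.alb.ingress.k8s.aws/my-ingress_my-service_80
      containers:
      - name: my-app
        image: my-app:1.0.0        # placeholder image
        ports:
        - containerPort: 80
        # readinessProbe shown for completeness (placeholder path)
        readinessProbe:
          httpGet:
            path: /healthz
            port: 80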

After I updated my deployment to have the readiness gate, the first pod with the readinessGate spec got registered by the controller just fine, with the following status:

status:
  conditions:
  - lastProbeTime: "2020-03-25T07:17:09Z"
    lastTransitionTime: "2020-03-25T07:13:48Z"
    status: "True"
    type: target-health.alb.ingress.k8s.aws/<ingress>_<service>_<service-port>

The next pod of the same deployment had its status stuck like this:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-03-25T07:10:29Z"
    message: corresponding condition of pod readiness gate "target-health.alb.ingress.k8s.aws/<ingress>_<service>_<service-port>"
      does not exist.
    reason: ReadinessGatesNotReady
    status: "False"
    type: Ready

I waited a good five minutes for that pod to be registered in the target group, but it never was. I can reproduce this by performing multiple rolling updates of the same deployment. However, if I delete the controller pod and let it restart, it picks up these pods and registers them in the target group. After that initial round of registrations, it again stops registering new pods.

Is there anything wrong with the approach I took, or is this an issue on the controller’s end?

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 7
  • Comments: 34 (4 by maintainers)

Most upvoted comments

@casret We plan to do a release this week. We will try to include https://github.com/kubernetes-sigs/aws-alb-ingress-controller/pull/1211, and will release without it if we cannot merge it before Friday.

Can we get a point release for this bugfix? It affects us as well.

This should be fixed on master now.

@nirnanaaa and I found the underlying issue.

When all containers in a pod have started, the pod’s IP appears in the NotReadyAddresses field of the corresponding Endpoints object. That update of the Endpoints object triggers a reconciliation loop, but we ignore it if not all containers in the pod are ready at that point in time (which can be the case if a readinessProbe is defined). Later, even when all containers become ready, the pod still does not count as Ready, because the readiness gate is not fulfilled yet. Its IP is therefore never moved from NotReadyAddresses to Addresses, the Endpoints object never gets updated, no further reconcile is triggered for it, and new pods can just sit there until some other update happens to that Endpoints object (for whatever reason).
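
To make that concrete, here is a sketch of the Endpoints object in the stuck state, with placeholder names and IPs: the new pod’s IP sits in notReadyAddresses and, because its readiness gate condition is never set, it never moves into addresses, so the object stops changing and no further reconcile fires:

apiVersion: v1
kind: Endpoints
metadata:
  name: my-service                  # placeholder; matches the Service name
subsets:
- addresses:                        # fully Ready pods; targets are registered from here
  - ip: 10.0.1.23                   # the first pod, already registered
    targetRef:
      kind: Pod
      name: my-app-7d4f9c-abcde
  notReadyAddresses:                # containers started, but pod not Ready yet
  - ip: 10.0.1.42                   # the stuck pod; stays here because its readiness
    targetRef:                      # gate condition is never set by the controller
      kind: Pod
      name: my-app-7d4f9c-fghij
  ports:
  - port: 80
    protocol: TCP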

The solution to this would be to add a pod reconciler that triggers endpoint reconciliations for any associated Endpoints objects. However, this will likely have other implications for the controller (higher CPU and memory requirements, since it would keep all pod objects in memory). @M00nF1sh any thoughts?

@devkid We came to the same conclusion (as seen in #1214). I can confirm we’re back to fast rollouts with that change.

AFAIK we already have informers implicitly hooked on pod objects (see for instance the failure messages in #1209).