aws-load-balancer-controller: Target does not get registered if using readinessGate
I’ve upgraded to v1.1.6 to make use of the pod readiness gate feature to reduce 502/504s during HPA scale events, and then updated my deployment per this document.
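For reference, the change amounts to adding a readinessGates entry to the pod template. A minimal sketch, with hypothetical names for the Deployment, Ingress, service and port (the conditionType must follow the <ingress>_<service>_<service-port> pattern visible in the statuses below):

```yaml
# Minimal sketch; all names (my-app, my-ingress, my-service, 8080) are
# placeholders. Only the readinessGates stanza is the relevant addition.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      readinessGates:
      # Condition type pattern: target-health.alb.ingress.k8s.aws/<ingress>_<service>_<service-port>
      - conditionType: target-health.alb.ingress.k8s.aws/my-ingress_my-service_8080
      containers:
      - name: app
        image: my-app:latest
        ports:
        - containerPort: 8080
```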
After I updated my deployment to have the readiness gate, the first pod that had the readinessGate spec was registered by the controller just fine, with the following status:
status:
  conditions:
  - lastProbeTime: "2020-03-25T07:17:09Z"
    lastTransitionTime: "2020-03-25T07:13:48Z"
    status: "True"
    type: target-health.alb.ingress.k8s.aws/<ingress>_<service>_<service-port>
The next pod of the same deployment, however, had its status stuck like this:
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-03-25T07:10:29Z"
    message: corresponding condition of pod readiness gate "target-health.alb.ingress.k8s.aws/<ingress>_<service>_<service-port>"
      does not exist.
    reason: ReadinessGatesNotReady
    status: "False"
    type: Ready
I waited a good 5 minutes for that pod to be registered into the target group, but it never was. I can reproduce this by performing multiple rollouts of the same deployment. However, if I delete the controller pod and let it restart, it picks up the stuck pods and registers them into the target group. After those registrations, it again stops registering new pods.
Is there anything wrong with the approach I took, or is this an issue on the controller’s end?
About this issue
- State: closed
- Created 4 years ago
- Reactions: 7
- Comments: 34 (4 by maintainers)
Commits related to this issue
- React on pod events for readiness gates: We're relying on endpoints events to re-trigger reconciliations during rollouts, and we're considering pod's containers status (eg. are all pod's containers Co... — committed to DataDog/aws-alb-ingress-controller by bpineau 4 years ago
- React on pod events for readiness gates: We're relying on endpoints events to re-trigger reconciliations during rollouts, and we're considering pod's containers status (eg. are all pod's containers Co... — committed to alebedev87/aws-load-balancer-controller by bpineau 4 years ago
@casret we plan to do a release this week. We will try to include https://github.com/kubernetes-sigs/aws-alb-ingress-controller/pull/1211, and will release without it if we cannot merge it before Friday.
Can we get a point release for this bugfix? It affects us as well.
This should be fixed on master now.
@nirnanaaa and I found the underlying issue.
When all containers in a pod have started, the pod IP appears in the NotReadyAddresses field of the corresponding Endpoints object. For this update of the Endpoints object, we trigger a reconciliation loop. However, we ignore it if not all containers in the pod are ready at that point in time (which can be the case if a readinessProbe is defined). Then, once all containers in the pod become ready, the pod still doesn’t count as ready (because the readiness gate is not fulfilled yet): its IP is not moved from NotReadyAddresses to Addresses, so the Endpoints object doesn’t get updated, and thus no reconcile is triggered again for this Endpoints object. New pods might just sit there until another update happens on the Endpoints object, for whatever reason.
The solution to this would be to add a pod reconciler which triggers reconciliations for any associated Endpoints objects. However, this will likely have other implications for the controller (more CPU and memory requirements, because it will keep all the pod objects in memory). @M00nF1sh any thoughts?
@devkid We came to the same conclusion (as seen in #1214). I confirm we’re back to fast rollouts with that change.
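To make the scenario described above concrete, here is a rough sketch of the Endpoints state during such a stuck rollout (the service name, pod names, IPs and port are hypothetical). The new pod’s IP stays in notReadyAddresses because its readiness gate is unfulfilled, so the object never changes again and the controller is never re-triggered for it:

```yaml
# Hypothetical Endpoints snapshot during a stuck rollout; the service name,
# pod names, IPs and port are placeholders.
apiVersion: v1
kind: Endpoints
metadata:
  name: my-service
  namespace: default
subsets:
- addresses:
  # Old pod: Ready, already registered in the target group.
  - ip: 10.0.1.23
    targetRef:
      kind: Pod
      name: my-app-7c9f6d-old
  notReadyAddresses:
  # New pod: all containers are ready, but the readiness gate is unfulfilled,
  # so it is never promoted to addresses and no further update event on this
  # object reaches the controller.
  - ip: 10.0.1.57
    targetRef:
      kind: Pod
      name: my-app-7c9f6d-new
  ports:
  - port: 8080
    protocol: TCP
```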
AFAIK we already have implicitly hooked informers on pod objects (see for instance the failure messages in #1209).