kubernetes: GKE network load balancer target pools get updated slowly for changes in node pools
BUG REPORT
When adding new node pools or scaling a node pool, not all of the network load balancers pointing to services of type LoadBalancer get their target pools updated. Over time this leads to target pools pointing to non-existing nodes.
Kubernetes version
Client Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.6", GitCommit:"e569a27d02001e343cb68086bc06d47804f62af6", GitTreeState:"clean", BuildDate:"2016-11-12T05:22:15Z", GoVersion:"go1.6.3", Compiler:"gc", Platform:"windows/amd64"}
Server Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.7", GitCommit:"92b4f971662de9d8770f8dcd2ee01ec226a6f6c0", GitTreeState:"clean", BuildDate:"2016-12-10T04:43:42Z", GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"}
Environment:
- GKE 1.4.7
- GCI image
What happened: When adding a new node pool and removing an old one, some of the GCE network load balancers pointing to services of type LoadBalancer don't get their target pools updated to point at the new nodes.
Deleting the service and recreating it is currently our only way of fixing this, which leads to downtime.
It happens in multiple clusters, which were created on different Kubernetes versions than they currently run. It also happens for newly created services, so it doesn't seem to be related to services originally having been created on an older version of Kubernetes.
What you expected to happen: Target pools to always remain in sync with the actual GKE cluster and its node pools.
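A minimal way to check whether the target pools have drifted from the cluster's node list is sketched below. The region value and temp-file paths are placeholders, and it assumes gcloud and kubectl already point at the affected project and cluster:

```bash
#!/usr/bin/env bash
# Sketch: compare the instances registered in each GCE target pool with the nodes
# currently in the cluster. On GKE, node names match the underlying GCE instance
# names, which is what this comparison relies on. REGION is a placeholder.
set -euo pipefail
REGION="europe-west1"   # placeholder, adjust to your cluster's region

kubectl get nodes -o jsonpath='{.items[*].metadata.name}' | tr ' ' '\n' | sort > /tmp/k8s-nodes.txt

for pool in $(gcloud compute target-pools list --regions "$REGION" --format='value(name)'); do
  # 'instances' is a list of instance URLs (.../zones/<zone>/instances/<name>);
  # value() joins list items with ';', so split on that and keep the last path segment.
  gcloud compute target-pools describe "$pool" --region "$REGION" \
    --format='value(instances)' | tr ';' '\n' | awk -F/ '{print $NF}' | sort > /tmp/pool-nodes.txt
  echo "== ${pool} =="
  # '<' lines are cluster nodes missing from the pool, '>' lines are stale pool entries.
  diff /tmp/k8s-nodes.txt /tmp/pool-nodes.txt || true
done
```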
About this issue
- State: closed
- Created 7 years ago
- Reactions: 4
- Comments: 25 (13 by maintainers)
This issue is still causing us trouble in Kubernetes Engine version 1.9.2-gke.1. Last week we shifted hosts from one node pool to another (not really shifting, but growing one node pool while shrinking the other), and because the load balancers for services of type LoadBalancer with externalTrafficPolicy: Local were not updated quickly enough, they were left pointing at non-existing hosts, rendering our service inaccessible.
This is in a Kubernetes Engine cluster of approximately 60 hosts and 236 load-balanced services, with constant change in the number of hosts due to the cluster autoscalers. It seems the service controller, or whatever component is responsible for updating target pools, is either not up to the task or deliberately throttled so as not to overwhelm the Google API.
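For anyone else debugging this, a quick way to list the services exposed to this failure mode (type LoadBalancer with externalTrafficPolicy: Local) is something like the following. It is only a sketch and assumes jq is available:

```bash
# List services of type LoadBalancer that use externalTrafficPolicy: Local,
# i.e. the kind of services the comment above describes as being affected.
kubectl get svc --all-namespaces -o json \
  | jq -r '.items[]
      | select(.spec.type == "LoadBalancer" and .spec.externalTrafficPolicy == "Local")
      | "\(.metadata.namespace)/\(.metadata.name)"'
```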
Ran into this issue when rolling over from one node pool to another to use a different machine type. While the issue says the updates are slow, in my case the services just didn't get updated at all, and I ended up manually reconfiguring the load balancer in the GCP console (a gcloud equivalent is sketched after this comment).
It seems important for services not to go down during operations like rolling over node pools, and I wonder whether another improvement would be for Kubernetes, when a node pool is deleted, to verify that no resources at all reference those nodes; failing to delete the node pool would be much better than disabling services.
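For reference, the manual reconfiguration mentioned above can also be done with gcloud rather than the console. This is only a sketch: the pool name, region, zones and instance names are placeholders, and its behaviour when the old VM is already gone has not been verified here. One way to find the right pool is to match the service's external IP against `gcloud compute forwarding-rules list`, whose target column points at the target pool.

```bash
# Sketch of the manual fix via gcloud instead of the console.
# All angle-bracketed values are placeholders to be looked up for the affected service.
POOL="<target-pool-name>"
REGION="<region>"

# Drop the entry for a node that no longer exists (zone of the old node) ...
gcloud compute target-pools remove-instances "$POOL" --region "$REGION" \
  --instances-zone "<zone-of-the-old-node>" --instances "<old-node-instance-name>"

# ... and register a node from the new pool (zone of the new node).
gcloud compute target-pools add-instances "$POOL" --region "$REGION" \
  --instances-zone "<zone-of-the-new-node>" --instances "<new-node-instance-name>"
```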
Removing orphaned network load balancers seems to have fixed the problem of some of them not getting updated at all.
A test shows, though, that the update process is really slow: for 116 target pools it takes 37 minutes, roughly 19 seconds per target pool.
When scaling up this leads to unbalanced incoming traffic, because the majority of it still goes to the nodes that were present at the start of the scaling action.
But for scaling down it's far worse. For our production cluster with over 150 target pools it would take up to 50 minutes for all target pools to reflect a removed node. Until the stale entry is gone, the network load balancer keeps sending traffic to a non-existing node, leading to failed requests.
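To see how long that window actually is, one can watch for target pools that still reference a node after it has been removed. The sketch below assumes gcloud access; NODE and REGION are placeholders:

```bash
# Sketch: list every target pool that still references a node that has already
# been removed from the cluster.
NODE="<name-of-the-removed-node>"
REGION="<region>"

for pool in $(gcloud compute target-pools list --regions "$REGION" --format='value(name)'); do
  # The instance URLs end in /instances/<name>, so match on that suffix.
  if gcloud compute target-pools describe "$pool" --region "$REGION" \
       --format='value(instances)' | grep -qE "/instances/${NODE}(;|$)"; then
    echo "stale: ${NODE} still referenced by target pool ${pool}"
  fi
done
```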