kubernetes: Starting in 1.25 clusters, services of type=LB and xTP=Local sometimes do not update node backends on load balancers
What happened?
When upgrading nodes from 1.24 to 1.25 on a cluster where the master is already at 1.25, I noticed that my Services with type=LoadBalancer and xTP=Local have an incorrect set of nodes after the nodes have been upgraded. The set contains only the old nodes that no longer exist, resulting in my service being unavailable through my load balancer.
What did you expect to happen?
I would expect the load balancer to be properly updated with the new set of nodes after the upgrade.
How can we reproduce it (as minimally and precisely as possible)?
- Create a 1.25 cluster with 1.24 nodes.
- Deploy a service of type=LoadBalancer and xTP=Local.
- Upgrade the 1.24 nodes to 1.25.
- After the upgrade is finished, look at the node list for the load balancer.
Anything else we need to know?
The existing logging is not enough to diagnose the issue. I added some more logs and ran the KCM at log level 5 to find the root cause.
There was a change introduced to reduce the number of syncs for xTP=Local services: #109706. With this change, there are situations where an xTP=Local service never gets updated.
The following is the chain of events.
1. A node is created or deleted, which causes triggerNodeSync() to run.
2. Inside triggerNodeSync(), the nodeLister (line 264) filters for Ready-only nodes, so it does not have the new node or still contains the deleted node. This means that c.needFullSync = false when line 281 is executed.
3. Following down the chain of functions that are called (across goroutines communicating with nodeSyncCh), we end up at nodeSyncInternal. Because c.needFullSync = false, we will only do a sync of services that were marked for retry. If the state was previously good, this means c.servicesToUpdate has 0 services before entering updateLoadBalancerHosts. https://github.com/kubernetes/kubernetes/blob/a866cbe2e5bbaa01cfd5e969aa3e033f3282a8a2/staging/src/k8s.io/cloud-provider/controllers/service/controller.go#L725-L741
4. Nodes are queried again from the NodeLister, but this time the new node or the deleted node is reflected. https://github.com/kubernetes/kubernetes/blob/a866cbe2e5bbaa01cfd5e969aa3e033f3282a8a2/staging/src/k8s.io/cloud-provider/controllers/service/controller.go#L782-L811
5. nodeSyncService is then parallelized based on the number of services. In this case we have no services, so we do no updates, but on line 808 c.lastSyncedNodes is set to the nodes found in step 4. https://github.com/kubernetes/kubernetes/blob/a866cbe2e5bbaa01cfd5e969aa3e033f3282a8a2/staging/src/k8s.io/cloud-provider/controllers/service/controller.go#L808
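To make the chain of events easier to follow, here is a toy model of it in Go. It is only a sketch and not the real controller code: the `node` and `controller` types, the `knownReady` and `lbBackends` fields, and the `simulateUpgrade` helper are hypothetical, there is a single xTP=Local service with nothing pending retry, and the order of node events during the upgrade (new node registers, becomes Ready, old node goes NotReady, old node is deleted) is an assumption.

```go
package main

import (
	"fmt"
	"reflect"
)

// node is a simplified stand-in for a v1.Node: just a name and a Ready flag.
type node struct {
	name  string
	ready bool
}

// controller is a hypothetical, heavily reduced model of the state discussed above.
type controller struct {
	cluster         []node   // stand-in for the node informer cache
	knownReady      []string // Ready-only listing seen by the previous trigger
	lastSyncedNodes []string // overwritten at the end of every sync pass
	needFullSync    bool
	lbBackends      []string // what the cloud load balancer currently holds
}

// names lists node names, optionally keeping only Ready nodes.
func names(nodes []node, readyOnly bool) []string {
	out := []string{}
	for _, n := range nodes {
		if !readyOnly || n.ready {
			out = append(out, n.name)
		}
	}
	return out
}

// triggerNodeSync models steps 1-2: only a change in the Ready-only
// listing requests a full sync.
func (c *controller) triggerNodeSync() {
	ready := names(c.cluster, true)
	if !reflect.DeepEqual(ready, c.knownReady) {
		c.needFullSync = true
	}
	c.knownReady = ready
	c.nodeSyncInternal()
}

// nodeSyncInternal models steps 3-5 for one xTP=Local service.
func (c *controller) nodeSyncInternal() {
	all := names(c.cluster, false) // step 4: listing without the Ready filter
	if c.needFullSync {
		// xTP=Local ignores readiness, so old and new sets are compared unfiltered.
		if !reflect.DeepEqual(c.lastSyncedNodes, all) {
			c.lbBackends = all
		}
		c.needFullSync = false
	}
	c.lastSyncedNodes = all // step 5: recorded even when no service was synced
}

// simulateUpgrade replays the assumed event order and returns the final backends.
func simulateUpgrade() []string {
	c := &controller{
		cluster:         []node{{"old-1", true}},
		knownReady:      []string{"old-1"},
		lastSyncedNodes: []string{"old-1"},
		lbBackends:      []string{"old-1"},
	}
	c.cluster = append(c.cluster, node{"new-1", false}) // new node registers, not Ready
	c.triggerNodeSync()
	c.cluster[1].ready = true // new node becomes Ready
	c.triggerNodeSync()
	c.cluster[0].ready = false // old node goes NotReady
	c.triggerNodeSync()
	c.cluster = c.cluster[1:] // old node is deleted
	c.triggerNodeSync()
	return c.lbBackends
}

func main() {
	fmt.Println(simulateUpgrade()) // prints [old-1]
}
```

In this model the load balancer ends up with only the deleted node as a backend, which matches what I observed: every pass that would have seen a difference was not a full sync, and every full sync saw identical old and new sets.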
A subsequent full sync does not fix things. We go through steps 1 through 5 again, the difference being c.needFullSync = true. This means that in steps 4 and 5, c.servicesToUpdate will not be empty, resulting in the following:
- Inside nodeSyncService we filter based on predicates for xTP=Local and xTP=Cluster. In the case of xTP=Local, since we do not pay attention to Ready status, all of the nodes in c.lastSyncedNodes will be the oldNodes. And since no node creations or deletions have occurred, the newNodes will be the same. This results in no sync (line 767).
In the xTP=Cluster case, the node Ready status is used to filter the nodes, so c.lastSyncedNodes would already have the newly created node, but it would not be ready. This means the node will not be part of oldNodes but will exist in newNodes, which allows the sync to continue as expected.
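The difference between the two policies can be sketched as follows. This is a simplified illustration and not the controller's actual predicate code: `node`, `filterNames`, and `shouldSync` are hypothetical names, and the only predicate modeled is the Ready check.

```go
package main

import (
	"fmt"
	"reflect"
)

// node is a simplified stand-in for a v1.Node.
type node struct {
	name  string
	ready bool
}

// filterNames returns the node names that pass the predicate.
// requireReady=true models xTP=Cluster; false models xTP=Local.
func filterNames(nodes []node, requireReady bool) []string {
	out := []string{}
	for _, n := range nodes {
		if !requireReady || n.ready {
			out = append(out, n.name)
		}
	}
	return out
}

// shouldSync reports whether the filtered old and new node sets differ,
// which is the condition for updating the load balancer in this model.
func shouldSync(lastSynced, current []node, requireReady bool) bool {
	oldNodes := filterNames(lastSynced, requireReady)
	newNodes := filterNames(current, requireReady)
	return !reflect.DeepEqual(oldNodes, newNodes)
}

func main() {
	// lastSynced was recorded while the new node existed but was not Ready.
	lastSynced := []node{{"old-1", true}, {"new-1", false}}
	current := []node{{"old-1", true}, {"new-1", true}}

	fmt.Println("xTP=Local sync needed:", shouldSync(lastSynced, current, false))  // false
	fmt.Println("xTP=Cluster sync needed:", shouldSync(lastSynced, current, true)) // true
}
```

With the Ready filter applied, the new node only shows up on the newNodes side, so the sets differ and the sync proceeds. Without it, both sides already contain the node and the update is skipped.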
Kubernetes version
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.2", GitCommit:"5835544ca568b757a8ecae5c153f317e5736700e", GitTreeState:"clean", BuildDate:"2022-09-21T14:33:49Z", GoVersion:"go1.19.1", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.2-gke.300", GitCommit:"6f9a8e57036ff71785ef9c90998437413a3a8ff5", GitTreeState:"clean", BuildDate:"2022-09-26T09:26:16Z", GoVersion:"go1.19.1 X:boringcrypto", Compiler:"gc", Platform:"linux/amd64"}
Cloud provider
OS version
N/A
Install tools
N/A
Container runtime (CRI) and version (if applicable)
N/A
Related plugins (CNI, CSI, …) and versions (if applicable)
N/A
About this issue
- State: closed
- Created 2 years ago
- Comments: 15 (15 by maintainers)
Commits related to this issue
- UPSTREAM <112793>: Replace k8s.io/cloud-provider with openshift's version Service handling logic in k8s.io/cloud-provider v0.25.(0,1,2) is broken. This issue was fixed upstream, but not released yet.... — committed to openshift/cloud-provider-aws by lobziik 2 years ago
- UPSTREAM <112793>: Replace k8s.io/cloud-provider with openshift's version Service handling logic in k8s.io/cloud-provider v0.25.(0,1,2) is broken. This issue was fixed upstream, but not released yet.... — committed to lobziik/cloud-provider-azure by lobziik 2 years ago
- UPSTREAM <112793>: Replace k8s.io/cloud-provider with openshift's version Service handling logic in k8s.io/cloud-provider v0.25.(0,1,2) is broken. This issue was fixed upstream, but not released yet.... — committed to lobziik/cloud-provider-gcp by lobziik 2 years ago
- UPSTREAM: 112793: Replace k8s.io/cloud-provider with openshift's version Service handling logic in k8s.io/cloud-provider v0.25.(0,1,2) is broken. This issue was fixed upstream, but not released yet. ... — committed to lobziik/cloud-provider-gcp by lobziik 2 years ago
- UPSTREAM: 112793: Replace k8s.io/cloud-provider with openshift's version Service handling logic in k8s.io/cloud-provider v0.25.(0,1,2) is broken. This issue was fixed upstream, but not released yet. ... — committed to lobziik/cloud-provider-aws by lobziik 2 years ago
- UPSTREAM: 112793: Replace k8s.io/cloud-provider with openshift's version Service handling logic in k8s.io/cloud-provider v0.25.(0,1,2) is broken. This issue was fixed upstream, but not released yet. ... — committed to lobziik/cloud-provider-azure by lobziik 2 years ago
- UPSTREAM: 112793: Replace k8s.io/cloud-provider with openshift's version Service handling logic in k8s.io/cloud-provider v0.25.(0,1,2) is broken. This issue was fixed upstream, but not released yet. ... — committed to openshift-cloud-team/cloud-provider-azure by lobziik 2 years ago
- UPSTREAM: 112793: Replace k8s.io/cloud-provider with openshift's version Service handling logic in k8s.io/cloud-provider v0.25.(0,1,2) is broken. This issue was fixed upstream, but not released yet. ... — committed to openshift-cloud-team/cloud-provider-gcp by lobziik 2 years ago
- UPSTREAM: 112793: Replace k8s.io/cloud-provider with openshift's version Service handling logic in k8s.io/cloud-provider v0.25.(0,1,2) is broken. This issue was fixed upstream, but not released yet. ... — committed to JoelSpeed/cloud-provider-azure by lobziik 2 years ago
- UPSTREAM: 112793: Replace k8s.io/cloud-provider with openshift's version Service handling logic in k8s.io/cloud-provider v0.25.(0,1,2) is broken. This issue was fixed upstream, but not released yet. ... — committed to openshift-cloud-team/cloud-provider-aws by lobziik 2 years ago
I have tested with the code on master and looked through the changes that have been made since 1.25 was cut, and this issue seems to be fixed. I believe this issue is specific to 1.25.