kubernetes: Windows: Disappearing/missing windows loadbalancers for k8s services
What happened?
In our Windows clusters, we sometimes notice that a Windows pod cannot reach a ClusterIP service.
We have seen this when a Windows pod tries to resolve a DNS name by contacting the DNS server's ClusterIP (i.e., nslookup fails from inside the pod). When this happens, hnsdiag list loadbalancers does not show the VIP for the kube-dns ClusterIP, even though the kube-proxy logs indicate that this load balancer was created. So it likely got deleted at some point (or was never actually created). The DNS server configured inside the pod is correct. As an experiment, we pointed the pod's DNS server at one of the kube-dns endpoint IPs (the pod IP directly) and nslookup worked. We set it back to the DNS ClusterIP and it started failing again.
Once we delete a kube-dns pod (running on a Linux node) so that it gets recreated, the resulting event reaches the Windows kube-proxy, which creates the load balancer again, and things start working correctly.
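The manual check described above (looking for a service VIP in the hnsdiag list loadbalancers output) could be scripted roughly as follows. This is only a sketch: the function name missing_vips is hypothetical, and it deliberately assumes nothing about the hnsdiag output format beyond the VIP appearing somewhere as text. Collecting the inputs (for example, saving the hnsdiag output on the node and taking ClusterIPs from kubectl get svc) is left to the reader.

```python
import re
from typing import Iterable, List


def missing_vips(hns_output: str, cluster_ips: Iterable[str]) -> List[str]:
    """Return the ClusterIPs that never appear as a whole token in hns_output.

    hns_output is treated as opaque text; no specific hnsdiag format is assumed.
    """
    missing = []
    for ip in cluster_ips:
        # Guard against partial matches, e.g. 10.0.0.1 inside 10.0.0.10.
        pattern = r"(?<![\d.])" + re.escape(ip) + r"(?!\.?\d)"
        if re.search(pattern, hns_output) is None:
            missing.append(ip)
    return missing


if __name__ == "__main__":
    # Hypothetical sample text; real hnsdiag output is formatted differently.
    sample = "lb-1 vip=10.96.0.100 port=443\nlb-2 vip=10.96.12.7 port=80\n"
    print(missing_vips(sample, ["10.96.0.10", "10.96.12.7"]))  # ['10.96.0.10']
```

Any IP returned here is a service that the node has no load balancer text for, which matches the symptom reported for the kube-dns ClusterIP.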
What did you expect to happen?
We expect the Windows load balancers to stay in sync with the Kubernetes services.
How can we reproduce it (as minimally and precisely as possible)?
There is no easy reproduction case.
Anything else we need to know?
No response
Kubernetes version
Cloud provider
OS version
# On Windows:
PS C:\Windows\system32> wmic os get Version
Version
10.0.17763
PS C:\Windows\system32> wmic os get Caption
Caption
Microsoft Windows Server 2019 Datacenter
PS C:\Windows\system32> wmic os get BuildNumber
BuildNumber
17763
PS C:\Windows\system32> wmic os get OSArchitecture
OSArchitecture
64-bit
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, …) and versions (if applicable)
About this issue
- State: closed
- Created 2 years ago
- Reactions: 1
- Comments: 27 (8 by maintainers)
@shettyg Can you collect the output of this script on the node (if possible within a day after the issue reproduces)? https://github.com/microsoft/SDN/blob/master/Kubernetes/windows/debug/collectlogs.ps1
Do you mean the GitHub issue? I will create one.