kubernetes: Windows: Disappearing/missing windows loadbalancers for k8s services

What happened?

In our Windows clusters, we sometimes notice that a Windows pod cannot access a clusterIP service.

We have seen this issue when a Windows pod tries to resolve a DNS name by contacting the DNS server's clusterIP (i.e., nslookup fails from inside the pod). When this happens, hnsdiag list loadbalancers does not show the VIP for the kube-dns clusterIP, yet the kube-proxy logs indicate that this load balancer was created. So it likely got deleted at some point (or was never actually created). The DNS server configured inside the pod is correct. As an experiment, we pointed the pod's DNS server directly at one of the kube-dns endpoint IPs (a pod IP) and nslookup worked correctly. We set it back to the DNS clusterIP and it started failing again.
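The check described above (is the service VIP present in the hnsdiag output or not) could be scripted roughly as follows. This is only a sketch: the function name missing_vips is hypothetical, the sample text is made up, and the code deliberately does not assume any particular hnsdiag output layout. It only looks for each expected address as a whole token in whatever text was captured from hnsdiag list loadbalancers.

```python
import re
from typing import List


def missing_vips(hns_output: str, expected_vips: List[str]) -> List[str]:
    """Return the expected VIPs that do not appear anywhere in hns_output."""
    missing = []
    for vip in expected_vips:
        # Match the address as a whole token so that 10.0.0.1 is not
        # reported as present just because 10.0.0.10 is.
        pattern = r"(?<![0-9.])" + re.escape(vip) + r"(?!\.?[0-9])"
        if re.search(pattern, hns_output) is None:
            missing.append(vip)
    return missing


if __name__ == "__main__":
    # Hypothetical captured text; the real hnsdiag layout may differ.
    sample = "lb-1  VIP 10.0.0.10  port 53\nlb-2  VIP 10.0.0.20  port 443\n"
    print(missing_vips(sample, ["10.0.0.10", "10.0.0.1"]))
```

The expected list would come from the clusterIPs of the services in question (for example the kube-dns clusterIP), and a non-empty result corresponds to the failure state described here.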

Once we delete a kube-dns pod (running on a Linux node) so that it gets recreated, the resulting event to the Windows kube-proxy triggers creation of the load balancer again and things start working correctly.

CC: @jsturtevant @daschott

What did you expect to happen?

We expect the Windows load balancers to be in sync with Kubernetes services.

How can we reproduce it (as minimally and precisely as possible)?

There is no easy reproduction case.

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
API server: 1.22.8 Nodes: 1.21.10

Cloud provider

GKE

OS version

This is Windows Server 2019. I will update the bug with exact details soon (as soon as I get access to those nodes again).

PS C:\Windows\system32> wmic os get Version
Version
10.0.17763

PS C:\Windows\system32> wmic os get Caption
Caption
Microsoft Windows Server 2019 Datacenter

PS C:\Windows\system32> wmic os get BuildNumber
BuildNumber
17763

PS C:\Windows\system32> wmic os get OSArchitecture
OSArchitecture
64-bit

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, …) and versions (if applicable)

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 27 (8 by maintainers)

Most upvoted comments

@shettyg Can you collect the output of this script on the node (if possible within a day after the issue reproduces)? https://github.com/microsoft/SDN/blob/master/Kubernetes/windows/debug/collectlogs.ps1

Do you mean the GitHub issue? I will create one.