cluster-api-provider-azure: Windows pod error: failed to setup network for sandbox

/kind bug

What steps did you take and what happened:

Our VMSS tests include a step to validate a LoadBalancer that sits in front of a Windows pod. Those tests occasionally fail because that pod reports:

  Warning  FailedCreatePodSandBox  24m                  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "6587e3b5ab2f01bbc40aef99de40539a4b7c77e9579efcdec323a404ce7baafc": plugin type="calico" failed (add): global strict affinity should not be false for Windows node
  Warning  FailedCreatePodSandBox  63s (x103 over 24m)  kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "1794612b2e974afa3866f41153f7e104c0fbc9bfd16951c00ac77fd8c703e9a2": plugin type="calico" failed (add): global strict affinity should not be false for Windows node
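
The message itself points at Calico's requirement that IPAM strict affinity be enabled for Windows nodes. As a minimal sketch of how one might confirm and correct that on an affected cluster (assuming calicoctl is installed and configured against the cluster's datastore):

  # Inspect the current Calico IPAM configuration; Windows nodes require strictAffinity: true.
  calicoctl ipam show --show-configuration

  # If strictAffinity is reported as false, enable it.
  calicoctl ipam configure --strictaffinity=true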

Node metadata for the Windows node that the pod was scheduled onto does not report anything that suggests a root cause:

https://storage.googleapis.com/kubernetes-jenkins/logs/periodic-cluster-api-provider-azure-e2e-main/1631536507762249728/artifacts/clusters/capz-e2e-0wup7c-vmss/nodes/win-p-win000001/node-describe.txt
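
For anyone digging into a repro, a couple of quick checks against the live cluster (the node name below is taken from the artifact link above; the output is only illustrative):

  # Look for Calico-managed annotations and conditions on the Windows node.
  kubectl describe node win-p-win000001 | grep -i calico

  # List the sandbox-creation failures across the cluster.
  kubectl get events -A --field-selector reason=FailedCreatePodSandBox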

What did you expect to happen:

Anything else you would like to add:

Environment:

  • cluster-api-provider-azure version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):

About this issue

  • State: closed
  • Created a year ago
  • Comments: 24 (21 by maintainers)

Most upvoted comments

Based on timing alone, the start of this flake seems to be related to the PR that switched templates to use the out-of-tree cloud-provider:

[Screenshot from 2023-03-17 showing the timing correlation]

If it’s related to the out-of-tree cloud provider and isn’t happening every time (i.e. it could be a timing issue), I strongly suspect the cause is what I mentioned in #2591:

I just installed the latest Helm chart version and it generally seems to work; however, sometimes the cloud node manager still fails to start up. In those cases it seems that the replace in PowerShell: […] is working, but the kubeconfig is overwritten right afterwards. We verified this by SSHing to a node that had failed and checking the file, which didn’t have the path rewritten. When we executed the replace again manually, the file also switched back to the original value again.

https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/2591#issuecomment-1453717172
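
To confirm that overwrite behavior on a node, something like the following PowerShell could be run right after the rewrite step (the kubeconfig path here is an assumption for illustration, not taken from the source):

  # Hash the kubeconfig after the rewrite, wait, and hash it again to detect
  # whether something overwrites it (the path below is assumed, not confirmed).
  $kubeconfig = 'C:\k\azure-cloud-node-manager.kubeconfig'
  $before = (Get-FileHash $kubeconfig).Hash
  Start-Sleep -Seconds 60
  $after = (Get-FileHash $kubeconfig).Hash
  if ($before -ne $after) {
    Write-Host 'kubeconfig changed after setup; the rewrite was overwritten'
  }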

I suggest giving this change a try (and/or finding the root cause of why the file is being overwritten).

Can we see if https://github.com/kubernetes-sigs/cluster-api-provider-azure/pull/3311 fixes anything? If it doesn’t, I’m in favor of removing the Windows node pools from the VMSS tests until we sort out the issue.