cluster-api-provider-azure: Windows pod error: failed to setup network for sandbox
/kind bug
What steps did you take and what happened:
Our VMSS tests include a step to validate a LoadBalancer that sits in front of a Windows pod. Those tests occasionally fail because that pod reports:
Warning FailedCreatePodSandBox 24m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "6587e3b5ab2f01bbc40aef99de40539a4b7c77e9579efcdec323a404ce7baafc": plugin type="calico" failed (add): global strict affinity should not be false for Windows node
Warning FailedCreatePodSandBox 63s (x103 over 24m) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "1794612b2e974afa3866f41153f7e104c0fbc9bfd16951c00ac77fd8c703e9a2": plugin type="calico" failed (add): global strict affinity should not be false for Windows node
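To tell this flake apart from other sandbox failures when triaging test output, the event text can be matched mechanically. The following is a minimal sketch, assuming the events are available as plain text in the format quoted above; the names `STRICT_AFFINITY_RE` and `find_strict_affinity_failures` are hypothetical and not part of any existing tooling.

```python
import re

# Pattern derived only from the two event messages quoted above.
# Hypothetical helper for triage; not part of CAPZ or Calico.
STRICT_AFFINITY_RE = re.compile(
    r'failed to setup network for sandbox "(?P<sandbox>[0-9a-f]+)": '
    r'plugin type="calico" failed \(add\): '
    r'global strict affinity should not be false for Windows node'
)


def find_strict_affinity_failures(event_text):
    """Return the sandbox IDs of all events matching the strict-affinity error."""
    return [m.group("sandbox") for m in STRICT_AFFINITY_RE.finditer(event_text)]
```

An empty result means that none of the events carry this specific Calico error, which would point to a different cause for the failed pod.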
Node metadata for the Windows node that the pod was scheduled onto does not report anything that suggests a root cause:
What did you expect to happen:
Anything else you would like to add:
Environment:
- cluster-api-provider-azure version:
- Kubernetes version: (use kubectl version):
- OS (e.g. from /etc/os-release):
About this issue
- State: closed
- Created a year ago
- Comments: 24 (21 by maintainers)
Based on timing alone, the start of this flake seems to be related to the PR that switched templates to use the out-of-tree cloud-provider:
If it’s related to the out-of-tree cloud-provider and doesn’t always happen (i.e. it could be a timing issue), I strongly suspect that the cause is what I mentioned in #2591:
https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/2591#issuecomment-1453717172
I suggest giving this change a try (and/or finding the root cause of why it’s overwriting).
Can we see if https://github.com/kubernetes-sigs/cluster-api-provider-azure/pull/3311 fixes anything? If it doesn’t, I’m in favor of removing the Windows node pools from the VMSS tests until we sort the issue out.