cloud-provider-azure: IPv6 + Calico + out-of-tree cloud-provider is broken

While moving all CAPZ templates to the out-of-tree CCM, I was not able to get a reliably passing test for IPv6. The test is very flaky (it fails most of the time). Looking into it, I found that kubelet can’t patch the status of one of the kube-system pods (sometimes kube-apiserver, sometimes kube-scheduler, sometimes kube-controller-manager). One example of the error can be seen here. The error is:

	Pod "kube-controller-manager-capz-e2e-4laer3-ipv6-control-plane-6v7dn" is invalid: [status.podIPs: Invalid value: []core.PodIP{core.PodIP{IP:"2001:1234:5678:9abc::5"}, core.PodIP{IP:"2001:1234:5678:9abc::5"}}: may specify no more than one IP for each IP family, status.podIPs[1]: Duplicate value: core.PodIP{IP:"2001:1234:5678:9abc::5"}]
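For context, the apiserver’s pod validation allows at most one IP per family in status.podIPs. A minimal Go sketch of that rule (simplified, not the actual upstream validation code) shows why the duplicate IPv6 address above is rejected:

```go
package main

import (
	"fmt"
	"net"
)

// validatePodIPs mirrors the "may specify no more than one IP for each IP
// family" rule: at most one IPv4 and one IPv6 address, with no duplicates.
// Simplified sketch only; not the upstream apiserver validation code.
func validatePodIPs(podIPs []string) error {
	seen := map[string]string{} // family ("IPv4"/"IPv6") -> first IP seen
	for i, ip := range podIPs {
		parsed := net.ParseIP(ip)
		if parsed == nil {
			return fmt.Errorf("status.podIPs[%d]: invalid IP %q", i, ip)
		}
		family := "IPv6"
		if parsed.To4() != nil {
			family = "IPv4"
		}
		if first, dup := seen[family]; dup {
			return fmt.Errorf("status.podIPs[%d]: duplicate value %q (already have %s address %q)", i, ip, family, first)
		}
		seen[family] = ip
	}
	return nil
}

func main() {
	// The failing patch from the issue: the same IPv6 address reported twice.
	fmt.Println(validatePodIPs([]string{
		"2001:1234:5678:9abc::5",
		"2001:1234:5678:9abc::5",
	}))
}
```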

The same error does not repro with the in-tree cloud provider (see a passing IPv6 test with CAPZ and the in-tree cloud provider here).

The cloud-provider-azure test for IPv6 is also failing, although the failure is different: https://testgrid.k8s.io/provider-azure-cloud-provider-azure#cloud-provider-azure-master-ipv6-capz

About this issue

  • State: closed
  • Created a year ago
  • Comments: 38 (36 by maintainers)

Most upvoted comments

@CecileRobertMichon @lzhecheng please check https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/3705-cloud-node-ips. This is WIP and a known, complicated problem; we also have some amendments because of the complexity: https://github.com/kubernetes/enhancements/pull/3898

Please check the KEP to see if your case is covered, and if not, please let us know.

Let me try without it.

Update: the e2e test passed several times in a row when node-ip is not set: kubernetes-sigs/cluster-api-provider-azure#3221

Glad to know that. I think CNM doesn’t need to change because, as you said, IPv4 first and then IPv6 is the expected behaviour for CAPZ dual-stack and IPv6-only.

Update: I think I found the root cause: kubelet is not properly processing host IPs. The OOT cloud provider behaves a little differently from before, which directly leads to the failure, but I think it is kubelet that should handle the situation. I will do further verification tomorrow, and then make a fix and a detailed root-cause analysis.
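If kubelet derives a host-network pod’s status.podIPs directly from the node’s addresses, a node address list that repeats the same IPv6 address yields exactly the invalid patch above. Here is a hedged sketch of the kind of normalization that would avoid it — dedupe to one address per family, IPv4 before IPv6 (the expected CAPZ ordering mentioned above). The function name and shape are illustrative, not kubelet’s actual implementation:

```go
package main

import (
	"fmt"
	"net"
)

// normalizeHostIPs keeps at most one address per IP family, IPv4 before IPv6,
// so a host-network pod's status.podIPs always passes apiserver validation.
// Illustrative only; not kubelet's actual code.
func normalizeHostIPs(nodeIPs []string) []string {
	var v4, v6 string
	for _, ip := range nodeIPs {
		parsed := net.ParseIP(ip)
		switch {
		case parsed == nil:
			continue // skip unparseable entries
		case parsed.To4() != nil && v4 == "":
			v4 = ip // first IPv4 wins
		case parsed.To4() == nil && v6 == "":
			v6 = ip // first IPv6 wins
		}
	}
	out := []string{}
	if v4 != "" {
		out = append(out, v4)
	}
	if v6 != "" {
		out = append(out, v6)
	}
	return out
}

func main() {
	// The duplicate IPv6 from the failing test collapses to a single entry.
	fmt.Println(normalizeHostIPs([]string{"2001:1234:5678:9abc::5", "2001:1234:5678:9abc::5"}))
}
```

Taking the first address seen per family keeps the IPv4-then-IPv6 ordering stable no matter how many times the cloud provider reports each address.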

@lzhecheng yes, but you’ll need to create a template based on the custom-builds.yaml template that uses out-of-tree + IPv6 (that’s not a combination of features we have in the templates in the repo, unfortunately): https://capz.sigs.k8s.io/developers/kubernetes-developers.html#kubernetes-117

@lzhecheng the 3rd control plane node is not coming up because of the “invalid pod IPs” issue on the 2nd control plane node: Cluster API checks the health of all control plane components before scaling up, and in this case the check is failing because the scheduler pod is not healthy.