cluster-api: failure to connect to etcd pod through proxy with IPv6-only listeners prevents scaling workload clusters to multiple control plane nodes
What steps did you take and what happened:
- OK: Deployed a management cluster on an IPv6-only vSphere environment with a single control plane node
- OK: Deployed a workload cluster on the same infrastructure with a single control plane node
- FAIL: Attempted to deploy another workload cluster on the same infrastructure with 3 worker nodes and 3 control plane nodes
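For reference, the failing step corresponds to a KubeadmControlPlane that requests 3 replicas. Below is a minimal sketch of such a resource, using the v1alpha3 field names that match cluster-api v0.3.x; the infrastructure template reference is illustrative rather than copied from our environment:
apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: KubeadmControlPlane
metadata:
  name: foo-prod-control-plane
  namespace: default
spec:
  replicas: 3                 # scaling from 1 to 3 replicas is what gets stuck
  version: v1.20.1
  infrastructureTemplate:     # illustrative reference to the provider machine template
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: VSphereMachineTemplate
    name: foo-prod-control-plane
  kubeadmConfigSpec: {}       # trimmed; the workaround section below shows the relevant addition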
What happened:
Cluster gets stuck when scaling up the control plane:
- Cluster Ready condition stays False with reason ScalingUp and message Scaling up control plane to 3 replicas (actual 1)
- KubeadmControlPlane Ready condition same as above
- KubeadmControlPlane EtcdClusterHealthyCondition has status Unknown with reason EtcdClusterUnknown and message Following machines are reporting unknown etcd member status: foo-prod-control-plane-g7s2c
- logs for the capi-kubeadm-control-plane pod show "failures"="machine foo-prod-control-plane-g7s2c reports EtcdMemberHealthy condition is unknown (Failed to connect to the etcd pod on the foo-prod-control-plane-g7s2c node)"
- logs for a patched version of capi-kubeadm-control-plane show unable to create etcd client: endpoints: [etcd-foo-prod-control-plane-g7s2c], proxy.KubeConfig.Host: https://[2001:1900:2200:5f75::aba2]:6443: context deadline exceeded
- on the control plane node, containerd logs show failures attempting to dial 127.0.0.1:2379: failed to execute portforward in network namespace "host": failed to dial 2379: dial tcp4 127.0.0.1:2379: connect: connection refused, which sounds related to kubernetes/kubernetes#72597
More details:
Cluster Ready condition stays False with reason ScalingUp and message Scaling up control plane to 3 replicas (actual 1)
$ kubectl get cluster foo-prod -o yaml | yq .status
{
"conditions": [
{
"lastTransitionTime": "2021-03-02T16:58:43Z",
"message": "Scaling up control plane to 3 replicas (actual 1)",
"reason": "ScalingUp",
"severity": "Warning",
"status": "False",
"type": "Ready"
},
{
"lastTransitionTime": "2021-03-02T16:58:43Z",
"message": "Scaling up control plane to 3 replicas (actual 1)",
"reason": "ScalingUp",
"severity": "Warning",
"status": "False",
"type": "ControlPlaneReady"
},
{
"lastTransitionTime": "2021-03-02T16:58:38Z",
"status": "True",
"type": "InfrastructureReady"
}
],
"controlPlaneInitialized": true,
"controlPlaneReady": true,
"infrastructureReady": true,
"observedGeneration": 2,
"phase": "Provisioned"
}
KubeadmControlPlane Ready condition same as above
$ kubectl get kcp foo-prod-control-plane -o yaml | yq .status
{
"conditions": [
{
"lastTransitionTime": "2021-03-02T16:58:43Z",
"message": "Scaling up control plane to 3 replicas (actual 1)",
"reason": "ScalingUp",
"severity": "Warning",
"status": "False",
"type": "Ready"
},
{
"lastTransitionTime": "2021-03-02T17:00:29Z",
"status": "True",
"type": "Available"
},
{
"lastTransitionTime": "2021-03-02T16:58:39Z",
"status": "True",
"type": "CertificatesAvailable"
},
{
"lastTransitionTime": "2021-03-02T17:03:02Z",
"status": "True",
"type": "ControlPlaneComponentsHealthy"
},
{
"lastTransitionTime": "2021-03-02T17:03:04Z",
"message": "Following machines are reporting unknown etcd member status: foo-prod-control-plane-g7s2c",
"reason": "EtcdClusterUnknown",
"status": "Unknown",
"type": "EtcdClusterHealthyCondition"
},
{
"lastTransitionTime": "2021-03-02T17:00:17Z",
"status": "True",
"type": "MachinesReady"
},
{
"lastTransitionTime": "2021-03-02T16:58:40Z",
"message": "Scaling up control plane to 3 replicas (actual 1)",
"reason": "ScalingUp",
"severity": "Warning",
"status": "False",
"type": "Resized"
}
],
"initialized": true,
"observedGeneration": 1,
"ready": true,
"readyReplicas": 1,
"replicas": 1,
"selector": "cluster.x-k8s.io/cluster-name=foo-prod,cluster.x-k8s.io/control-plane",
"updatedReplicas": 1
}
KubeadmControlPlane EtcdClusterHealthyCondition has status Unknown with reason EtcdClusterUnknown and message Following machines are reporting unknown etcd member status: foo-prod-control-plane-g7s2c
$ kubectl logs -n capi-kubeadm-control-plane-system capi-kubeadm-control-plane-controller-manager-5c74b6c7b7-nbdn9 manager | tail -n 5
I0302 17:42:09.830683 1 scale.go:206] controllers/KubeadmControlPlane "msg"="Waiting for control plane to pass preflight checks" "cluster"="foo-prod" "kubeadmControlPlane"="foo-prod-control-plane" "namespace"="default" "failures"="machine foo-prod-control-plane-g7s2c reports EtcdMemberHealthy condition is unknown (Failed to connect to the etcd pod on the foo-prod-control-plane-g7s2c node)"
I0302 17:42:25.037015 1 controller.go:244] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="foo-prod" "kubeadmControlPlane"="foo-prod-control-plane" "namespace"="default"
I0302 17:42:29.511846 1 controller.go:355] controllers/KubeadmControlPlane "msg"="Scaling up control plane" "cluster"="foo-prod" "kubeadmControlPlane"="foo-prod-control-plane" "namespace"="default" "Desired"=3 "Existing"=1
I0302 17:42:29.511988 1 scale.go:206] controllers/KubeadmControlPlane "msg"="Waiting for control plane to pass preflight checks" "cluster"="foo-prod" "kubeadmControlPlane"="foo-prod-control-plane" "namespace"="default" "failures"="machine foo-prod-control-plane-g7s2c reports EtcdMemberHealthy condition is unknown (Failed to connect to the etcd pod on the foo-prod-control-plane-g7s2c node)"
I0302 17:42:44.810311 1 controller.go:244] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="foo-prod" "kubeadmControlPlane"="foo-prod-control-plane" "namespace"="default"
logs for a patched version of capi-kubeadm-control-plane show unable to create etcd client: endpoints: [etcd-foo-prod-control-plane-g7s2c], proxy.KubeConfig.Host: https://[2001:1900:2200:5f75::aba2]:6443: context deadline exceeded
$ kubectl logs -n capi-kubeadm-control-plane-system capi-kubeadm-control-plane-controller-manager-7dd9ff5b8b-52fd6 manager | tail -n 5 | grep scale
I0303 17:40:30.970734 1 scale.go:206] controllers/KubeadmControlPlane "msg"="Waiting for control plane to pass preflight checks" "cluster"="foo-prod" "kubeadmControlPlane"="foo-prod-control-plane" "namespace"="default" "failures"="machine foo-prod-control-plane-g7s2c reports EtcdMemberHealthy condition is unknown (Failed to connect to the etcd pod on the foo-prod-control-plane-g7s2c node: unable to create etcd client: endpoints: [etcd-foo-prod-control-plane-g7s2c], proxy.KubeConfig.Host: https://[2001:1900:2200:5f75::aba2]:6443: context deadline exceeded)"
On the control plane node, containerd logs show failures attempting to dial 127.0.0.1:2379: failed to execute portforward in network namespace "host": failed to dial 2379: dial tcp4 127.0.0.1:2379: connect: connection refused. This sounds related to kubernetes/kubernetes#72597, and looks like it’s coming from kubernetes/kubernetes/pkg/kubelet/cri/streaming/portforward/httpstream.go (I haven’t traced it down to the innermost error yet).
root@foo-prod-control-plane-kng8h [ ~ ]# journalctl -u containerd | grep "error forwarding" | tail -n 3
Mar 04 15:06:34 foo-prod-control-plane-kng8h containerd[776]: E0304 15:06:34.810525 776 httpstream.go:257] error forwarding port 2379 to pod b26dad3e47ad6b0d02d085615c5a00e8be5ef689b5b5efdb09ad8ea0672c6884, uid : failed to execute portforward in network namespace "host": failed to dial 2379: dial tcp4 127.0.0.1:2379: connect: connection refused
Mar 04 15:06:52 foo-prod-control-plane-kng8h containerd[776]: E0304 15:06:52.474727 776 httpstream.go:257] error forwarding port 2379 to pod b26dad3e47ad6b0d02d085615c5a00e8be5ef689b5b5efdb09ad8ea0672c6884, uid : failed to execute portforward in network namespace "host": failed to dial 2379: dial tcp4 127.0.0.1:2379: connect: connection refused
Mar 04 15:06:53 foo-prod-control-plane-kng8h containerd[776]: E0304 15:06:53.498592 776 httpstream.go:257] error forwarding port 2379 to pod b26dad3e47ad6b0d02d085615c5a00e8be5ef689b5b5efdb09ad8ea0672c6884, uid : failed to execute portforward in network namespace "host": failed to dial 2379: dial tcp4 127.0.0.1:2379: connect: connection refused
What did you expect to happen:
Not sure: we are in the early stages of exploring what level of support currently exists for IPv6 in cluster API.
Anything else you would like to add:
I hesitate to call this a bug as we are in the early stages of exploring what level of support currently exists for IPv6 in Cluster API. I’m happy to discuss other ways to track the larger set of feature requests implicit in this report, i.e. IPv6 support across Cluster API and the various providers. However, I did want to capture what I’ve found so far with this particular issue.
In our case, our exploration has initially focused on the vSphere provider. We have been able to successfully deploy management and workload clusters with single control plane nodes on vSphere, using a fork of kube-vip with some early prototyping of IPv6 support for control plane VIPs.
We encountered this issue in our first attempts to deploy workload clusters with multiple control plane nodes.
Workaround
In our case, there is a loopback device with 127.0.0.1 on these nodes, so we can work around this issue with the following change to the etcd manifest:
spec:
containers:
- command:
- etcd
- --advertise-client-urls=https://[2001:1900:2200:5f75::65]:2379
- --cert-file=/etc/kubernetes/pki/etcd/server.crt
- --cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
- --client-cert-auth=true
- --data-dir=/var/lib/etcd
- --initial-advertise-peer-urls=https://[2001:1900:2200:5f75::65]:2380
- --initial-cluster=foo-prod-control-plane-g7s2c=https://[2001:1900:2200:5f75::d4]:2380,foo-prod-control-plane-kng8h=https://[2001:1900:2200:5f75::65]:2380
- --initial-cluster-state=existing
- --key-file=/etc/kubernetes/pki/etcd/server.key
- - --listen-client-urls=https://[::1]:2379,https://[2001:1900:2200:5f75::65]:2379
+ - --listen-client-urls=https://[::1]:2379,https://[2001:1900:2200:5f75::65]:2379,https://127.0.0.1:2379
- --listen-metrics-urls=http://[::1]:2381
- --listen-peer-urls=https://[2001:1900:2200:5f75::65]:2380
- --name=foo-prod-control-plane-kng8h
- --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
- --peer-client-cert-auth=true
- --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
- --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
- --snapshot-count=10000
- --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
As a hack in the meantime, this edit can be made via postKubeadmCommands:
postKubeadmCommands:
- sed -i '/listen-client-urls/ s/$/,https:\/\/127.0.0.1:2379/' /etc/kubernetes/manifests/etcd.yaml
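For a cluster managed end to end by cluster-api, that command goes under spec.kubeadmConfigSpec.postKubeadmCommands of the KubeadmControlPlane, so every control plane node applies it right after kubeadm finishes. A hedged, untested sketch with the surrounding fields trimmed (v1alpha3 field names; the sed line is the same one shown above):
apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: KubeadmControlPlane
metadata:
  name: foo-prod-control-plane
spec:
  # replicas, version, infrastructureTemplate, etc. trimmed
  kubeadmConfigSpec:
    # clusterConfiguration / initConfiguration / joinConfiguration trimmed
    postKubeadmCommands:
      # add a 127.0.0.1 client listener so the port-forward used by the KCP
      # controller can reach etcd on IPv6-only nodes
      - sed -i '/listen-client-urls/ s/$/,https:\/\/127.0.0.1:2379/' /etc/kubernetes/manifests/etcd.yaml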
Environment:
- Cluster-api version: v0.3.14
- Minikube/KIND version: n/a
- Kubernetes version (use kubectl version): v1.20.1
- OS (e.g. from /etc/os-release): Photon v3.0
/kind bug
About this issue
- State: closed
- Created 3 years ago
- Comments: 17 (16 by maintainers)
We can probably close this out?
We’re using the same hack for IPv6 in CAPZ currently: https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/master/templates/flavors/ipv6/patches/kubeadm-controlplane.yaml#L9
cc @aramase
This should be fixed in https://github.com/containerd/containerd/pull/5145
Yup, I’d file a new “bug” issue related to that original as it does affect IPv6 in general. Shouldn’t need to specify ::1 on ipv6 IMO.