kubernetes: 1.16: etcd client does not parse IPv6 addresses correctly when members are joining

What happened: Running kubeadm 1.16.1 on CentOS 7 with local etcd, the first kube master (2001:db8:101:53e9::1:17) is up and running. Trying to make the second master (2001:db8:101:53e9::1:2d) join the cluster fails with this error:

I1006 00:05:57.730235   30231 etcd.go:107] etcd endpoints read from pods: https://[2001:db8:101:53e9::1:17]:2379
I1006 00:05:57.750843   30231 etcd.go:156] etcd endpoints read from etcd: https://[2001:db8:101:53e9::1:17]:2379
I1006 00:05:57.750907   30231 etcd.go:125] update etcd endpoints: https://[2001:db8:101:53e9::1:17]:2379
failed to dial endpoint https://[2001:db8:101:53e9::1:17]:2379 with maintenance client: context deadline exceeded
etcd cluster is not healthy
k8s.io/kubernetes/cmd/kubeadm/app/phases/etcd.CheckLocalEtcdClusterStatus
	/workspace/anago-v1.16.1-beta.0.37+d647ddbd755faf/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/phases/etcd/local.go:87
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join.runCheckEtcdPhase
	/workspace/anago-v1.16.1-beta.0.37+d647ddbd755faf/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join/checketcd.go:68
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
	/workspace/anago-v1.16.1-beta.0.37+d647ddbd755faf/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:236
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
	/workspace/anago-v1.16.1-beta.0.37+d647ddbd755faf/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:424
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
	/workspace/anago-v1.16.1-beta.0.37+d647ddbd755faf/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:209
k8s.io/kubernetes/cmd/kubeadm/app/cmd.NewCmdJoin.func1
	/workspace/anago-v1.16.1-beta.0.37+d647ddbd755faf/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/join.go:169
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).execute
	/workspace/anago-v1.16.1-beta.0.37+d647ddbd755faf/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:830
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).ExecuteC
	/workspace/anago-v1.16.1-beta.0.37+d647ddbd755faf/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:914
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).Execute
	/workspace/anago-v1.16.1-beta.0.37+d647ddbd755faf/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:864
k8s.io/kubernetes/cmd/kubeadm/app.Run
	/workspace/anago-v1.16.1-beta.0.37+d647ddbd755faf/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/kubeadm.go:50
main.main
	_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/kubeadm.go:25
runtime.main
	/usr/local/go/src/runtime/proc.go:200
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1337
error execution phase check-etcd
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
	/workspace/anago-v1.16.1-beta.0.37+d647ddbd755faf/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:237
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
	/workspace/anago-v1.16.1-beta.0.37+d647ddbd755faf/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:424
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
	/workspace/anago-v1.16.1-beta.0.37+d647ddbd755faf/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:209
k8s.io/kubernetes/cmd/kubeadm/app/cmd.NewCmdJoin.func1
	/workspace/anago-v1.16.1-beta.0.37+d647ddbd755faf/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/join.go:169
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).execute
	/workspace/anago-v1.16.1-beta.0.37+d647ddbd755faf/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:830
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).ExecuteC
	/workspace/anago-v1.16.1-beta.0.37+d647ddbd755faf/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:914
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).Execute
	/workspace/anago-v1.16.1-beta.0.37+d647ddbd755faf/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:864
k8s.io/kubernetes/cmd/kubeadm/app.Run
	/workspace/anago-v1.16.1-beta.0.37+d647ddbd755faf/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/kubeadm.go:50
main.main
	_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/kubeadm.go:25
runtime.main
	/usr/local/go/src/runtime/proc.go:200
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1337

The etcd log on the first kube master shows:

2019-10-06 18:31:14.493853 I | mvcc: finished scheduled compaction at 56939 (took 1.44269ms)
2019-10-06 18:33:03.880905 I | embed: rejected connection from "[2001:db8:101:53e9::1:2d]:60812" (error "remote error: tls: bad certificate", ServerName "[2001")
2019-10-06 18:33:04.891260 I | embed: rejected connection from "[2001:db8:101:53e9::1:2d]:60818" (error "remote error: tls: bad certificate", ServerName "[2001")
2019-10-06 18:33:06.416960 I | embed: rejected connection from "[2001:db8:101:53e9::1:2d]:60820" (error "remote error: tls: bad certificate", ServerName "[2001")
2019-10-06 18:33:08.748958 I | embed: rejected connection from "[2001:db8:101:53e9::1:2d]:60822" (error "remote error: tls: bad certificate", ServerName "[2001")
2019-10-06 18:33:13.451584 I | embed: rejected connection from "[2001:db8:101:53e9::1:2d]:60828" (error "remote error: tls: bad certificate", ServerName "[2001")
2019-10-06 18:33:18.742184 I | embed: rejected connection from "[2001:db8:101:53e9::1:2d]:60834" (error "remote error: tls: bad certificate", ServerName "[2001")
2019-10-06 18:36:14.504374 I | mvcc: store.index: compact 57302

I verified that the certificate used by the second kube master has the correct server name (the node's IPv6 address).
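The truncated ServerName "[2001" in the etcd log suggests the client is splitting the endpoint on the first colon rather than treating the bracketed IPv6 host as a unit. A minimal Go sketch of that suspected failure mode (an illustration of the symptom, not the actual kubeadm/etcd code):

package main

import (
	"fmt"
	"net"
	"net/url"
	"strings"
)

func main() {
	endpoint := "https://[2001:db8:101:53e9::1:17]:2379"
	u, _ := url.Parse(endpoint) // u.Host == "[2001:db8:101:53e9::1:17]:2379"

	// Splitting on the first colon truncates the bracketed IPv6 host,
	// which would produce the "[2001" seen as ServerName in the etcd log.
	fmt.Println(strings.Split(u.Host, ":")[0]) // [2001

	// net.SplitHostPort handles bracketed IPv6 hosts correctly.
	host, port, _ := net.SplitHostPort(u.Host)
	fmt.Println(host, port) // 2001:db8:101:53e9::1:17 2379
}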

What you expected to happen: kubeadm join brings the second kube master into the cluster.

How to reproduce it (as minimally and precisely as possible):

  1. kubeadm init on master-1
  2. kubeadm join on master-2 (a sketch of example commands follows below)
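The report does not include the exact commands; on an IPv6-only control plane they would typically look something like the following (the addresses are taken from the report, while the token, CA cert hash, and certificate key are placeholders):

# on master-1
kubeadm init --apiserver-advertise-address=2001:db8:101:53e9::1:17 --upload-certs

# on master-2
kubeadm join [2001:db8:101:53e9::1:17]:6443 --control-plane \
    --apiserver-advertise-address=2001:db8:101:53e9::1:2d \
    --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash> \
    --certificate-key <key>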

Anything else we need to know?:

  1. I used the same steps to create a 1.15.3 cluster without any problems.
  2. This is a pure IPv6 environment; there is no IPv4 at all.

Environment:

  • Kubernetes version (use kubectl version):
# kubectl version
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.1", GitCommit:"d647ddbd755faf07169599a625faf302ffc34458", GitTreeState:"clean", BuildDate:"2019-10-02T17:01:15Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.1", GitCommit:"d647ddbd755faf07169599a625faf302ffc34458", GitTreeState:"clean", BuildDate:"2019-10-02T16:51:36Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: OpenStack
  • OS (e.g: cat /etc/os-release):
# cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
  • Kernel (e.g. uname -a):
# uname -a
Linux x1-master-2.x1-host.rkn.ksng.io 3.10.0-1062.1.2.el7.x86_64 #1 SMP Mon Sep 30 14:19:46 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
# kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.1", GitCommit:"d647ddbd755faf07169599a625faf302ffc34458", GitTreeState:"clean", BuildDate:"2019-10-02T16:58:27Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
  • Network plugin and version (if this is a network-related bug): calico
  • Others: the host is IPv6-only; there are no IPv4 interfaces.

Most upvoted comments

We will backport the etcd fix to 3.4 and 3.3 once it is merged to master.

and this works consistently?

Yes, 1.15.3 has always worked fine. FYI, we have 5 or 6 clusters running 1.15.3 (both Ubuntu and CentOS, if it matters); a couple of dev/test clusters have been redeployed roughly once a week over the past couple of months, and there are more personal deployments as well.

also please mind that 1.16 includes a new version of etcd,

I tried retagging the k8s.gcr.io/etcd:3.3.10 image from 1.15.3 as the new version and the behavior was the same, though I only tested it once.
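For reference, such a retag can be done along these lines (the exact etcd tag that kubeadm 1.16 expects is not stated in the thread; 3.3.15-0 is used here only as an example):

# docker pull k8s.gcr.io/etcd:3.3.10
# docker tag k8s.gcr.io/etcd:3.3.10 k8s.gcr.io/etcd:3.3.15-0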