k3d: Cluster cannot be created because of an issue with the (second) server node

Original problem: I tried to create a cluster with multiple nodes on a single physical machine. This is the command:

k3d cluster create -a3 -s3 test3 --trace

and the process got stuck forever at this point, while the cluster was firing up the second server node:

DEBU[0005] Waiting for node k3d-test3-server-0 to get ready (Log: 'Running kube-apiserver') 
TRAC[0005] NodeWaitForLogMessage: Node 'k3d-test3-server-0' waiting for log message 'Running kube-apiserver' since '2021-06-02 12:45:02.516841645 +0000 UTC' 
DEBU[0006] Finished waiting for log message 'Running kube-apiserver' from node 'k3d-test3-server-0' 
....
DEBU[0007] Waiting for node k3d-test3-server-1 to get ready (Log: 'k3s is up and running') 
TRAC[0007] NodeWaitForLogMessage: Node 'k3d-test3-server-1' waiting for log message 'k3s is up and running' since '2021-06-02 12:45:04.23375296 +0000 UTC

It took a while for k3d-test3-server-1 (the second server node) to start (it kept restarting several times before that). As a result, the other nodes (both servers and agents) could not start up. You can see the "Created" status in the docker ps output:

[screenshot: docker ps output with the k3d node containers stuck in "Created" status]
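
For reference, the status of the node containers can be listed with a plain docker ps call (a generic invocation, not the exact one from the screenshot):

docker ps -a --filter "name=k3d-test3"   # blocked nodes show up with STATUS "Created"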

I have tried purging all Docker packages and reinstalling them (I tried both docker-ce and the apt package). Nothing has worked so far; it stays stuck there forever. The weird thing is that my other Ubuntu servers did not have this problem, so I suspect this is a local issue - BUT I have tried reinstalling Docker and even switching to another Linux user just to make sure there is nothing in between k3d and Docker. Any suggestions for debugging are really appreciated.

k3d version

k3d version v4.4.4
k3s version v1.20.6-k3s1 (default)

OS version

LSB Version:	core-9.20170808ubuntu1-noarch:security-9.20170808ubuntu1-noarch
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.3 LTS
Release:	18.04
Codename:	bionic

Docker Info:

Client:
 Context:    default
 Debug Mode: false

Server:
 Containers: 11
  Running: 2
  Paused: 0
  Stopped: 9
 Images: 2
 Server Version: 20.10.2
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 
 runc version: 
 init version: 
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.4.0-66-generic
 Operating System: Ubuntu 18.04.3 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 31.26GiB
 Name: AMC_Test
 ID: HTTK:CWTG:YLKI:Q2AH:5MXF:MVOE:XHFV:V6BF:ZXDX:ZGZ2:EDWM:ZIIE
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No swap limit support
WARNING: No blkio weight support
WARNING: No blkio weight_device support

Additional logs

longnv@AMC_Test  ~  docker logs 0faef53a27b0           
time="2021-06-02T12:45:04.467620702Z" level=info msg="Starting k3s v1.20.6+k3s1 (8d043282)"
time="2021-06-02T12:45:04.517018890Z" level=fatal msg="starting kubernetes: preparing server: failed to get CA certs: https://k3d-test3-server-0:6443/cacerts: 503 Service Unavailable"
time="2021-06-02T12:45:05.755968864Z" level=info msg="Starting k3s v1.20.6+k3s1 (8d043282)"
time="2021-06-02T12:45:05.801646534Z" level=fatal msg="starting kubernetes: preparing server: failed to get CA certs: https://k3d-test3-server-0:6443/cacerts: 503 Service Unavailable"
time="2021-06-02T12:45:06.947352136Z" level=info msg="Starting k3s v1.20.6+k3s1 (8d043282)"
time="2021-06-02T12:45:06.993416610Z" level=fatal msg="starting kubernetes: preparing server: failed to get CA certs: https://k3d-test3-server-0:6443/cacerts: 503 Service Unavailable"
time="2021-06-02T12:45:08.305266405Z" level=info msg="Starting k3s v1.20.6+k3s1 (8d043282)"
time="2021-06-02T12:45:08.347054640Z" level=fatal msg="starting kubernetes: preparing server: failed to get CA certs: https://k3d-test3-server-0:6443/cacerts: 503 Service Unavailable"
time="2021-06-02T12:45:10.095231184Z" level=info msg="Starting k3s v1.20.6+k3s1 (8d043282)"
time="2021-06-02T12:45:10.144121231Z" level=warning msg="Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation."
time="2021-06-02T12:45:10.144182981Z" level=info msg="Managed etcd cluster not yet initialized"
time="2021-06-02T12:45:10.148323728Z" level=info msg="certificate CN=system:admin,O=system:masters signed by CN=k3s-client-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.148965934Z" level=info msg="certificate CN=system:kube-controller-manager signed by CN=k3s-client-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.149625481Z" level=info msg="certificate CN=system:kube-scheduler signed by CN=k3s-client-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.150251560Z" level=info msg="certificate CN=kube-apiserver signed by CN=k3s-client-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.150885419Z" level=info msg="certificate CN=system:kube-proxy signed by CN=k3s-client-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.151474780Z" level=info msg="certificate CN=system:k3s-controller signed by CN=k3s-client-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.152189659Z" level=info msg="certificate CN=cloud-controller-manager signed by CN=k3s-client-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.152901356Z" level=info msg="certificate CN=kube-apiserver signed by CN=k3s-server-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.153597948Z" level=info msg="certificate CN=system:auth-proxy signed by CN=k3s-request-header-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.154228404Z" level=info msg="certificate CN=etcd-server signed by CN=etcd-server-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.154830831Z" level=info msg="certificate CN=etcd-client signed by CN=etcd-server-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.155457319Z" level=info msg="certificate CN=etcd-peer signed by CN=etcd-peer-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.157501228Z" level=info msg="certificate CN=k3s,O=k3s signed by CN=k3s-server-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.158114939Z" level=info msg="Active TLS secret  (ver=) (count 8): map[listener.cattle.io/cn-0.0.0.0:0.0.0.0 listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-172.19.0.3:172.19.0.3 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/fingerprint:SHA1=8E02A6749D1D91B3184B0B5B3517BD3A59F159D6]"
time="2021-06-02T12:45:10.166925515Z" level=info msg="Adding https://172.19.0.3:2380 to etcd cluster [k3d-test3-server-0-673cddf7=https://172.19.0.2:2380]"
time="2021-06-02T12:45:10.171117654Z" level=info msg="Starting etcd for cluster [k3d-test3-server-0-673cddf7=https://172.19.0.2:2380 k3d-test3-server-1-d28439f8=https://172.19.0.3:2380]"
{"level":"info","ts":"2021-06-02T12:45:10.173Z","caller":"embed/etcd.go:117","msg":"configuring peer listeners","listen-peer-urls":["https://172.19.0.3:2380"]}
{"level":"info","ts":"2021-06-02T12:45:10.173Z","caller":"embed/etcd.go:468","msg":"starting with peer TLS","tls-info":"cert = /var/lib/rancher/k3s/server/tls/etcd/peer-server-client.crt, key = /var/lib/rancher/k3s/server/tls/etcd/peer-server-client.key, trusted-ca = /var/lib/rancher/k3s/server/tls/etcd/peer-ca.crt, client-cert-auth = true, crl-file = ","cipher-suites":[]}
{"level":"info","ts":"2021-06-02T12:45:10.174Z","caller":"embed/etcd.go:127","msg":"configuring client listeners","listen-client-urls":["https://127.0.0.1:2379","https://172.19.0.3:2379"]}
{"level":"info","ts":"2021-06-02T12:45:10.174Z","caller":"embed/etcd.go:302","msg":"starting an etcd server","etcd-version":"3.4.13","git-sha":"Not provided (use ./build instead of go build)","go-version":"go1.15.10","go-os":"linux","go-arch":"amd64","max-cpu-set":16,"max-cpu-available":16,"member-initialized":false,"name":"k3d-test3-server-1-d28439f8","data-dir":"/var/lib/rancher/k3s/server/db/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/rancher/k3s/server/db/etcd/member","force-new-cluster":false,"heartbeat-interval":"500ms","election-timeout":"5s","initial-election-tick-advance":true,"snapshot-count":100000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://localhost:2380"],"listen-peer-urls":["https://172.19.0.3:2380"],"advertise-client-urls":["https://172.19.0.3:2379"],"listen-client-urls":["https://127.0.0.1:2379","https://172.19.0.3:2379"],"listen-metrics-urls":["http://127.0.0.1:2381"],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"k3d-test3-server-0-673cddf7=https://172.19.0.2:2380,k3d-test3-server-1-d28439f8=https://172.19.0.3:2380","initial-cluster-state":"existing","initial-cluster-token":"etcd-cluster","quota-size-bytes":2147483648,"pre-vote":false,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":""}
{"level":"info","ts":"2021-06-02T12:45:10.177Z","caller":"etcdserver/backend.go:80","msg":"opened backend db","path":"/var/lib/rancher/k3s/server/db/etcd/member/snap/db","took":"2.328973ms"}
{"level":"warn","ts":"2021-06-02T12:45:10.245Z","caller":"etcdserver/cluster_util.go:76","msg":"failed to get cluster response","address":"https://172.19.0.2:2380/members","error":"Get \"https://172.19.0.2:2380/members\": EOF"}
{"level":"info","ts":"2021-06-02T12:45:10.246Z","caller":"embed/etcd.go:363","msg":"closing etcd server","name":"k3d-test3-server-1-d28439f8","data-dir":"/var/lib/rancher/k3s/server/db/etcd","advertise-peer-urls":["http://localhost:2380"],"advertise-client-urls":["https://172.19.0.3:2379"]}
{"level":"info","ts":"2021-06-02T12:45:10.247Z","caller":"embed/etcd.go:367","msg":"closed etcd server","name":"k3d-test3-server-1-d28439f8","data-dir":"/var/lib/rancher/k3s/server/db/etcd","advertise-peer-urls":["http://localhost:2380"],"advertise-client-urls":["https://172.19.0.3:2379"]}
time="2021-06-02T12:45:10.247410008Z" level=info msg="Running kube-apiserver --advertise-port=6443 --allow-privileged=true --anonymous-auth=false --api-audiences=https://kubernetes.default.svc.cluster.local,k3s --authorization-mode=Node,RBAC --bind-address=127.0.0.1 --cert-dir=/var/lib/rancher/k3s/server/tls/temporary-certs --client-ca-file=/var/lib/rancher/k3s/server/tls/client-ca.crt --enable-admission-plugins=NodeRestriction --etcd-cafile=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt --etcd-certfile=/var/lib/rancher/k3s/server/tls/etcd/client.crt --etcd-keyfile=/var/lib/rancher/k3s/server/tls/etcd/client.key --etcd-servers=https://127.0.0.1:2379 --feature-gates=ServiceAccountIssuerDiscovery=false --insecure-port=0 --kubelet-certificate-authority=/var/lib/rancher/k3s/server/tls/server-ca.crt --kubelet-client-certificate=/var/lib/rancher/k3s/server/tls/client-kube-apiserver.crt --kubelet-client-key=/var/lib/rancher/k3s/server/tls/client-kube-apiserver.key --profiling=false --proxy-client-cert-file=/var/lib/rancher/k3s/server/tls/client-auth-proxy.crt --proxy-client-key-file=/var/lib/rancher/k3s/server/tls/client-auth-proxy.key --requestheader-allowed-names=system:auth-proxy --requestheader-client-ca-file=/var/lib/rancher/k3s/server/tls/request-header-ca.crt --requestheader-extra-headers-prefix=X-Remote-Extra- --requestheader-group-headers=X-Remote-Group --requestheader-username-headers=X-Remote-User --secure-port=6444 --service-account-issuer=https://kubernetes.default.svc.cluster.local --service-account-key-file=/var/lib/rancher/k3s/server/tls/service.key --service-account-signing-key-file=/var/lib/rancher/k3s/server/tls/service.key --service-cluster-ip-range=10.43.0.0/16 --service-node-port-range=30000-32767 --storage-backend=etcd3 --tls-cert-file=/var/lib/rancher/k3s/server/tls/serving-kube-apiserver.crt --tls-private-key-file=/var/lib/rancher/k3s/server/tls/serving-kube-apiserver.key"
{"level":"warn","ts":"2021-06-02T12:45:10.247Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
{"level":"warn","ts":"2021-06-02T12:45:20.247Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
time="2021-06-02T12:45:20.247501638Z" level=info msg="Failed to test data store connection: context deadline exceeded"
{"level":"warn","ts":"2021-06-02T12:45:25.157Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}

{"level":"warn","ts":"2021-06-02T12:45:35.247Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
time="2021-06-02T12:45:35.247878656Z" level=info msg="Failed to test data store connection: context deadline exceeded"
{"level":"warn","ts":"2021-06-02T12:45:35.985Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 28 (6 by maintainers)

Most upvoted comments

@Jasper-Ben thanks, I checked your logs, and I think you have a different problem than mine. Your log has

F0605 09:33:43.803005 7 server.go:495] open /proc/sys/net/netfilter/nf_conntrack_max: permission denied

See https://github.com/rancher/k3d/issues/612; you may want to open a new issue if it does not help. In your case you've made it further, as all of your servers eventually started. The nf_conntrack_max fix may help!
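
For reference, the nf_conntrack_max workaround discussed in #612 usually boils down to raising the conntrack limit on the Docker host, so that kube-proxy inside the container does not need to write to the read-only /proc/sys. A minimal sketch; the value is purely illustrative:

# Run on the Docker host, not inside a k3d container
sudo sysctl -w net.netfilter.nf_conntrack_max=131072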

@nguyenvulong, so it's still just that one physical host where it's failing, while it works on all other hosts? Is there anything like ufw or firewalld running? You could try k3d cluster create issue-619 --k3s-arg="--cluster-init" to create a single-server cluster with etcd enabled, followed by k3d node create --trace server-619 --role server --cluster issue-619 to add the second server node to it. That way you have all the time you need to inspect the initializing server node for any issues, as you can already start using the cluster with kubectl. You can also use kubectl delete node and k3d node delete on the secondary server node to remove it and retry without having to re-create the first server node.
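
For clarity, the suggested flow as a command sequence (flags exactly as given in the comment; names are arbitrary examples, and the node names to delete can be looked up with kubectl get nodes / k3d node list):

# 1. Create a single-server cluster with embedded etcd enabled
k3d cluster create issue-619 --k3s-arg="--cluster-init"
# 2. Add a second server node to it, tracing it while it joins
k3d node create --trace server-619 --role server --cluster issue-619
# 3. To retry without re-creating the first server, drop the second one again
kubectl delete node <second-server-node-name>
k3d node delete <second-server-node-name>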

While this failing cluster creation is still running (meaning that both containers are still running), can you try running docker exec k3d-test-server-1 traceroute 172.18.0.2 and paste the output here, please?

Here it is @iwilltry42

[screenshot: traceroute output]

@iwilltry42

Seems like there is something there (kernel version or Docker version, or one of its deps) that doesn't work well here 🤔 Any chance you could try with the latest patch releases of the kernel and Docker? 🤔

I will update that machine to the latest and let you know within 24h. Thanks a lot for your time.

Can you link them here please, so we can try to connect the dots?

I linked them in this comment: https://github.com/rancher/k3d/issues/619#issuecomment-853854949. Please double-check, and if you think they are not similar I will remove them from my comment.

@Jasper-Ben thanks, I checked your logs, and I think you have a different problem than mine. Your log has

F0605 09:33:43.803005 7 server.go:495] open /proc/sys/net/netfilter/nf_conntrack_max: permission denied

See #612; you may want to open a new issue if it does not help. In your case you've made it further, as all of your servers eventually started. The nf_conntrack_max fix may help!

Thanks, I’ll have a look 🙂

Edit: The workaround did the trick in my case! 🎉 I hope you find a solution to your problem soon 😕