k3d: Cluster cannot be created because of (second) server node's issue
Original problem: I tried to create a cluster with multiple nodes on a single physical machine, using this command:
k3d cluster create -a3 -s3 test3 --trace
and the process got stuck forever at this point, while the cluster was firing up the second server node:
DEBU[0005] Waiting for node k3d-test3-server-0 to get ready (Log: 'Running kube-apiserver')
TRAC[0005] NodeWaitForLogMessage: Node 'k3d-test3-server-0' waiting for log message 'Running kube-apiserver' since '2021-06-02 12:45:02.516841645 +0000 UTC'
DEBU[0006] Finished waiting for log message 'Running kube-apiserver' from node 'k3d-test3-server-0'
....
DEBU[0007] Waiting for node k3d-test3-server-1 to get ready (Log: 'k3s is up and running')
TRAC[0007] NodeWaitForLogMessage: Node 'k3d-test3-server-1' waiting for log message 'k3s is up and running' since '2021-06-02 12:45:04.23375296 +0000 UTC
It took a while for k3d-test3-server-1 (the second server node) to start (it kept restarting several times before that). As a result, the other nodes (both servers and agents) could not start up; you can see their "created" status in the docker ps output.

I have tried to purge all Docker packages and reinstall them (I tried both docker-ce and the apt package). Nothing has worked so far; it stays stuck there forever. The weird thing is that my other Ubuntu servers do not have this problem, so I suspect it is a local issue - BUT I have already reinstalled Docker and even switched to another Linux user just to make sure there is nothing in between k3d and Docker. Any suggestions for debugging are really appreciated.
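For reference, a minimal inspection sketch for a stuck second server node. This is not a fix, just the commands I would use to look at it; it assumes the cluster name test3 from the command above and that k3d named the Docker network k3d-test3 (the default k3d-<clustername> scheme), which may differ on other setups:

# List all containers of this cluster and their status (look for ones stuck in "Created")
docker ps -a --filter "name=k3d-test3"

# Follow the logs of the restarting second server node to catch the fatal error
docker logs -f k3d-test3-server-1

# Check whether the first server answers the CA-cert request that the second
# server keeps failing on (the 503 in the logs below); the curl container is
# attached to the cluster network so the node hostname resolves
docker run --rm --network k3d-test3 curlimages/curl -sk -o /dev/null -w '%{http_code}\n' https://k3d-test3-server-0:6443/cacerts

# Clean up before retrying
k3d cluster delete test3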
k3d version
k3d version v4.4.4
k3s version v1.20.6-k3s1 (default)
OS version
LSB Version: core-9.20170808ubuntu1-noarch:security-9.20170808ubuntu1-noarch
Distributor ID: Ubuntu
Description: Ubuntu 18.04.3 LTS
Release: 18.04
Codename: bionic
Docker Info:
Client:
Context: default
Debug Mode: false
Server:
Containers: 11
Running: 2
Paused: 0
Stopped: 9
Images: 2
Server Version: 20.10.2
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
Default Runtime: runc
Init Binary: docker-init
containerd version:
runc version:
init version:
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 5.4.0-66-generic
Operating System: Ubuntu 18.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 31.26GiB
Name: AMC_Test
ID: HTTK:CWTG:YLKI:Q2AH:5MXF:MVOE:XHFV:V6BF:ZXDX:ZGZ2:EDWM:ZIIE
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No swap limit support
WARNING: No blkio weight support
WARNING: No blkio weight_device support
Additional logs
longnv@AMC_Test ~ docker logs 0faef53a27b0
time="2021-06-02T12:45:04.467620702Z" level=info msg="Starting k3s v1.20.6+k3s1 (8d043282)"
time="2021-06-02T12:45:04.517018890Z" level=fatal msg="starting kubernetes: preparing server: failed to get CA certs: https://k3d-test3-server-0:6443/cacerts: 503 Service Unavailable"
time="2021-06-02T12:45:05.755968864Z" level=info msg="Starting k3s v1.20.6+k3s1 (8d043282)"
time="2021-06-02T12:45:05.801646534Z" level=fatal msg="starting kubernetes: preparing server: failed to get CA certs: https://k3d-test3-server-0:6443/cacerts: 503 Service Unavailable"
time="2021-06-02T12:45:06.947352136Z" level=info msg="Starting k3s v1.20.6+k3s1 (8d043282)"
time="2021-06-02T12:45:06.993416610Z" level=fatal msg="starting kubernetes: preparing server: failed to get CA certs: https://k3d-test3-server-0:6443/cacerts: 503 Service Unavailable"
time="2021-06-02T12:45:08.305266405Z" level=info msg="Starting k3s v1.20.6+k3s1 (8d043282)"
time="2021-06-02T12:45:08.347054640Z" level=fatal msg="starting kubernetes: preparing server: failed to get CA certs: https://k3d-test3-server-0:6443/cacerts: 503 Service Unavailable"
time="2021-06-02T12:45:10.095231184Z" level=info msg="Starting k3s v1.20.6+k3s1 (8d043282)"
time="2021-06-02T12:45:10.144121231Z" level=warning msg="Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation."
time="2021-06-02T12:45:10.144182981Z" level=info msg="Managed etcd cluster not yet initialized"
time="2021-06-02T12:45:10.148323728Z" level=info msg="certificate CN=system:admin,O=system:masters signed by CN=k3s-client-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.148965934Z" level=info msg="certificate CN=system:kube-controller-manager signed by CN=k3s-client-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.149625481Z" level=info msg="certificate CN=system:kube-scheduler signed by CN=k3s-client-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.150251560Z" level=info msg="certificate CN=kube-apiserver signed by CN=k3s-client-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.150885419Z" level=info msg="certificate CN=system:kube-proxy signed by CN=k3s-client-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.151474780Z" level=info msg="certificate CN=system:k3s-controller signed by CN=k3s-client-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.152189659Z" level=info msg="certificate CN=cloud-controller-manager signed by CN=k3s-client-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.152901356Z" level=info msg="certificate CN=kube-apiserver signed by CN=k3s-server-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.153597948Z" level=info msg="certificate CN=system:auth-proxy signed by CN=k3s-request-header-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.154228404Z" level=info msg="certificate CN=etcd-server signed by CN=etcd-server-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.154830831Z" level=info msg="certificate CN=etcd-client signed by CN=etcd-server-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.155457319Z" level=info msg="certificate CN=etcd-peer signed by CN=etcd-peer-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.157501228Z" level=info msg="certificate CN=k3s,O=k3s signed by CN=k3s-server-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.158114939Z" level=info msg="Active TLS secret (ver=) (count 8): map[listener.cattle.io/cn-0.0.0.0:0.0.0.0 listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-172.19.0.3:172.19.0.3 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/fingerprint:SHA1=8E02A6749D1D91B3184B0B5B3517BD3A59F159D6]"
time="2021-06-02T12:45:10.166925515Z" level=info msg="Adding https://172.19.0.3:2380 to etcd cluster [k3d-test3-server-0-673cddf7=https://172.19.0.2:2380]"
time="2021-06-02T12:45:10.171117654Z" level=info msg="Starting etcd for cluster [k3d-test3-server-0-673cddf7=https://172.19.0.2:2380 k3d-test3-server-1-d28439f8=https://172.19.0.3:2380]"
{"level":"info","ts":"2021-06-02T12:45:10.173Z","caller":"embed/etcd.go:117","msg":"configuring peer listeners","listen-peer-urls":["https://172.19.0.3:2380"]}
{"level":"info","ts":"2021-06-02T12:45:10.173Z","caller":"embed/etcd.go:468","msg":"starting with peer TLS","tls-info":"cert = /var/lib/rancher/k3s/server/tls/etcd/peer-server-client.crt, key = /var/lib/rancher/k3s/server/tls/etcd/peer-server-client.key, trusted-ca = /var/lib/rancher/k3s/server/tls/etcd/peer-ca.crt, client-cert-auth = true, crl-file = ","cipher-suites":[]}
{"level":"info","ts":"2021-06-02T12:45:10.174Z","caller":"embed/etcd.go:127","msg":"configuring client listeners","listen-client-urls":["https://127.0.0.1:2379","https://172.19.0.3:2379"]}
{"level":"info","ts":"2021-06-02T12:45:10.174Z","caller":"embed/etcd.go:302","msg":"starting an etcd server","etcd-version":"3.4.13","git-sha":"Not provided (use ./build instead of go build)","go-version":"go1.15.10","go-os":"linux","go-arch":"amd64","max-cpu-set":16,"max-cpu-available":16,"member-initialized":false,"name":"k3d-test3-server-1-d28439f8","data-dir":"/var/lib/rancher/k3s/server/db/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/rancher/k3s/server/db/etcd/member","force-new-cluster":false,"heartbeat-interval":"500ms","election-timeout":"5s","initial-election-tick-advance":true,"snapshot-count":100000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://localhost:2380"],"listen-peer-urls":["https://172.19.0.3:2380"],"advertise-client-urls":["https://172.19.0.3:2379"],"listen-client-urls":["https://127.0.0.1:2379","https://172.19.0.3:2379"],"listen-metrics-urls":["http://127.0.0.1:2381"],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"k3d-test3-server-0-673cddf7=https://172.19.0.2:2380,k3d-test3-server-1-d28439f8=https://172.19.0.3:2380","initial-cluster-state":"existing","initial-cluster-token":"etcd-cluster","quota-size-bytes":2147483648,"pre-vote":false,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":""}
{"level":"info","ts":"2021-06-02T12:45:10.177Z","caller":"etcdserver/backend.go:80","msg":"opened backend db","path":"/var/lib/rancher/k3s/server/db/etcd/member/snap/db","took":"2.328973ms"}
{"level":"warn","ts":"2021-06-02T12:45:10.245Z","caller":"etcdserver/cluster_util.go:76","msg":"failed to get cluster response","address":"https://172.19.0.2:2380/members","error":"Get \"https://172.19.0.2:2380/members\": EOF"}
{"level":"info","ts":"2021-06-02T12:45:10.246Z","caller":"embed/etcd.go:363","msg":"closing etcd server","name":"k3d-test3-server-1-d28439f8","data-dir":"/var/lib/rancher/k3s/server/db/etcd","advertise-peer-urls":["http://localhost:2380"],"advertise-client-urls":["https://172.19.0.3:2379"]}
{"level":"info","ts":"2021-06-02T12:45:10.247Z","caller":"embed/etcd.go:367","msg":"closed etcd server","name":"k3d-test3-server-1-d28439f8","data-dir":"/var/lib/rancher/k3s/server/db/etcd","advertise-peer-urls":["http://localhost:2380"],"advertise-client-urls":["https://172.19.0.3:2379"]}
time="2021-06-02T12:45:10.247410008Z" level=info msg="Running kube-apiserver --advertise-port=6443 --allow-privileged=true --anonymous-auth=false --api-audiences=https://kubernetes.default.svc.cluster.local,k3s --authorization-mode=Node,RBAC --bind-address=127.0.0.1 --cert-dir=/var/lib/rancher/k3s/server/tls/temporary-certs --client-ca-file=/var/lib/rancher/k3s/server/tls/client-ca.crt --enable-admission-plugins=NodeRestriction --etcd-cafile=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt --etcd-certfile=/var/lib/rancher/k3s/server/tls/etcd/client.crt --etcd-keyfile=/var/lib/rancher/k3s/server/tls/etcd/client.key --etcd-servers=https://127.0.0.1:2379 --feature-gates=ServiceAccountIssuerDiscovery=false --insecure-port=0 --kubelet-certificate-authority=/var/lib/rancher/k3s/server/tls/server-ca.crt --kubelet-client-certificate=/var/lib/rancher/k3s/server/tls/client-kube-apiserver.crt --kubelet-client-key=/var/lib/rancher/k3s/server/tls/client-kube-apiserver.key --profiling=false --proxy-client-cert-file=/var/lib/rancher/k3s/server/tls/client-auth-proxy.crt --proxy-client-key-file=/var/lib/rancher/k3s/server/tls/client-auth-proxy.key --requestheader-allowed-names=system:auth-proxy --requestheader-client-ca-file=/var/lib/rancher/k3s/server/tls/request-header-ca.crt --requestheader-extra-headers-prefix=X-Remote-Extra- --requestheader-group-headers=X-Remote-Group --requestheader-username-headers=X-Remote-User --secure-port=6444 --service-account-issuer=https://kubernetes.default.svc.cluster.local --service-account-key-file=/var/lib/rancher/k3s/server/tls/service.key --service-account-signing-key-file=/var/lib/rancher/k3s/server/tls/service.key --service-cluster-ip-range=10.43.0.0/16 --service-node-port-range=30000-32767 --storage-backend=etcd3 --tls-cert-file=/var/lib/rancher/k3s/server/tls/serving-kube-apiserver.crt --tls-private-key-file=/var/lib/rancher/k3s/server/tls/serving-kube-apiserver.key"
{"level":"warn","ts":"2021-06-02T12:45:10.247Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
{"level":"warn","ts":"2021-06-02T12:45:20.247Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
time="2021-06-02T12:45:20.247501638Z" level=info msg="Failed to test data store connection: context deadline exceeded"
{"level":"warn","ts":"2021-06-02T12:45:25.157Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
{"level":"warn","ts":"2021-06-02T12:45:35.247Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
time="2021-06-02T12:45:35.247878656Z" level=info msg="Failed to test data store connection: context deadline exceeded"
{"level":"warn","ts":"2021-06-02T12:45:35.985Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 28 (6 by maintainers)
@Jasper-Ben thanks, I checked your logs, and I think you have a different problem than mine; in your log it has
See this https://github.com/rancher/k3d/issues/612 ; you may want to open a new issue if it does not help. In your case you've made it further, as all of your servers eventually started. The nf_conntrack_max fix may help!
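For context, assuming the nf_conntrack_max fix referred to above is the kube-proxy workaround documented for k3d (kube-proxy inside the node container failing to set net.netfilter.nf_conntrack_max), it usually comes down to one of these two options; treat this as a sketch, not a confirmed fix for this particular issue:

# Option 1 (host): raise the host value so kube-proxy does not need to change it
# (131072 is the commonly suggested example value; adjust as needed)
sudo sysctl -w net.netfilter.nf_conntrack_max=131072

# Option 2 (k3d v4 flag syntax; newer versions use different flag names):
# tell kube-proxy inside k3s not to touch the value at all
k3d cluster create test3 -s 3 -a 3 \
  --k3s-server-arg "--kube-proxy-arg=conntrack-max-per-core=0" \
  --k3s-agent-arg "--kube-proxy-arg=conntrack-max-per-core=0"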
@nguyenvulong , that's still on one single physical host where it's failing, while working on all other hosts? Is there anything like ufw or firewalld running? You could try with k3d cluster create issue-619 --k3s-arg="--cluster-init" to just create a single-server cluster with etcd enabled, followed by k3d node create --trace server-619 --role server --cluster issue-619 to add the second server node to it. That way you have all the time to first inspect the initializing server node for any issues, as you can already start using the cluster with kubectl. Also, you can use kubectl delete node and k3d node delete on the secondary server node to remove and retry without having to re-create the first server node.

Here it is @iwilltry42
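For readability, the full sequence suggested above as one sketch (the commands and the issue-619/server-619 names are taken from that comment; the exact generated node name is an assumption, so check it with k3d node list):

# 1. create a single-server cluster with embedded etcd enabled
k3d cluster create issue-619 --k3s-arg="--cluster-init"

# 2. the cluster is already usable, so it can be inspected while adding nodes
kubectl get nodes

# 3. add the second server node and watch the trace output
k3d node create --trace server-619 --role server --cluster issue-619

# 4. if it misbehaves, remove just that node and retry (the first server stays up);
#    k3d typically names the container/node k3d-server-619-0
kubectl delete node k3d-server-619-0
k3d node delete k3d-server-619-0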
@iwilltry42
I will update that machine to the latest and let you know within 24h. Thanks a lot for your time.
I linked them in this comment: https://github.com/rancher/k3d/issues/619#issuecomment-853854949 Please double-check, and if you think they are not similar I will remove them from my comment.
Thanks, I’ll have a look 🙂
Edit: Workaround did the trick in my case! 🎉 I hope you find a solution to your problem soon 😕