k3s: Unable to join an updated master node.

Environmental Info:

K3s Version: HA installation, mixed versions 1.19 & 1.20

k3s version v1.19.5+k3s2 (746cf403) (Master 1)
k3s version v1.20.0+k3s2 (2ea6b163), go version go1.15.5 (Master 2, which doesn’t work)

Node(s) CPU architecture, OS, and Version:

Linux noldork3sM1 5.4.79-v7+ #1373 SMP Mon Nov 23 13:22:33 GMT 2020 armv7l GNU/Linux
Linux noldork3sM2 5.4.79-v7+ #1373 SMP Mon Nov 23 13:22:33 GMT 2020 armv7l GNU/Linux
Linux noldork3sN1 5.4.79-v7+ #1373 SMP Mon Nov 23 13:22:33 GMT 2020 armv7l GNU/Linux (Worker node, k3s version v1.20.0+k3s2 (2ea6b163), go version go1.15.5. Joined and working)

Cluster Configuration:

2 Masters, 4 Workers

Describe the bug:

After updating noldork3sM2, the node is always in NotReady status:

noldork3sm2 NotReady control-plane,master

Steps To Reproduce:

  • Installed K3s:

    export K3S_KUBECONFIG_MODE="644"
    export INSTALL_K3S_EXEC=" --tls-san noldork3s.noldor.local --no-deploy servicelb --disable traefik --node-taint k3s-controlplane=true:NoSchedule --datastore-endpoint mysql://user:password@tcp(server:3306)/database"
    curl -sfL https://get.k3s.io | sh
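Presumably the second master was joined the same way, pointing at the same external datastore and using the cluster token from the first server. A minimal sketch of such a join, assuming the default token path; this is not part of the original steps:

    # On Master 1, read the cluster join token (default path; may differ if customized)
    sudo cat /var/lib/rancher/k3s/server/node-token

    # On Master 2, install with the same datastore endpoint and that token
    export K3S_KUBECONFIG_MODE="644"
    export K3S_TOKEN="<token from Master 1>"
    export INSTALL_K3S_EXEC=" --tls-san noldork3s.noldor.local --no-deploy servicelb --disable traefik --node-taint k3s-controlplane=true:NoSchedule --datastore-endpoint mysql://user:password@tcp(server:3306)/database"
    curl -sfL https://get.k3s.io | sh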

Expected behavior:

The node would join the cluster successfully.

Actual behavior:

Stuck in NotReady.

Additional context / logs:

Jan  8 12:15:30 localhost k3s[439]: E0108 12:15:30.212199     439 leaderelection.go:325] error retrieving resource lock kube-system/kube-scheduler: Get "https://127.0.0.1:6444/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=10s": context deadline exceeded
Jan  8 12:15:36 localhost k3s[439]: E0108 12:15:36.428889     439 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: Get "https://127.0.0.1:6444/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Jan  8 12:15:36 localhost k3s[439]: E0108 12:15:36.707739     439 controller.go:144] failed to ensure lease exists, will retry in 7s, error: Get "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/noldork3sm2?timeout=10s": context deadline exceeded
Jan  8 12:15:38 localhost k3s[439]: E0108 12:15:38.260039     439 leaderelection.go:325] error retrieving resource lock kube-system/cloud-controller-manager: the server was unable to return a response in the time allotted, but may still be processing the request (get leases.coordination.k8s.io cloud-controller-manager)
Jan  8 12:15:42 localhost k3s[439]: E0108 12:15:42.794165     439 leaderelection.go:325] error retrieving resource lock kube-system/kube-scheduler: Get "https://127.0.0.1:6444/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Jan  8 12:15:43 localhost k3s[439]: E0108 12:15:43.086538     439 storage_flowcontrol.go:137] failed creating mandatory flowcontrol settings: failed getting mandatory FlowSchema exempt due to the server was unable to return a response in the time allotted, but may still be processing the request (get flowschemas.flowcontrol.apiserver.k8s.io exempt), will retry later
Jan  8 12:15:44 localhost k3s[439]: E0108 12:15:44.760922     439 repair.go:118] unable to refresh the service IP block: the server was unable to return a response in the time allotted, but may still be processing the request (get services)
Jan  8 12:15:44 localhost k3s[439]: E0108 12:15:44.764442     439 repair.go:75] unable to refresh the port block: the server was unable to return a response in the time allotted, but may still be processing the request (get services)


About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 40 (5 by maintainers)

Most upvoted comments

@nicklasfrahm You’re my fucking hero 😄, as you recommended:

  • I removed the secret: kubectl delete secret coredns-token-cj77r -n kube-system
  • Deleted the pod: kubectl delete pod coredns-66c464876b-25kl9 -n kube-system
  • Everything is working flawlessly…

Many thanks!

There is a known issue with v1.20 on arm caused by a golang compiler bug. v1.20.2 will include a workaround: https://github.com/kubernetes/kubernetes/issues/97685

@i5Js I am not exactly a Kubernetes expert, but as @brandond said it could be something with the service account tokens. I am not sure if this might mess with stuff, but you could try to delete the service account token of the coredns pod and recreate it.

The service account token is stored in a secret: kubectl -n kube-system get secrets | grep "coredns". I don’t know the commands off the top of my head, but you might find something online. Later during the day I might have time to write up the commands.
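A hedged sketch of what those commands would probably look like, using the secret and pod names from this cluster; the k8s-app=kube-dns label is an assumption about the stock k3s CoreDNS deployment:

    # Find the CoreDNS service account token secret (the name suffix differs per cluster)
    kubectl -n kube-system get secrets | grep coredns
    # Delete the stale token secret; the controller recreates it with a fresh token
    kubectl -n kube-system delete secret coredns-token-cj77r
    # Delete the CoreDNS pod so it restarts and mounts the regenerated token
    kubectl -n kube-system delete pod coredns-66c464876b-25kl9
    # Verify CoreDNS comes back up (assumes the standard k8s-app=kube-dns label)
    kubectl -n kube-system get pods -l k8s-app=kube-dns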

https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/#use-kubectl-drain-to-remove-a-node-from-service

If you leave the node in the cluster during the maintenance operation, you need to run kubectl uncordon <node name> afterwards to tell Kubernetes that it can resume scheduling new pods onto the node.
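For reference, a minimal sketch of that drain/uncordon cycle, using this cluster’s node name as an example (on older kubectl releases the last flag is --delete-local-data rather than --delete-emptydir-data):

    # Evict workloads before maintenance; DaemonSet pods cannot be evicted, so ignore them
    kubectl drain noldork3sm2 --ignore-daemonsets --delete-emptydir-data
    # ...upgrade or reboot the node...
    # Allow Kubernetes to schedule new pods onto the node again
    kubectl uncordon noldork3sm2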

Hello @brandond, thanks for the info. Since I’m unable to restore my current cluster due to the certificate issue, even after downgrading the masters to 1.19, I’m going to reinstall it on 1.19 and wait for future versions.

Well, I’ve deployed a new cluster with VMs, all Debian 10: 2 masters and 2 workers. I’ve upgraded them, removed one node, rebooted another, and everything worked as expected. There are only two differences from my Pi installations: the masters don’t have any taint configuration, and the kernel is 4 instead of 5…

No reason to be sorry, I tend to get lost in technical gibberish. Sorry for that, it is not very inclusive. CI (continuous integration) “is the practice of merging all developers’ working copies to a shared mainline several times a day”, whereas CD (continuous delivery) is an approach to “produce software in short cycles, ensuring that the software can be reliably released at any time”, according to Wikipedia. This is often done by automating your builds, tests, and deployments via, for example, GitHub Actions or GitLab CI/CD. 😄 🚀

Awesome! I’m going to check it. I see you also have Armbian, like me.

Sorry for asking, but what does CI / CD mean?

No, I’m afraid not… I’m going to create a new cluster with VMs this time, try to upgrade it, and see if there are any issues. I’m a bit tired of recreating the cluster over and over on my Raspberry Pis.

So, downgrade from 1.20 to 1.19, right?

I am not sure about that. Try the following:

  1. Run sudo systemctl stop k3s on all nodes
  2. Run BINARY=k3s-armhf; sudo curl -SL https://github.com/rancher/k3s/releases/download/v1.19.5+k3s2/$BINARY -o /usr/local/bin/k3s (the semicolon ensures $BINARY is expanded by your shell), while making sure that the BINARY variable is set to the value for your architecture (see the sketch after this list for a variant that detects it automatically):
    • k3s-armhf for $(uname -m) == "armv7l"
    • k3s-arm64 for $(uname -m) == "aarch64"
    • k3s for $(uname -m) == "x86_64"
  3. Run sudo systemctl restart k3s on the control plane nodes
  4. Wait until all control plane nodes show up as ready and with the version v1.19.5+k3s2 by running watch kubectl get nodes
  5. Run sudo systemctl restart k3s on all worker nodes
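The download step could also pick the binary automatically on each node; a minimal sketch, assuming the same release URL and binary names as in the list above:

    # Hedged sketch: select the k3s v1.19.5+k3s2 binary based on the machine architecture
    case "$(uname -m)" in
      armv7l)  BINARY=k3s-armhf ;;
      aarch64) BINARY=k3s-arm64 ;;
      x86_64)  BINARY=k3s ;;
      *) echo "unsupported architecture: $(uname -m)" >&2; exit 1 ;;
    esac

    sudo systemctl stop k3s
    sudo curl -SL "https://github.com/rancher/k3s/releases/download/v1.19.5+k3s2/$BINARY" -o /usr/local/bin/k3s
    sudo chmod +x /usr/local/bin/k3s
    sudo systemctl restart k3s   # on workers, wait until the control plane nodes are Ready first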

I’ve restarted the services on node 1 and now I’m getting certificate issues all the time:

TLS handshake error from ip_node1:3721: remote error: tls: bad certificate
TLS handshake error from ip_node2:3721: remote error: tls: bad certificate
TLS handshake error from ip_node3:3721: remote error: tls: bad certificate
TLS handshake error from ip_node4:3721: remote error: tls: bad certificate
TLS handshake error from ip_node5:3721: remote error: tls: bad certificate

I guess my cluster is dead.

pi@noldork3sM1:~ $ kubectl get nodes -o wide
Unable to connect to the server: x509: certificate signed by unknown authority