k3s: Unable to join an updated master node.
Environmental Info:

K3s Version: HA installation, mixed versions 1.19 & 1.20
- k3s version v1.19.5+k3s2 (746cf403) (Master 1)
- k3s version v1.20.0+k3s2 (2ea6b163), go version go1.15.5 (Master 2, which doesn't work)

Node(s) CPU architecture, OS, and Version:
- Linux noldork3sM1 5.4.79-v7+ #1373 SMP Mon Nov 23 13:22:33 GMT 2020 armv7l GNU/Linux
- Linux noldork3sM2 5.4.79-v7+ #1373 SMP Mon Nov 23 13:22:33 GMT 2020 armv7l GNU/Linux
- Linux noldork3sN1 5.4.79-v7+ #1373 SMP Mon Nov 23 13:22:33 GMT 2020 armv7l GNU/Linux (worker node, k3s version v1.20.0+k3s2 (2ea6b163), go version go1.15.5; joined and working)

Cluster Configuration:
2 masters, 4 workers

Describe the bug:
After updating noldork3sM2, the node is always in NotReady status: noldork3sm2 NotReady control-plane,master

Steps To Reproduce:
- Installed K3s:

export K3S_KUBECONFIG_MODE="644"
export INSTALL_K3S_EXEC=" --tls-san noldork3s.noldor.local --no-deploy servicelb --disable traefik --node-taint k3s-controlplane=true:NoSchedule --datastore-endpoint mysql://user:password@tcp(server:3306)/database"
curl -sfL https://get.k3s.io | sh
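The report doesn't show the command used on the second master; for an external-datastore HA setup it would presumably be the same install plus the token from master 1, roughly like this (an assumption, not taken from the report):

```sh
# Assumed join command for master 2: same datastore endpoint,
# plus the server token generated on master 1.
export K3S_KUBECONFIG_MODE="644"
export K3S_TOKEN="<contents of /var/lib/rancher/k3s/server/token on master 1>"
export INSTALL_K3S_EXEC=" --tls-san noldork3s.noldor.local --no-deploy servicelb --disable traefik --node-taint k3s-controlplane=true:NoSchedule --datastore-endpoint mysql://user:password@tcp(server:3306)/database"
curl -sfL https://get.k3s.io | sh -
```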
Expected behavior:
The node joins the cluster successfully.

Actual behavior:
Stuck in NotReady.

Additional context / logs:
```
Jan 8 12:15:30 localhost k3s[439]: E0108 12:15:30.212199 439 leaderelection.go:325] error retrieving resource lock kube-system/kube-scheduler: Get "https://127.0.0.1:6444/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=10s": context deadline exceeded
Jan 8 12:15:36 localhost k3s[439]: E0108 12:15:36.428889 439 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: Get "https://127.0.0.1:6444/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Jan 8 12:15:36 localhost k3s[439]: E0108 12:15:36.707739 439 controller.go:144] failed to ensure lease exists, will retry in 7s, error: Get "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/noldork3sm2?timeout=10s": context deadline exceeded
Jan 8 12:15:38 localhost k3s[439]: E0108 12:15:38.260039 439 leaderelection.go:325] error retrieving resource lock kube-system/cloud-controller-manager: the server was unable to return a response in the time allotted, but may still be processing the request (get leases.coordination.k8s.io cloud-controller-manager)
Jan 8 12:15:42 localhost k3s[439]: E0108 12:15:42.794165 439 leaderelection.go:325] error retrieving resource lock kube-system/kube-scheduler: Get "https://127.0.0.1:6444/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Jan 8 12:15:43 localhost k3s[439]: E0108 12:15:43.086538 439 storage_flowcontrol.go:137] failed creating mandatory flowcontrol settings: failed getting mandatory FlowSchema exempt due to the server was unable to return a response in the time allotted, but may still be processing the request (get flowschemas.flowcontrol.apiserver.k8s.io exempt), will retry later
Jan 8 12:15:44 localhost k3s[439]: E0108 12:15:44.760922 439 repair.go:118] unable to refresh the service IP block: the server was unable to return a response in the time allotted, but may still be processing the request (get services)
Jan 8 12:15:44 localhost k3s[439]: E0108 12:15:44.764442 439 repair.go:75] unable to refresh the port block: the server was unable to return a response in the time allotted, but may still be processing the request (get services)
```
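The errors above all show requests to the local apiserver endpoints (127.0.0.1:6444 and :6443) timing out. A few hedged checks that could narrow this down on the affected master (not from the original report; ports and service name are k3s defaults):

```sh
# Does the local apiserver answer at all while the node sits in NotReady?
curl -sk https://127.0.0.1:6444/healthz; echo
curl -sk https://127.0.0.1:6443/healthz; echo

# Follow the k3s service log for recurring errors.
sudo journalctl -u k3s -f
```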
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 40 (5 by maintainers)
@nicklasfrahm You’re my fucking hero 😄, as you recommended:

kubectl delete secret coredns-token-cj77r -n kube-system
kubectl delete pod coredns-66c464876b-25kl9 -n kube-system

Many thanks!
There is a known issue with v1.20 on arm caused by a golang compiler issue. v1.20.2 will include a workaround: https://github.com/kubernetes/kubernetes/issues/97685
@i5Js I am not exactly a Kubernetes expert, but as @brandond said it could be something with the service account tokens. I am not sure if this might mess with stuff, but you could try to delete the service account token of the coredns pod and recreate it.
The service account token is stored in a secret:

kubectl -n kube-system get secrets | grep "coredns"

I don't know the commands off the top of my head, but you might find stuff online. Later during the day I might have time to write up the commands.
https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/#use-kubectl-drain-to-remove-a-node-from-service
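For reference, a minimal sketch of that delete-and-recreate sequence (the label selector and the secret/pod names are assumptions based on a default k3s install, and it assumes a single coredns pod):

```sh
# Find the coredns service account token secret and the coredns pod.
SECRET=$(kubectl -n kube-system get secrets | awk '/^coredns-token/ {print $1}')
POD=$(kubectl -n kube-system get pods -l k8s-app=kube-dns -o name)

# Deleting the secret makes the controller manager issue a fresh token;
# deleting the pod restarts coredns so it mounts the new token.
kubectl -n kube-system delete secret "$SECRET"
kubectl -n kube-system delete "$POD"
```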
Hello @brandond, thanks for the info. Since I'm unable to restore my current cluster due to the certificate issue, even after downgrading the masters to 1.19, I'm going to reinstall it on 1.19 and wait for future versions.
Well, I've deployed a new cluster with VMs, all Debian 10: 2 masters and 2 workers. I've upgraded them, removed one node, rebooted another, and everything worked as expected. There are only two differences from my Pi installation: the masters don't have any taint configuration, and the kernel is 4 instead of 5…
No reason to be sorry, I tend to get lost in technical gibberish. Sorry for that, it is not very inclusive. CI (continuous integration) “is the practice of merging all developers’ working copies to a shared mainline several times a day”, whereas CD (continuous delivery) is an approach to “produce software in short cycles, ensuring that the software can be reliably released at any time”, according to Wikipedia. This is often done by automating your builds, tests and deployments via, for example, GitHub Actions or GitLab CI/CD. 😄 🚀
Awesome! I'm going to check it. I see you also have Armbian, like me.
Sorry to ask, but what does CI / CD mean?
No, I'm afraid not… I'm going to create a new cluster with VMs this time, try to upgrade them, and see whether there are any issues. I'm a bit tired of recreating the cluster over and over on my Raspberry Pis.
So, downgrade from 1.20 to 1.19, right?
I am not sure about that. Try the following:
1. sudo systemctl stop k3s on all nodes.
2. Download the 1.19 binary: BINARY=k3s-armhf; sudo curl -SL https://github.com/rancher/k3s/releases/download/v1.19.5+k3s2/$BINARY -o /usr/local/bin/k3s, while making sure that the BINARY variable is set to:
   - k3s-armhf for $(uname -m) == "armv7l"
   - k3s-arm64 for $(uname -m) == "aarch64"
   - k3s for $(uname -m) == "x86_64"
3. sudo systemctl restart k3s on the control plane nodes.
4. Verify that the nodes report v1.19.5+k3s2 by running watch kubectl get nodes.
5. sudo systemctl restart k3s on all worker nodes.
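The list above could be scripted roughly as follows on each node (an untested sketch; the release tag and install path come from the steps above, the rest is an assumption about a systemd-based install):

```sh
# Pick the right binary name for this architecture.
case "$(uname -m)" in
  armv7l)  BINARY=k3s-armhf ;;
  aarch64) BINARY=k3s-arm64 ;;
  x86_64)  BINARY=k3s ;;
esac

sudo systemctl stop k3s
sudo curl -SL "https://github.com/rancher/k3s/releases/download/v1.19.5+k3s2/${BINARY}" -o /usr/local/bin/k3s
sudo chmod +x /usr/local/bin/k3s
sudo systemctl restart k3s   # control plane nodes first, then workers
```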
I've restarted the services on node 1 and now I'm getting certificate issues all the time:

TLS handshake error from ip_node1:3721: remote error: tls: bad certificate
TLS handshake error from ip_node2:3721: remote error: tls: bad certificate
TLS handshake error from ip_node3:3721: remote error: tls: bad certificate
TLS handshake error from ip_node4:3721: remote error: tls: bad certificate
TLS handshake error from ip_node5:3721: remote error: tls: bad certificate

I guess my cluster is dead.
pi@noldork3sM1:~ $ kubectl get nodes -o wide
Unable to connect to the server: x509: certificate signed by unknown authority
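For anyone hitting the same wall, one hedged way to confirm a CA mismatch like this is to compare fingerprints (paths are k3s defaults on a server node, not taken from the thread):

```sh
# CA that kubectl is configured to trust (from the local kubeconfig)...
grep certificate-authority-data /etc/rancher/k3s/k3s.yaml \
  | awk '{print $2}' | base64 -d | openssl x509 -noout -fingerprint

# ...versus the CA the k3s server is actually using.
sudo openssl x509 -in /var/lib/rancher/k3s/server/tls/server-ca.crt -noout -fingerprint
```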