k3s: Unable to connect to the server: x509: certificate signed by unknown authority - inconsistent behavior
Environmental Info:
K3s Version:
# k3s -v
k3s version v1.19.7+k3s1 (5a00e38d)
Node(s) CPU architecture, OS, and Version:
- Linux k3s-ya-1 3.10.0-1127.el7.x86_64 #1 SMP Tue Mar 31 23:36:51 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
- Linux k3s-ya-2 3.10.0-1127.el7.x86_64 #1 SMP Tue Mar 31 23:36:51 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
- Linux k3s-ya-3 3.10.0-1127.el7.x86_64 #1 SMP Tue Mar 31 23:36:51 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration:
3 masters
Describe the bug:
# kubectl get nodes
Unable to connect to the server: x509: certificate signed by unknown authority
[root@k3s-ya-1 ~]# k3s kubectl get nodes
Unable to connect to the server: x509: certificate signed by unknown authority
[root@k3s-ya-1 ~]#
However, the result is inconsistent. Sometimes the first master node works, but the 2nd and 3rd nodes return "Unable to connect to the server: x509: certificate signed by unknown authority".
Steps To Reproduce:
- Installed K3s:
etcd certs are copied into /root
First node - k3s-ya-1
k3s-uninstall.sh
export INSTALL_K3S_VERSION=v1.19.7+k3s1
export K3S_DATASTORE_CAFILE=/root/ca.crt
export K3S_DATASTORE_CERTFILE=/root/apiserver-etcd-client.crt
export K3S_DATASTORE_KEYFILE=/root/apiserver-etcd-client.key
export K3S_KUBECONFIG_OUTPUT=/root/kube.confg
export K3S_DATASTORE_ENDPOINT=https://etcd1.k8s:2379,https://etcd2.k8s,https://etcd3.k8s:2379
k3s.install server
# kubectl get nodes
Unable to connect to the server: x509: certificate signed by unknown authority
^^^ this result is inconsistent - sometimes works, sometimes not
cat /var/lib/rancher/k3s/server/node-token to get the token for use with the additional nodes.
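Not part of the original steps, but before installing the additional servers it can help to confirm that the copied etcd client certs actually authenticate against every endpoint configured above. A rough sketch using curl and the paths/endpoints from the exports; each /health call should report the member as healthy:
for ep in https://etcd1.k8s:2379 https://etcd2.k8s:2379 https://etcd3.k8s:2379; do
  echo "== $ep =="
  curl -s --cacert /root/ca.crt --cert /root/apiserver-etcd-client.crt --key /root/apiserver-etcd-client.key "$ep/health"
  echo
done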
2nd node - k3s-ya-2
k3s-uninstall.sh
export INSTALL_K3S_VERSION=v1.19.7+k3s1
export K3S_DATASTORE_ENDPOINT=https://etcd1.k8s:2379,https://etcd2.k8s,https://etcd3.k8s:2379
export K3S_DATASTORE_CAFILE=/root/ca.crt
export K3S_DATASTORE_CERTFILE=/root/apiserver-etcd-client.crt
export K3S_DATASTORE_KEYFILE=/root/apiserver-etcd-client.key
export K3S_TOKEN=--from first node--
export K3S_URL=https://k3s:6443
export K3S_KUBECONFIG_OUTPUT=/root/kube.confg
k3s.install server
# kubectl get nodes
NAME STATUS ROLES AGE VERSION
k3s-ya-1 Ready control-plane,master 2d5h v1.19.7+k3s1
k3s-ya-2 Ready control-plane,master 36h v1.19.7+k3s1
^^^ this time it worked - in the last 3 attempts the 2nd node didn't work but the 1st node did - go figure.
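Not from the original steps: since K3S_KUBECONFIG_OUTPUT writes the admin kubeconfig to /root/kube.confg, one way to rule out a stale or default kubeconfig is to point kubectl at that file (or at /etc/rancher/k3s/k3s.yaml) explicitly, e.g.:
kubectl --kubeconfig /root/kube.confg get nodes    # bypasses whatever ~/.kube/config currently points at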
3rd node - k3s-ya-3
k3s-uninstall.sh
export INSTALL_K3S_VERSION=v1.19.7+k3s1
export K3S_DATASTORE_ENDPOINT=https://etcd1.k8s:2379,https://etcd2.k8s,https://etcd3.k8s:2379
export K3S_DATASTORE_CAFILE=/root/ca.crt
export K3S_DATASTORE_CERTFILE=/root/apiserver-etcd-client.crt
export K3S_DATASTORE_KEYFILE=/root/apiserver-etcd-client.key
export K3S_TOKEN=--from first node--
export K3S_URL=https://k3s:6443
export K3S_KUBECONFIG_OUTPUT=/root/kube.confg
k3s.install server
# kubectl get nodes
NAME STATUS ROLES AGE VERSION
k3s-ya-1 Ready control-plane,master 2d5h v1.19.7+k3s1
k3s-ya-2 Ready control-plane,master 36h v1.19.7+k3s1
k3s-ya-3 Ready master 19s v1.19.7+k3s1
^^^ closer to what was expected - about half the time this instead yields Unable to connect to the server: x509: certificate signed by unknown authority
Expected behavior:
Consistent behavior after the k3s server is installed. kubectl should work without certificate errors across all nodes.
Actual behavior:
Inconsistent. Some nodes return Unable to connect to the server: x509: certificate signed by unknown authority, while others can connect. Uninstall and repeat - different results.
Yesterday the entire cluster was working as expected, with no errors across all nodes, and with Rancher installed and running another cluster as expected.
Today, every k3s node returns Unable to connect to the server: x509: certificate signed by unknown authority.
It’s almost like the certificates are playing musical chairs.
Additional context / logs:
Samples from /var/log/messages
Feb 8 20:55:42 k3s-ya-1 k3s: time="2021-02-08T20:55:42.901658387-05:00" level=info msg="Cluster-Http-Server 2021/02/08 20:55:42 http: TLS handshake error from 10.1.0.84:43082: remote error: tls: bad certificate"
Feb 8 20:55:43 k3s-ya-1 k3s: time="2021-02-08T20:55:43.012864767-05:00" level=info msg="Cluster-Http-Server 2021/02/08 20:55:43 http: TLS handshake error from 10.42.2.175:46490: remote error: tls: bad certificate"
Feb 8 20:56:37 k3s-ya-2 k3s: time="2021-02-08T20:56:37.629125982-05:00" level=info msg="Cluster-Http-Server 2021/02/08 20:56:37 http: TLS handshake error from 10.1.0.85:35180: remote error: tls: bad certificate"
Feb 8 20:56:37 k3s-ya-2 k3s: time="2021-02-08T20:56:37.840388714-05:00" level=info msg="Cluster-Http-Server 2021/02/08 20:56:37 http: TLS handshake error from 10.1.0.83:42518: remote error: tls: bad certificate"
Feb 8 20:57:49 k3s-ya-3 k3s: E0208 20:57:49.215716 829 event.go:273] Unable to write event: 'Patch "https://127.0.0.1:6443/api/v1/namespaces/kube-system/events/helm-install-traefik-4lncd.1661f1476f8d4e12": x509: certificate signed by unknown authority' (may retry after sleeping)
Feb 8 20:57:49 k3s-ya-3 k3s: time="2021-02-08T20:57:49.361122818-05:00" level=info msg="Connecting to proxy" url="wss://10.1.0.81:6443/v1-k3s/connect"
Feb 8 20:57:49 k3s-ya-3 k3s: time="2021-02-08T20:57:49.362442817-05:00" level=error msg="Failed to connect to proxy" error="x509: certificate signed by unknown authority"
Feb 8 20:57:49 k3s-ya-3 k3s: time="2021-02-08T20:57:49.362463456-05:00" level=error msg="Remotedialer proxy error" error="x509: certificate signed by unknown authority"
Feb 8 20:57:49 k3s-ya-3 k3s: time="2021-02-08T20:57:49.367212594-05:00" level=info msg="Connecting to proxy" url="wss://10.1.0.82:6443/v1-k3s/connect"
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 20 (8 by maintainers)
I had the same error message now after uninstalling and re-installing K3S. Turns out the problem was my ~/.kube/config was still referring to the old cluster. Delete that and then cp /etc/rancher/k3s/k3s.yaml ~/.kube/config to get the new context.

You don't need to set K3S_URL (--server) when using an external datastore; this is only for use when joining agents or using embedded etcd.
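A minimal sketch of that fix (the backup filename is only illustrative):
mv ~/.kube/config ~/.kube/config.old-cluster    # move the stale kubeconfig out of the way
cp /etc/rancher/k3s/k3s.yaml ~/.kube/config     # use the kubeconfig generated by the current install
kubectl get nodes                               # should now validate against the new cluster's CA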
I am curious how you came to have two nodes with the control-plane role label. This wasn’t added until 1.20, yet your nodes are all still on 1.19. Did you upgrade temporarily, and then downgrade again?
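Not from the original comment, but an easy way to see which node-role labels are actually set (and therefore where the control-plane entry in the ROLES column comes from):
kubectl get nodes --show-labels    # the node-role.kubernetes.io/* labels explain the ROLES column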
In the past I have seen behavior like this when servers were all brought up at the same time and raced to bootstrap the cluster CA certs, or when nodes were started up with existing certs from a different cluster that they then try to use instead of the ones recognized by the rest of the cluster.
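A hedged diagnostic sketch along those lines (not taken from the thread; paths assume a default k3s install, so verify them locally): compare the CA that signed the serving cert each node presents on 6443 with the server CA each node has stored on disk. Every node should report the same CA; each bootstrap generates a distinct CA, so a mismatch points at exactly the race or leftover-cert scenario described above.
for host in k3s-ya-1 k3s-ya-2 k3s-ya-3; do
  echo "== $host =="
  openssl s_client -connect $host:6443 </dev/null 2>/dev/null | openssl x509 -noout -issuer    # CA that signed the serving cert
done
openssl x509 -noout -subject -in /var/lib/rancher/k3s/server/tls/server-ca.crt   # run on each server and compare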
It sounds like these nodes have been through some odd things. I run my personal cluster with an external etcd and haven't had any problems with it; I suspect something in the way you started up, upgraded, or grew this cluster has left it very confused about which certificates to use.