rancher: [2.2.9] Rancher container restarting every 12 seconds, expired certificates

What kind of request is this (question/bug/enhancement/feature request):

Bug

Steps to reproduce (least amount of steps as possible):

  • Install Rancher v2.0.0, upgrade to v2.0.2 -> v2.0.4 -> v2.0.8
  • Upgrade to v2.1.6
  • One year after Rancher v2.0.0 was installed, certificates expire and cluster becomes “unavailable”
  • Upgrade to v2.1.9; did not fix certificate expiry/rotation issue
  • Upgrade to v2.2.2, certificates rotated and cluster is available again, everything working
  • One year after Rancher v2.2.2 was installed, the Rancher Server UI became unavailable due to the container restarting every 12 seconds
  • Perform a backup of /var/lib/rancher; two certs inside the backup are expired and Rancher does not auto-renew them:
    • /var/lib/rancher/management-state/tls/localhost.crt
    • /var/lib/rancher/management-state/tls/token-node.crt

(I think you could simulate the above timeline by setting the system clock to a date in the past and then moving it forward at the appropriate time to reproduce a ~1 year jump).
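
A rough sketch of that simulation on an Ubuntu host might look like this (the dates are illustrative only and not from my actual timeline; don't try this on a production server):

sudo timedatectl set-ntp off
sudo date --set="2019-05-01 09:00:00"
# install/upgrade Rancher as in the steps above, then jump forward roughly a year:
sudo date --set="2020-05-07 09:00:00"
sudo timedatectl set-ntp on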

Result:

Running Rancher v2.2.9 as a single Docker container install, the Rancher Server UI becomes unavailable (“connection refused” in the browser) and the container is restarting every 12 seconds. Rancher is unusable.

Environment information

  • Rancher version (rancher/rancher/rancher/server image tag or shown bottom left in the UI): rancher/rancher v2.2.9
  • Installation option (single install/HA): Single install (Docker container)

Possible Workarounds:

Workaround 1)

Set the system clock to a date in the past so that the certificate is not seen as expired. For me, on an Ubuntu server, that was achievable by disabling NTP and then setting the date and time manually;

sudo timedatectl set-ntp off
sudo date --set="2020-05-05 09:03:00.000"

This allowed the container to start up correctly and the Rancher Server UI was usable again, but this is only a short term workaround at best.

Workaround 2)

NOTE: I’m not advocating anyone use these commands on their particular installation, I’m just providing it as feedback for review by Rancher staff, because for me it solved the issue I was having…

This workaround was suggested to me by a community member on Rancher’s Slack.

rm /etc/kubernetes/ssl/*
rm /var/lib/rancher/management-state/certs/bundle.json
rm /var/lib/rancher/management-state/tls/token-node.crt
rm /var/lib/rancher/management-state/tls/localhost.crt

Inside the rancher container I did not have a /etc/kubernetes/ssl directory so I could not run that first command. The other three files did exist (and were originally visible inside the backup of /var/lib/rancher).

Actual command I ran to remove the files (NOTE: again, please don’t take this as advice, I’m just providing it for reference);

sudo docker exec -it acd7 sh -c "rm /var/lib/rancher/management-state/certs/bundle.json; rm /var/lib/rancher/management-state/tls/token-node.crt; rm /var/lib/rancher/management-state/tls/localhost.crt"

Then I enabled NTP again with sudo timedatectl set-ntp on to set the system clock back to the real/current time, and restarted the container with sudo docker restart acd7. Rancher started up correctly and was available again, clusters were visible (two AWS EC2 clusters attached to this server).
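
To confirm the certificates really were regenerated with a fresh validity period, something like the following should work (assuming openssl is available inside the container; acd7 is the container ID on my host):

sudo docker exec acd7 sh -c "openssl x509 -noout -startdate -enddate -in /var/lib/rancher/management-state/tls/localhost.crt"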

Other details that may be helpful:

Images on server

$ sudo docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
busybox             latest              020584afccce        6 months ago        1.22 MB
rancher/rancher     v2.2.9              944b5893d458        6 months ago        483 MB
rancher/rancher     v2.1.9              9a79850e485c        12 months ago       541 MB
rancher/rancher     v2.2.2              cb5cf64e84cc        12 months ago       495 MB
alpine              latest              caf27325b298        15 months ago       5.53 MB
rancher/rancher     v2.1.6              d14ff1038a54        15 months ago       542 MB
rancher/rancher     v2.0.8              817b51fbc1fc        20 months ago       529 MB
rancher/rancher     v2.0.4              975f0d475e47        22 months ago       530 MB
rancher/rancher     v2.0.2              88526c7bea4e        23 months ago       521 MB
rancher/rancher     v2.0.0              3141e5c66ee8        2 years ago         535 MB

Rancher Logs

When the problem first occurred, Rancher started up and then showed many “bad certificate”/“certificate has expired or is not yet valid” errors;

2020/05/07 07:15:22 [INFO] Rancher version v2.2.9 is starting
2020/05/07 07:15:22 [INFO] Rancher arguments {ACMEDomains:[redacted] AddLocal:auto Embedded:false KubeConfig: HTTPListenPort:80 HTTPSListenPort:443 K8sMode:auto Debug:false NoCACerts:false ListenConfig:<nil> AuditLogPath:/var/log/auditlog/rancher-api-audit.log AuditLogMaxage:10 AuditLogMaxsize:100 AuditLogMaxbackup:10 AuditLevel:0}
2020/05/07 07:15:22 [INFO] Listening on /tmp/log.sock
2020/05/07 07:15:22 [INFO] Running etcd --data-dir=management-state/etcd
...
I0507 07:15:24.805853       5 naming_controller.go:284] Starting NamingConditionController
I0507 07:15:24.805873       5 establishing_controller.go:73] Starting EstablishingController
2020/05/07 07:15:24 [INFO] Waiting for server to become available: Get https://localhost:6443/version?timeout=30s: x509: certificate has expired or is not yet valid
2020-05-07 07:15:24.815346 I | http: TLS handshake error from 127.0.0.1:43826: remote error: tls: bad certificate
2020-05-07 07:15:24.828573 I | http: TLS handshake error from 127.0.0.1:43876: remote error: tls: bad certificate
E0507 07:15:24.856329       5 reflector.go:134] k8s.io/client-go/informers/factory.go:127: Failed to list *v1.ReplicaSet: Get https://localhost:6443/apis/apps/v1/replicasets?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
E0507 07:15:24.857714       5 reflector.go:134] k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:178: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=status.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded&limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
E0507 07:15:24.861677       5 reflector.go:134] k8s.io/client-go/informers/factory.go:127: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
E0507 07:15:24.862446       5 reflector.go:134] k8s.io/client-go/informers/factory.go:127: Failed to list *v1.PersistentVolume: Get https://localhost:6443/api/v1/persistentvolumes?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
E0507 07:15:24.863244       5 reflector.go:134] k8s.io/client-go/informers/factory.go:127: Failed to list *v1.PersistentVolumeClaim: Get https://localhost:6443/api/v1/persistentvolumeclaims?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
2020-05-07 07:15:24.863976 I | http: TLS handshake error from 127.0.0.1:43888: remote error: tls: bad certificate
E0507 07:15:24.864317       5 reflector.go:134] k8s.io/client-go/informers/factory.go:127: Failed to list *v1.ReplicationController: Get https://localhost:6443/api/v1/replicationcontrollers?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
...
E0507 07:15:33.926893       5 reflector.go:134] k8s.io/client-go/informers/factory.go:127: Failed to list *v1.StorageClass: Get https://localhost:6443/apis/storage.k8s.io/v1/storageclasses?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
2020-05-07 07:15:33.926916 I | http: TLS handshake error from 127.0.0.1:44320: remote error: tls: bad certificate
E0507 07:15:33.932574       5 reflector.go:134] k8s.io/client-go/informers/factory.go:127: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
2020-05-07 07:15:33.932599 I | http: TLS handshake error from 127.0.0.1:44324: remote error: tls: bad certificate
2020-05-07 07:15:34.822709 I | http: TLS handshake error from 127.0.0.1:44328: remote error: tls: bad certificate
2020-05-07 07:15:34.825263 I | http: TLS handshake error from 127.0.0.1:44332: remote error: tls: bad certificate
F0507 07:15:34.825392       5 controllermanager.go:184] error building controller context: failed to wait for apiserver being healthy: timed out waiting for the condition: failed to get apiserver /healthz status: Get https://localhost:6443/healthz?timeout=32s: x509: certificate has expired or is not yet valid

I also have a copy of the logs showing the first start up after Workaround 2 above was performed, I can provide this on request if needed.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 43
  • Comments: 81

Most upvoted comments

I was able to solve this issue by following the steps from @dnauck, but I had to do some other tweaks, as the Rancher version I was using also provisioned a k3s environment inside the Rancher docker container, where the K8s secrets were saved.

  1. sudo timedatectl set-ntp off
  2. sudo date --set="2021-03-30 09:03:00.000" (date before expiration)
  3. sudo docker exec -it rancher sh -c "rm /var/lib/rancher/k3s/server/tls/dynamic-cert.json"
  4. sudo docker exec -it rancher /bin/bash (running the kubectl inside the docker container for K3s)
  5. kubectl delete secret -n kube-system k3s-serving --insecure-skip-tls-verify (to bypass the expired certificate)
  6. sudo timedatectl set-ntp on
  7. sudo docker restart rancher

Soon after, reviewing the logs (docker logs rancher --tail=100), I was able to confirm the cert was generated again and I was able to access the UI successfully.
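
For anyone who wants to double-check in the same way, a command along these lines should print the new expiry date of the regenerated cert (the container name rancher and the k3s-serving TLS secret layout are assumptions based on the steps above):

sudo docker exec rancher k3s kubectl -n kube-system get secret k3s-serving -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -enddate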

I had the same issue on Rancher 2.5.7, but none of the workarounds provided here were working.

Got a new workaround from Leo on Slack:

  1. sudo timedatectl set-ntp off
  2. sudo date --set="2021-03-30 09:03:00.000" (date before expiration)
  3. sudo docker exec -it rancher sh -c "rm /var/lib/rancher/k3s/server/tls/dynamic-cert.json"
  4. kubectl delete secret -n kube-system k3s-serving (from the rancher cluster manager ui via Kubectl button)
  5. sudo timedatectl set-ntp on
  6. sudo docker restart rancher

Well I had to dive further into this today in order to get our Rancher instance back up and running.

Unfortunately the steps from @OneideLuizSchneider wouldn’t work for me on Rancher 2.3.3; K3s would fail to start (as seen above) when restarting the container. But I did manage to solve it (by upgrading to 2.3.9 and then 2.4.8) using the following steps:

  1. Take a back-up of your rancher instance just in case
  2. sudo timedatectl set-ntp off
  3. sudo date --set="2020-10-16 09:03:00.000" (or any date at which your certificate is still valid but less than 90 days from expiring)
  4. sudo docker stop rancher (v2.3.x)
  5. Upgrade to Rancher v2.3.9 (let it start, catch up etc).
  6. sudo docker stop rancher (v2.3.9)
  7. Upgrade to Rancher v2.4.8 (give it a few minutes to start, catch up etc).
  8. sudo docker stop rancher (v2.4.8)
  9. sudo timedatectl set-ntp on
  10. Start your Rancher v2.4.8 instance again

There’s no need to remove localhost.crt and token-node.crt in this process. The reason this works (I believe) is that Rancher uses K3s under the hood. Rancher 2.3.x uses K3s v0.8.0, but automatic certificate renewal wasn’t introduced until K3s v0.10.0 (see: https://github.com/rancher/k3s/issues/1621). Rancher v2.4.8, however, uses k3s version v1.17.2+k3s1 (cdab19b0), which will thus renew the certificates on boot.
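
If you want to check which k3s version your Rancher container actually embeds before choosing this route, something like this should do it (the container name rancher is an assumption):

sudo docker exec rancher k3s --version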

In case somebody else runs into this issue with 2.3.x and can’t upgrade: Don’t attempt replacing the k3s binary inside the rancher container with v0.10.0, it won’t work. It fails with a -storage-backend flag provided but not defined error.

After reading through this thread and https://github.com/rancher/rancher/issues/31404, and posting the issue in Rancher Slack, I got a very clear solution from David Holder. There is no need to change the time setting or remove the certificates under /var/lib/rancher/k3s/server/tls, since none of those certificates were expired yet, and there is no need to upgrade. The only expired certificate was the one on the k3s cluster, the k3s-serving secret.

My Rancher setup: single instance using docker, version 2.4.5. In v2.4.5, there is no tls directory inside /var/lib/rancher/management-state.

sudo docker exec -it <container id> sh -c "rm /var/lib/rancher/k3s/server/tls/dynamic-cert.json"
sudo docker exec -it <container id> k3s kubectl --insecure-skip-tls-verify=true delete secret -n kube-system k3s-serving
sudo docker restart <container id>

The key point here is --insecure-skip-tls-verify=true: since the certificate is expired, I could not access any resource inside k3s, let alone delete one.

Now my Rancher UI is working as it was before.

Hope this helps.

I just got the same issue here on my cluster. Steps:

  1. sudo timedatectl set-ntp off
  2. sudo date --set="2020-07-11 09:03:00.000"
  3. sudo docker exec -it rancher sh -c "rm /var/lib/rancher/management-state/tls/token-node.crt; rm /var/lib/rancher/management-state/tls/localhost.crt"
  4. sudo timedatectl set-ntp on
  5. sudo docker restart rancher

Thanks @justincarter

We just ran into this issue with Rancher 2.3.3. The fix for us was effectively the solution posted by @cjohn001. We only had to remove the k3s internal certificates and not bother touching anything else (nor did we need to fiddle with system time/date).

Posting the steps taken if Rancher is running as a docker container in case someone else finds this handy. Regardless of how Rancher was running, the gist for us was to create a backup and ensure everything under k3s/server/tls was empty:

docker stop rancher-server
docker start rancher-server
docker exec -it rancher-server sh -c "mv k3s/server/tls k3s/server/tls.bak"
docker logs --tail 3 rancher-server
# Something similar to the below will appear: 
# 2021/01/03 03:07:01 [INFO] Waiting for server to become available: Get https://localhost:6443/version?timeout=30s: x509: certificate signed by unknown authority
# 2021/01/03 03:07:03 [INFO] Waiting for server to become available: Get https://localhost:6443/version?timeout=30s: x509: certificate signed by unknown authority
# 2021/01/03 03:07:05 [INFO] Waiting for server to become available: Get https://localhost:6443/version?timeout=30s: x509: certificate signed by unknown authority
docker stop rancher-server
docker start rancher-server

This worked for us (Rancher 2.5.2)

# delete certificate template to force re-generation
sudo docker exec -it rancher sh -c "rm /var/lib/rancher/k3s/server/tls/dynamic-cert.json"

# delete the currently deployed cert
sudo docker exec -it rancher k3s kubectl delete secret -n kube-system k3s-serving

# restart rancher, this triggers the cert re-generation and brings rancher back to life
sudo docker restart rancher

After this, Rancher Server + UI went back to normal, just as if nothing had ever happened.

If anyone has a similar problem on v2.5.8, I had to delete the expired serving-cert and cattle-webhook-tls certificates in the local cluster’s System project under cattle-system namespace, restart the Rancher container and wait for several minutes.

Thanks to rancher, I’m now off to learning a new kubernetes cluster, just need to figure out which one! I’m posting this just so people realize not everyone was successful fixing this issue, so if that’s your case, you’re not alone! I used rancher for years and was always pretty impressed by it, but this one killed it for me.

Our Rancher UI became available again after following @cepefernando’s post, but the clusters still remained in the “unavailable” state because the certificate was still expired; it somehow did not rotate as we imagined it should (this issue: https://github.com/rancher/rancher/issues/14731). To fix the problem we did the following workaround:

  1. Create your own certificates with a CA; you will need the following files: cert.key, cert.pem and cacerts.pem, and the key should not be password protected. Either do a 10-year self-signed cert or use whatever you do normally. The certificates should include the IP: “cluster ip” and DNS: “cluster name” and all other known SAN aliases in the CSR (see the openssl sketch after this list). If this is not done correctly, kubectl will not allow a secure connection to the cluster.
  2. Copy the certificates to the server running rancher in /somedir/certs/
  3. Run sudo docker exec -it rancher sh -c "rm /var/lib/rancher/k3s/server/tls/dynamic-cert.json"
  4. Run sudo docker exec -it rancher k3s kubectl delete secret -n kube-system k3s-serving
  5. Stop and remove the running rancher container
  6. Mount the certs to the rancher container and run it, like this: docker run -d --name=rancher --restart=unless-stopped -p 443:443 -v /somedir/rancher:/var/lib/rancher -v /somedir/certs/cert.key:/etc/rancher/ssl/key.pem -v /somedir/certs/cert.pem:/etc/rancher/ssl/cert.pem -v /somedir/certs/cacerts.pem:/etc/rancher/ssl/cacerts.pem rancher/rancher:v2.X.X
  7. Now you should be able to login to the UI and see the certificate has changed
  8. Navigate to the Cluster -> Nodes -> Edit Cluster -> copy the join cluster string, remember to select the appropriate roles
  9. Logon to a node, that is joined to the cluster and paste in the “join string” and wait for it to re-create the agent and kubelet containers
  10. Your cluster should now become available
  11. Navigate to the Cluster -> Nodes -> Edit Cluster -> there is a --ca-checksum string in the join cluster command, copy it.
  12. Edit the cattle-node-agent and cattle-cluster-agent, in the variables paste the new ca-checksum string from before and save
  13. Your cluster should be ready with the updated CA
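
For step 1, a rough openssl sketch for producing a 10-year self-signed CA and a server certificate with the required SANs could look like this (the hostname, IP and file names are placeholders, adapt them to your cluster):

# create a 10-year CA
openssl genrsa -out ca.key 4096
openssl req -x509 -new -nodes -key ca.key -sha256 -days 3650 -subj "/CN=my-rancher-ca" -out cacerts.pem
# create the server key and CSR
openssl genrsa -out cert.key 4096
openssl req -new -key cert.key -subj "/CN=rancher.example.com" -out cert.csr
# sign the CSR with the CA, including the SANs mentioned in step 1
printf "subjectAltName = DNS:rancher.example.com, IP:10.0.0.10\n" > san.cnf
openssl x509 -req -in cert.csr -CA cacerts.pem -CAkey ca.key -CAcreateserial -days 3650 -sha256 -extfile san.cnf -out cert.pem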

Hello everyone, I assume I have the same problem here with Rancher 2.3.1.

https://forums.rancher.com/t/rancher-container-keeps-restarting/18654

Unfortunately I do not have the tls certificate folder under /var/lib/rancher/management-state/tls

However, I found certificates under the following path: sudo ls k3s/server/tls. Should I delete all those certs, and would I have to delete *.crt, *.key and the temporary-certs folder?

By the way, I am running on RancherOS and there seems to be no timedatectl command available. How can I install it?

client-admin.crt client-auth-proxy.key client-controller.crt client-kube-apiserver.key client-kubelet.key request-header-ca.crt server-ca.key serving-kube-apiserver.key client-admin.key client-ca.crt client-controller.key client-kube-proxy.crt client-scheduler.crt request-header-ca.key service.key serving-kubelet.key client-auth-proxy.crt client-ca.key client-kube-apiserver.crt client-kube-proxy.key client-scheduler.key server-ca.crt serving-kube-apiserver.crt temporary-certs

Ok, solved the problem by deleting all files and the temporary-certs folder in the tls folder and then doing a server restart. Running containers needed to be restarted in order to get things up again.

As a feature request, I’d appreciate it if the Rancher API would contain the cluster certificate expiry dates. Currently the API shows when the cluster was created (“created”) but no information about the Kubernetes(-internal) certificates. This would greatly help with adding this to monitoring (e.g. with the check_rancher2 monitoring plugin).

Same here! The following helped:

  • sudo docker exec -it rancher sh -c "rm /var/lib/rancher/management-state/tls/token-node.crt; rm /var/lib/rancher/management-state/tls/localhost.crt"

For anyone still running Rancher as Single Node Docker container, I’ve just published a guide on how you can migrate your Single Node Docker installation to a (HA) Kubernetes / K3s cluster on my blog. Unlike the documentation of Rancher, this guide will actually cover every step of the way to get you migrated, along with the hoops you’ll have to jump to because Rancher is running as a container on Docker.

This migration wasn’t possible prior to v2.5.0, but it’s officially supported since then (though, as mentioned in the guide, I would advise running a version prior to v2.5.8 if you still need to upgrade).

It also covers automatically restarting K3s (using a CRON job) every 14 days so your certificates will always be renewed in time. It won’t fix your current issue, but hopefully will help you avoid running into it in the future.
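
For the cron part, an illustrative entry on the host could look like this (not the exact entry from the guide, just a sketch; it assumes a systemd-managed K3s install, so adjust the service name or swap in a docker restart if you run the single-node container):

# /etc/cron.d/k3s-restart — restart K3s on the 1st and 15th of each month at 03:00
0 3 1,15 * * root systemctl restart k3s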

Ok, I have been trying to renew my certificate for a week now, but nothing worked. There was one little secret missing:

Rancher Version v2.5.9

Had to delete dynamic-cert.json, the k3s-serving secret in the kube-system namespace and the serving-cert secret in the cattle-system namespace. Restart the container after this and done.
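
For reference, a minimal command-line sketch of those steps on a single-node Docker install might be (the container name rancher is an assumption):

sudo docker exec rancher sh -c "rm /var/lib/rancher/k3s/server/tls/dynamic-cert.json"
sudo docker exec rancher k3s kubectl delete secret -n kube-system k3s-serving
sudo docker exec rancher k3s kubectl delete secret -n cattle-system serving-cert
sudo docker restart rancher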

Beyond annoying. Look at how old this issue is.

After restoring the Rancher UI instance, I faced another issue: the cluster was stuck in the updating status.

The Rancher logs:

2021/06/30 15:50:42 [INFO] Provisioning cluster [c-mdbtk]
2021/06/30 15:50:42 [INFO] Restoring cluster [c-mdbtk] from backup
2021/06/30 15:50:47 [INFO] kontainerdriver rancherkubernetesengine listening on address 127.0.0.1:43581
2021/06/30 15:50:47 [INFO] Restoring etcd snapshot c-mdbtk-rl-hzzrp_2021-06-14T10:50:04Z
2021/06/30 15:50:47 [INFO] Successfully Deployed state file at [management-state/rke/rke-984197598/cluster.rkestate]
2021/06/30 15:50:47 [INFO] [dialer] Setup tunnel for host [192.168.199.5]
2021/06/30 15:50:47 [WARNING] Failed to set up SSH tunneling for host [192.168.199.5]: Can't establish dialer connection: can not build dialer to [c-mdbtk:m-cc1ad23d7744]
2021/06/30 15:50:47 [WARNING] Removing host [192.168.199.5] from node lists
2021/06/30 15:50:47 [INFO] kontainerdriver rancherkubernetesengine stopped

Can anyone help?

In my case, we were running Rancher as a single node docker container using a Let’s Encrypt certificate (--acme-domain install). The container is hosted on an Ubuntu VM running in Azure. There was an inbound rule on the NSG of this VM to deny all HTTP traffic (port 80). This was preventing the certificate renewal. I deleted this rule and added a new one to allow traffic temporarily. After this, forcing a certificate regeneration by following the steps from @GameScripting solved it for me.

For those who use the Let’s Encrypt certificate from Rancher, ensure that there is no firewall appliance or resource like an NSG that is blocking traffic on port 80/443 during the certificate renewal. Whitelisting the required IP addresses is an option. In my case, I just temporarily allowed all HTTP traffic to flow to the VM and later blocked it again after the renewal was completed.

The command below can be used to verify the validity dates of the certificate:

openssl s_client -connect localhost:443 -showcerts </dev/null 2>&1 | openssl x509 -noout -startdate -enddate

Just putting this here so that it helps someone in the future.
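
If you use --acme-domain, a quick way to check beforehand that the HTTP-01 challenge path is reachable from outside could be something like this (the hostname is a placeholder; a 404 response is fine, what matters is that port 80 is not blocked):

curl -I http://rancher.example.com/.well-known/acme-challenge/test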

Here is how my case was solved:

  1. Step I: SSH to your rancher server and run rm -Rf /var/lib/rancher-volume/k3s/server/tls # [rancher-volume] is the volume name I used when setting up rancher: docker run -d -p 20443:443 --name=rancher-master --restart=unless-stopped -v /var/lib/rancher-volume/:/var/lib/rancher

  2. Step II: Remove the current rancher container: docker rm -f rancher-master # [rancher-master] is the name of my rancher container

  3. Step III: Re-create the rancher container: docker run -d -p 20443:443 --name=rancher-master --restart=unless-stopped -v /var/lib/rancher-volume/:/var/lib/rancher rancher:stable # same version you deleted, or whatever version you need

  4. Step IV: sudo reboot, wait for 5 minutes and it’s back alive 👍

I also hit this issue on 2.3.1. There is no file in /var/lib/rancher/management-state/tls. Reading this issue points us to the k3s/server/tls dir. Should I delete all those certs, and would I have to delete *.crt, *.key and the temporary-certs folder? Or simply upgrade Rancher as suggested?

Not really clear.

I’m having the same issue as @amagura in #29475. I tried the same thing that @OneideLuizSchneider describes on Rancher v2.3.3, but it still fails because the k3s certificates are bad before it even gets to refreshing them, I guess? Anything else I should delete in order to have it regenerate them?

Logs:

2020/10/17 17:21:48 [INFO] Rancher version v2.3.3 is starting
2020/10/17 17:21:48 [INFO] Rancher arguments {ACMEDomains:[rancher.example.com] AddLocal:auto Embedded:false KubeConfig: HTTPListenPort:80 HTTPSListenPort:443 K8sMode:auto Debug:false NoCACerts:false ListenConfig:<nil> AuditLogPath:/var/log/auditlog/rancher-api-audit.log AuditLogMaxage:10 AuditLogMaxsize:100 AuditLogMaxbackup:10 AuditLevel:0 Features:}
2020/10/17 17:21:48 [INFO] Listening on /tmp/log.sock
2020/10/17 17:21:48 [INFO] Running etcd --data-dir=management-state/etcd
2020-10-17 17:21:48.866613 W | pkg/flags: unrecognized environment variable ETCD_URL_arm64=https://github.com/etcd-io/etcd/releases/download/v3.3.14/etcd-v3.3.14-linux-arm64.tar.gz
2020-10-17 17:21:48.866720 W | pkg/flags: unrecognized environment variable ETCD_URL_amd64=https://github.com/etcd-io/etcd/releases/download/v3.3.14/etcd-v3.3.14-linux-amd64.tar.gz
2020-10-17 17:21:48.866868 W | pkg/flags: unrecognized environment variable ETCD_UNSUPPORTED_ARCH=amd64
2020-10-17 17:21:48.866910 W | pkg/flags: unrecognized environment variable ETCD_URL=ETCD_URL_amd64
2020-10-17 17:21:48.866953 I | etcdmain: etcd Version: 3.3.14
2020-10-17 17:21:48.866990 I | etcdmain: Git SHA: 5cf5d88a1
2020-10-17 17:21:48.867025 I | etcdmain: Go Version: go1.12.9
2020-10-17 17:21:48.867045 I | etcdmain: Go OS/Arch: linux/amd64
2020-10-17 17:21:48.867079 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2020-10-17 17:21:48.867317 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2020-10-17 17:21:48.867780 I | embed: listening for peers on http://localhost:2380
2020-10-17 17:21:48.867931 I | embed: listening for client requests on localhost:2379
2020-10-17 17:21:48.961910 I | etcdserver: recovered store from snapshot at index 96200988
2020-10-17 17:21:48.964917 I | mvcc: restore compact to 83592395
2020-10-17 17:21:49.043462 I | etcdserver: name = default
2020-10-17 17:21:49.043594 I | etcdserver: data dir = management-state/etcd
2020-10-17 17:21:49.043682 I | etcdserver: member dir = management-state/etcd/member
2020-10-17 17:21:49.043806 I | etcdserver: heartbeat = 100ms
2020-10-17 17:21:49.043931 I | etcdserver: election = 1000ms
2020-10-17 17:21:49.044005 I | etcdserver: snapshot count = 100000
2020-10-17 17:21:49.044131 I | etcdserver: advertise client URLs = http://localhost:2379
2020-10-17 17:21:49.407153 I | etcdserver: restarting member e92d66acd89ecf29 in cluster 7581d6eb2d25405b at commit index 96246578
2020-10-17 17:21:49.411002 I | raft: e92d66acd89ecf29 became follower at term 8246
2020-10-17 17:21:49.411040 I | raft: newRaft e92d66acd89ecf29 [peers: [e92d66acd89ecf29], term: 8246, commit: 96246578, applied: 96200988, lastindex: 96246578, lastterm: 8246]
2020-10-17 17:21:49.411207 I | etcdserver/api: enabled capabilities for version 3.3
2020-10-17 17:21:49.411231 I | etcdserver/membership: added member e92d66acd89ecf29 [https://127.0.0.1:2380] to cluster 7581d6eb2d25405b from store
2020-10-17 17:21:49.411241 I | etcdserver/membership: set the cluster version to 3.3 from store
2020-10-17 17:21:49.417348 I | mvcc: restore compact to 83592395
2020-10-17 17:21:49.490531 W | auth: simple token is not cryptographically signed
2020-10-17 17:21:49.499558 I | etcdserver: starting server... [version: 3.3.14, cluster version: 3.3]
2020-10-17 17:21:49.499908 I | etcdserver: e92d66acd89ecf29 as single-node; fast-forwarding 9 ticks (election ticks 10)
2020-10-17 17:21:49.911640 I | raft: e92d66acd89ecf29 is starting a new election at term 8246
2020-10-17 17:21:49.911921 I | raft: e92d66acd89ecf29 became candidate at term 8247
2020-10-17 17:21:49.911980 I | raft: e92d66acd89ecf29 received MsgVoteResp from e92d66acd89ecf29 at term 8247
2020-10-17 17:21:49.912056 I | raft: e92d66acd89ecf29 became leader at term 8247
2020-10-17 17:21:49.912083 I | raft: raft.node: e92d66acd89ecf29 elected leader e92d66acd89ecf29 at term 8247
2020-10-17 17:21:49.916267 I | embed: ready to serve client requests
2020-10-17 17:21:49.916423 I | etcdserver: published {Name:default ClientURLs:[http://localhost:2379]} to cluster 7581d6eb2d25405b
2020-10-17 17:21:49.917087 N | embed: serving insecure client requests on 127.0.0.1:2379, this is strongly discouraged!
2020/10/17 17:21:49 [INFO] Waiting for server to become available: Get https://localhost:6443/version?timeout=30s: dial tcp 127.0.0.1:6443: connect: connection refused
time="2020-10-17T17:21:49.985874002Z" level=info msg="Starting k3s v0.8.0 (f867995f)"
time="2020-10-17T17:21:49.997135165Z" level=info msg="Running kube-apiserver --advertise-port=6443 --allow-privileged=true --api-audiences=unknown --authorization-mode=Node,RBAC --basic-auth-file=/var/lib/rancher/k3s/server/cred/passwd --bind-address=127.0.0.1 --cert-dir=/var/lib/rancher/k3s/server/tls/temporary-certs --client-ca-file=/var/lib/rancher/k3s/server/tls/client-ca.crt --enable-admission-plugins=NodeRestriction --etcd-servers=http://localhost:2379 --insecure-port=0 --kubelet-client-certificate=/var/lib/rancher/k3s/server/tls/client-kube-apiserver.crt --kubelet-client-key=/var/lib/rancher/k3s/server/tls/client-kube-apiserver.key --proxy-client-cert-file=/var/lib/rancher/k3s/server/tls/client-auth-proxy.crt --proxy-client-key-file=/var/lib/rancher/k3s/server/tls/client-auth-proxy.key --requestheader-allowed-names=system:auth-proxy --requestheader-client-ca-file=/var/lib/rancher/k3s/server/tls/request-header-ca.crt --requestheader-extra-headers-prefix=X-Remote-Extra- --requestheader-group-headers=X-Remote-Group --requestheader-username-headers=X-Remote-User --secure-port=6444 --service-account-issuer=k3s --service-account-key-file=/var/lib/rancher/k3s/server/tls/service.key --service-account-signing-key-file=/var/lib/rancher/k3s/server/tls/service.key --service-cluster-ip-range=10.43.0.0/16 --storage-backend=etcd3 --tls-cert-file=/var/lib/rancher/k3s/server/tls/serving-kube-apiserver.crt --tls-private-key-file=/var/lib/rancher/k3s/server/tls/serving-kube-apiserver.key"
E1017 17:21:50.004625      31 prometheus.go:138] failed to register depth metric admission_quota_controller: duplicate metrics collector registration attempted
E1017 17:21:50.005390      31 prometheus.go:150] failed to register adds metric admission_quota_controller: duplicate metrics collector registration attempted
E1017 17:21:50.005594      31 prometheus.go:162] failed to register latency metric admission_quota_controller: duplicate metrics collector registration attempted
E1017 17:21:50.005684      31 prometheus.go:174] failed to register work_duration metric admission_quota_controller: duplicate metrics collector registration attempted
E1017 17:21:50.005775      31 prometheus.go:189] failed to register unfinished_work_seconds metric admission_quota_controller: duplicate metrics collector registration attempted
E1017 17:21:50.005875      31 prometheus.go:202] failed to register longest_running_processor_microseconds metric admission_quota_controller: duplicate metrics collector registration attempted
W1017 17:21:50.176215      31 genericapiserver.go:315] Skipping API batch/v2alpha1 because it has no resources.
W1017 17:21:50.186381      31 genericapiserver.go:315] Skipping API node.k8s.io/v1alpha1 because it has no resources.
E1017 17:21:50.219027      31 prometheus.go:138] failed to register depth metric admission_quota_controller: duplicate metrics collector registration attempted
E1017 17:21:50.219070      31 prometheus.go:150] failed to register adds metric admission_quota_controller: duplicate metrics collector registration attempted
E1017 17:21:50.219200      31 prometheus.go:162] failed to register latency metric admission_quota_controller: duplicate metrics collector registration attempted
E1017 17:21:50.219250      31 prometheus.go:174] failed to register work_duration metric admission_quota_controller: duplicate metrics collector registration attempted
E1017 17:21:50.219289      31 prometheus.go:189] failed to register unfinished_work_seconds metric admission_quota_controller: duplicate metrics collector registration attempted
E1017 17:21:50.219316      31 prometheus.go:202] failed to register longest_running_processor_microseconds metric admission_quota_controller: duplicate metrics collector registration attempted
time="2020-10-17T17:21:50.227202599Z" level=info msg="Running kube-scheduler --bind-address=127.0.0.1 --kubeconfig=/var/lib/rancher/k3s/server/cred/scheduler.kubeconfig --port=10251 --secure-port=0"
time="2020-10-17T17:21:50.227582783Z" level=info msg="Running kube-controller-manager --allocate-node-cidrs=true --bind-address=127.0.0.1 --cluster-cidr=10.42.0.0/16 --cluster-signing-cert-file=/var/lib/rancher/k3s/server/tls/server-ca.crt --cluster-signing-key-file=/var/lib/rancher/k3s/server/tls/server-ca.key --kubeconfig=/var/lib/rancher/k3s/server/cred/controller.kubeconfig --port=10252 --root-ca-file=/var/lib/rancher/k3s/server/tls/server-ca.crt --secure-port=0 --service-account-private-key-file=/var/lib/rancher/k3s/server/tls/service.key --use-service-account-credentials=true"
W1017 17:21:50.234575      31 authorization.go:47] Authorization is disabled
W1017 17:21:50.234595      31 authentication.go:55] Authentication is disabled
time="2020-10-17T17:21:50.250114544Z" level=fatal msg="starting tls server: Get https://localhost:6444/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions: x509: certificate has expired or is not yet valid"
2020/10/17 17:21:50 [FATAL] k3s exited with: exit status 1

If I start Rancher after deleting the files, but without setting the clock back to ‘now’, it will start successfully, but it won’t regenerate them either.

Yes they’re definitely gone:

# sudo docker exec -it 53e26 sh -c "rm /var/lib/rancher/management-state/tls/token-node.crt; rm /var/lib/rancher/management-state/tls/localhost.crt"
rm: cannot remove '/var/lib/rancher/management-state/tls/token-node.crt': No such file or directory
rm: cannot remove '/var/lib/rancher/management-state/tls/localhost.crt': No such file or directory

I’m encountering this issue on Rancher v2.3.6. Removing token-node.crt and localhost.crt does not resolve the issue for me.

Following @OneideLuizSchneider’s advice and deleting the bad certs, updating the time and restarting the container fixed it for me. It was just a different set of certs in my case.

@ArgonV the certs in question seem to be generated for some sort of proxy internal to Rancher. If you dump out the certs with openssl you’ll find CN=kubernetes and SANs DNS:kubernetes.default.svc, DNS:kubernetes.default, DNS:kubernetes, DNS:localhost, IP Address:127.0.0.1, IP Address:10.43.0.1.

This drove me nuts all day, literally everything but Rancher was working fine and had certs valid until 2029.
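
The openssl dump mentioned above can be done along these lines (this assumes openssl is present inside the container and the single-node layout discussed earlier in this thread; the container name rancher is a placeholder):

sudo docker exec rancher sh -c "openssl x509 -noout -text -in /var/lib/rancher/management-state/tls/localhost.crt" | grep -E "Subject:|DNS:|IP Address:"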

Can confirm that stopping the container, deleting token-node.crt and localhost.crt, and starting the container back up caused the certificates to be regenerated.
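
A sketch of that sequence, assuming /var/lib/rancher is bind-mounted at /opt/rancher on the host and the container is named rancher (adjust both to your setup):

sudo docker stop rancher
sudo rm /opt/rancher/management-state/tls/token-node.crt /opt/rancher/management-state/tls/localhost.crt
sudo docker start rancher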

I just got the tls: bad certificate error today; taking the clock back a couple of hours and deleting /var/lib/rancher/management-state/tls/localhost.crt and /var/lib/rancher/management-state/tls/token-node.crt, followed by a restart of the container, did the trick.

@justincarter I restored snapshots for the etcd and rancher master VMs. Then I applied your solution and worked like a charm! The cluster was stuck rotating certificates

I’m having the same issue.

These steps worked for me, running rancher version 2.3.11 on docker:

  1. Stop rancher container (e.g. docker stop my-rancher)
  2. Remove the directory: rancher/k3s/server/tls (I have /var/lib/rancher mapped to a local volume) (e.g. rm -Rf rancher/k3s/server/tls)
  3. Remove the rancher container (e.g. docker rm my-rancher)
  4. Start a new rancher container (e.g. docker run -d --restart=unless-stopped -p 80:80 -p 443:443 --memory-swappiness=60 --name my-rancher -v /opt/rancher/:/var/lib/rancher/ rancher/rancher:v2.3.11)

Thanks for all the above answers.

None of the solutions above really helped me.

In the end, removing the k3s/server/tls directory + upgrading rancher from 2.5.6 to 2.5.8 solved it.

The steps mentioned by @cepefernando worked! Thanks so much! https://github.com/rancher/rancher/issues/26984#issuecomment-818770519 This was with Rancher 2.4.5.

@cepefernando I also have the K3s subsystem. I removed the k3s certificate just via Rancher UI after I set the time a few days back. Worked like a charm.

You are my hero @dnauck just wasted 5 hours of debugging and trying everything possible, but only your solution worked!

Just to help all the other folks with old and busted clusters from real-world scenarios 😉: Given: Rancher 3.5.0 restarts continuously because of expired certificates.

My solution:

  • backup container files
  • I changed no time setting nor TLS files
  • upgrade to 3.5.10 (it fails with the same error)
  • upgrade to 3.4.8 (give it time to start up).

From then on it works.

In our case here, the cluster stopped working because of the certificate, and we did the following to resolve it:

docker restart ranchermaster
docker exec -it ranchermaster sh -c "mv /var/lib/rancher/k3s/server/tls/ /var/lib/rancher/k3s/server/tls.bak"
docker restart ranchermaster

My /var/lib/rancher/management-state/tls folder had certificates that had been expired for 6+ months, so I assumed it was safe to delete. After a backup, of course.

Same happened with me. Thanks @justincarter your fix helped me.

Thanks @justincarter! I ran into the exact same issue today and this helped me fix my Rancher installation as well. In my case, I didn’t have the /etc/kubernetes/ssl/ directory or the /var/lib/rancher/management-state/certs/bundle.json file, but deleting the other 2 and restarting worked!