rancher: [2.2.9] Rancher container restarting every 12 seconds, expired certificates
What kind of request is this (question/bug/enhancement/feature request):
Bug
Steps to reproduce (least amount of steps as possible):
- Install Rancher v2.0.0, upgrade to v2.0.2 -> v2.0.4 -> v2.0.8
- Upgrade to v2.1.6
- One year after Rancher v2.0.0 was installed, certificates expire and cluster becomes “unavailable”
- Upgrade to v2.1.9; did not fix certificate expiry/rotation issue
- Upgrade to v2.2.2, certificates rotated and cluster is available again, everything working
- One year after Rancher v2.2.2 was installed, the Rancher Server UI becomes unavailable due to the container restarting every 12 seconds
- Perform a backup of /var/lib/rancher; two certs inside the backup are expired and Rancher does not auto-renew them:
- /var/lib/rancher/management-state/tls/localhost.crt
- /var/lib/rancher/management-state/tls/token-node.crt
(I think you could simulate the above timeline by setting the system clock to a date in the past and then moving it forward at the appropriate time to reproduce a ~1 year jump).
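(As a quick sanity check, not part of the original report: openssl can print the validity window of those two files. The paths below mirror the ones listed above and assume the backup of /var/lib/rancher is available at that location.)
openssl x509 -in /var/lib/rancher/management-state/tls/localhost.crt -noout -startdate -enddate
openssl x509 -in /var/lib/rancher/management-state/tls/token-node.crt -noout -startdate -enddate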
Result:
Running Rancher v2.2.9 as a single Docker container install, the Rancher Server UI becomes unavailable (“connection refused” in the browser) and the container is restarting every 12 seconds. Rancher is unusable.
Environment information
- Rancher version (rancher/rancher or rancher/server image tag, or shown bottom left in the UI): rancher/rancher v2.2.9
- Installation option (single install/HA): Single install (Docker container)
Possible Workarounds:
Workaround 1)
Set the system clock to a date in the past so that the certificate is not seen as expired. For me, on an Ubuntu server, that was achievable by disabling NTP and then setting the date and time manually;
sudo timedatectl set-ntp off
sudo date --set="2020-05-05 09:03:00.000"
This allowed the container to start up correctly and the Rancher Server UI was usable again, but this is only a short-term workaround at best.
Workaround 2)
NOTE: I'm not advocating that anyone run these commands on their particular installation; I'm just providing them as feedback for review by Rancher staff, because for me they solved the issue I was having…
This workaround was suggested to me by a community member on Rancher’s Slack.
rm /etc/kubernetes/ssl/*
rm /var/lib/rancher/management-state/certs/bundle.json
rm /var/lib/rancher/management-state/tls/token-node.crt
rm /var/lib/rancher/management-state/tls/localhost.crt
Inside the rancher container I did not have an /etc/kubernetes/ssl directory, so I could not run that first command. The other three files did exist (and were originally visible inside the backup of /var/lib/rancher).
Actual command I ran to remove the files (NOTE: again, please don’t take this as advice, I’m just providing it for reference);
sudo docker exec -it acd7 sh -c "rm /var/lib/rancher/management-state/certs/bundle.json; rm /var/lib/rancher/management-state/tls/token-node.crt; rm /var/lib/rancher/management-state/tls/localhost.crt"
Then I enabled NTP again with sudo timedatectl set-ntp on to set the system clock back to the real/current time, and restarted the container with sudo docker restart acd7. Rancher started up correctly and was available again, and the clusters were visible (two AWS EC2 clusters attached to this server).
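(A quick way to verify the recovery, not part of the original steps; acd7 is the container ID used above, and the /ping health endpoint is assumed to be served by Rancher on the HTTPS port.)
sudo docker logs --tail 50 acd7   # should no longer show "certificate has expired" errors
curl -ks https://localhost/ping   # expected to return "pong" once Rancher is healthy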
Other details that may be helpful:
Images on server
$ sudo docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
busybox latest 020584afccce 6 months ago 1.22 MB
rancher/rancher v2.2.9 944b5893d458 6 months ago 483 MB
rancher/rancher v2.1.9 9a79850e485c 12 months ago 541 MB
rancher/rancher v2.2.2 cb5cf64e84cc 12 months ago 495 MB
alpine latest caf27325b298 15 months ago 5.53 MB
rancher/rancher v2.1.6 d14ff1038a54 15 months ago 542 MB
rancher/rancher v2.0.8 817b51fbc1fc 20 months ago 529 MB
rancher/rancher v2.0.4 975f0d475e47 22 months ago 530 MB
rancher/rancher v2.0.2 88526c7bea4e 23 months ago 521 MB
rancher/rancher v2.0.0 3141e5c66ee8 2 years ago 535 MB
Rancher Logs
When the problem first occurred, Rancher starts up and then shows many “bad certificate” / “certificate has expired or is not yet valid” errors;
2020/05/07 07:15:22 [INFO] Rancher version v2.2.9 is starting
2020/05/07 07:15:22 [INFO] Rancher arguments {ACMEDomains:[redacted] AddLocal:auto Embedded:false KubeConfig: HTTPListenPort:80 HTTPSListenPort:443 K8sMode:auto Debug:false NoCACerts:false ListenConfig:<nil> AuditLogPath:/var/log/auditlog/rancher-api-audit.log AuditLogMaxage:10 AuditLogMaxsize:100 AuditLogMaxbackup:10 AuditLevel:0}
2020/05/07 07:15:22 [INFO] Listening on /tmp/log.sock
2020/05/07 07:15:22 [INFO] Running etcd --data-dir=management-state/etcd
...
I0507 07:15:24.805853 5 naming_controller.go:284] Starting NamingConditionController
I0507 07:15:24.805873 5 establishing_controller.go:73] Starting EstablishingController
2020/05/07 07:15:24 [INFO] Waiting for server to become available: Get https://localhost:6443/version?timeout=30s: x509: certificate has expired or is not yet valid
2020-05-07 07:15:24.815346 I | http: TLS handshake error from 127.0.0.1:43826: remote error: tls: bad certificate
2020-05-07 07:15:24.828573 I | http: TLS handshake error from 127.0.0.1:43876: remote error: tls: bad certificate
E0507 07:15:24.856329 5 reflector.go:134] k8s.io/client-go/informers/factory.go:127: Failed to list *v1.ReplicaSet: Get https://localhost:6443/apis/apps/v1/replicasets?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
E0507 07:15:24.857714 5 reflector.go:134] k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:178: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=status.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded&limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
E0507 07:15:24.861677 5 reflector.go:134] k8s.io/client-go/informers/factory.go:127: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
E0507 07:15:24.862446 5 reflector.go:134] k8s.io/client-go/informers/factory.go:127: Failed to list *v1.PersistentVolume: Get https://localhost:6443/api/v1/persistentvolumes?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
E0507 07:15:24.863244 5 reflector.go:134] k8s.io/client-go/informers/factory.go:127: Failed to list *v1.PersistentVolumeClaim: Get https://localhost:6443/api/v1/persistentvolumeclaims?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
2020-05-07 07:15:24.863976 I | http: TLS handshake error from 127.0.0.1:43888: remote error: tls: bad certificate
E0507 07:15:24.864317 5 reflector.go:134] k8s.io/client-go/informers/factory.go:127: Failed to list *v1.ReplicationController: Get https://localhost:6443/api/v1/replicationcontrollers?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
...
E0507 07:15:33.926893 5 reflector.go:134] k8s.io/client-go/informers/factory.go:127: Failed to list *v1.StorageClass: Get https://localhost:6443/apis/storage.k8s.io/v1/storageclasses?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
2020-05-07 07:15:33.926916 I | http: TLS handshake error from 127.0.0.1:44320: remote error: tls: bad certificate
E0507 07:15:33.932574 5 reflector.go:134] k8s.io/client-go/informers/factory.go:127: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
2020-05-07 07:15:33.932599 I | http: TLS handshake error from 127.0.0.1:44324: remote error: tls: bad certificate
2020-05-07 07:15:34.822709 I | http: TLS handshake error from 127.0.0.1:44328: remote error: tls: bad certificate
2020-05-07 07:15:34.825263 I | http: TLS handshake error from 127.0.0.1:44332: remote error: tls: bad certificate
F0507 07:15:34.825392 5 controllermanager.go:184] error building controller context: failed to wait for apiserver being healthy: timed out waiting for the condition: failed to get apiserver /healthz status: Get https://localhost:6443/healthz?timeout=32s: x509: certificate has expired or is not yet valid
I also have a copy of the logs showing the first start up after Workaround 2 above was performed, I can provide this on request if needed.
About this issue
- State: closed
- Created 4 years ago
- Reactions: 43
- Comments: 81
I was able to solve this issue by following the steps from @dnauck, but I had to do some other tweaks, as the Rancher version I was using also provisioned a k3s environment inside the Rancher docker container, which is where the K8s secrets were saved.
Soon after, reviewing the logs (docker logs rancher --tail=100), I was able to confirm the cert was generated again and I was able to access the UI successfully.
I had the same issue on Rancher 2.5.7, but none of the workarounds provided here were working.
Got a new workaround from Leo on Slack:
Well I had to dive further into this today in order to get our Rancher instance back up and running.
Unfortunately the steps from @OneideLuizSchneider wouldn't work for me on Rancher 2.3.3; K3s would fail to start (as seen above) when restarting the container. But I did manage to solve it (by upgrading to 2.3.9 and then 2.4.8) using the following steps:
sudo timedatectl set-ntp off
sudo date --set="2020-10-16 09:03:00.000"
(or whenever your certificate is still just valid; this has to be less than 90 days from expiring)
sudo docker stop rancher (v2.3.x)
sudo docker stop rancher (v2.3.9)
sudo docker stop rancher (v2.4.8)
sudo timedatectl set-ntp on
There's no need to remove localhost.crt and token-node.crt in this process. The reason this works (I believe) is that Rancher uses K3s under the hood. Rancher 2.3.x uses K3s v0.8.0, but automatic certificate renewal wasn't introduced until K3s v0.10.0 (see: https://github.com/rancher/k3s/issues/1621). Rancher v2.4.8, however, uses k3s version v1.17.2+k3s1 (cdab19b0), which will thus renew the certificates on boot.
In case somebody else runs into this issue with 2.3.x and can't upgrade: don't attempt to replace the k3s binary inside the rancher container with v0.10.0, it won't work. It fails with a "-storage-backend flag provided but not defined" error.
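(For reference, and assuming the container is named "rancher" and the embedded k3s binary is on the PATH, as the k3s kubectl commands elsewhere in this thread suggest, the bundled K3s version can be checked with:)
sudo docker exec rancher k3s --version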
After I read through this thread and https://github.com/rancher/rancher/issues/31404, and posted the issue in Rancher Slack, I got a very clear solution from David Holder. There is no need to change the time setting or to remove the certificates under /var/lib/rancher/k3s/server/tls, since those certificates are not expired yet, and no need to upgrade. The only expired certificate was the one on the k3s cluster, the k3s-serving secret.
My Rancher setup: single instance using Docker, version 2.4.5. In v2.4.5 there is no tls directory inside /var/lib/rancher/management-state.
The key thing here is --insecure-skip-tls-verify=true: since the certificate is expired, I cannot access any resource inside the k3s cluster, let alone delete one, without it.
Now my Rancher UI is working as it was before. Hope this helps.
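(The exact commands from that Slack conversation are not reproduced above; a rough sketch of the approach, assuming a single-node install with the container named "rancher" and using the file and secret names that appear elsewhere in this thread, would be:)
sudo docker exec rancher sh -c "rm -f /var/lib/rancher/k3s/server/tls/dynamic-cert.json"
sudo docker exec rancher k3s kubectl --insecure-skip-tls-verify=true delete secret -n kube-system k3s-serving
sudo docker restart rancher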
I just got the same issue here on my cluster. Steps:
sudo timedatectl set-ntp off
sudo date --set="2020-07-11 09:03:00.000"
sudo docker exec -it rancher sh -c "rm /var/lib/rancher/management-state/tls/token-node.crt; rm /var/lib/rancher/management-state/tls/localhost.crt"
sudo timedatectl set-ntp on
sudo docker restart rancher
Thanks @justincarter
We just ran into this issue with Rancher 2.3.3. The fix for us was effectively the solution posted by @cjohn001. We only had to remove the k3s internal certificates and not bother touching anything else (nor did we need to fiddle with system time/date).
Posting the steps we took for when Rancher is running as a Docker container, in case someone else finds this handy. Regardless of how Rancher is running, the gist for us was to create a backup and ensure everything under k3s/server/tls was empty; a rough sketch of what we ran follows below. This worked for us (Rancher 2.5.2).
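(The exact commands were not preserved above; a minimal sketch of that approach, assuming the container is named "rancher" and moving the directory aside rather than deleting it so the copy doubles as the backup:)
sudo docker exec rancher sh -c "mv /var/lib/rancher/k3s/server/tls /var/lib/rancher/k3s/server/tls.bak"
sudo docker restart rancher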
After this, Rancher Server + UI went back to normal, just as if nothing had ever happened.
If anyone has a similar problem on v2.5.8: I had to delete the expired serving-cert and cattle-webhook-tls certificates in the local cluster's System project under the cattle-system namespace, then restart the Rancher container and wait for several minutes.
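(If you prefer the CLI over the UI, a hedged equivalent, assuming the embedded k3s kubectl inside a container named "rancher" as in other comments here, would be:)
sudo docker exec rancher k3s kubectl -n cattle-system delete secret serving-cert cattle-webhook-tls
sudo docker restart rancher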
Thanks to Rancher, I'm now off to learning a new Kubernetes platform; I just need to figure out which one! I'm posting this just so people realize not everyone was successful in fixing this issue, so if that's your case, you're not alone! I used Rancher for years and was always pretty impressed by it, but this one killed it for me.
Our Rancher UI became available again after following @cepefernando's post, but the clusters still remained in the “unavailable” state due to the certificate still being expired; it somehow did not rotate as we imagined it should (see https://github.com/rancher/rancher/issues/14731). To fix the problem we did the following workaround:
sudo docker exec -it rancher sh -c "rm /var/lib/rancher/k3s/server/tls/dynamic-cert.json"
sudo docker exec -it rancher k3s kubectl delete secret -n kube-system k3s-serving
docker run -d --name=rancher --restart=unless-stopped -p 443:443 -v /somedir/rancher:/var/lib/rancher -v /somedir/certs/cert.key:/etc/rancher/ssl/key.pem -v /somedir/certs/cert.pem:/etc/rancher/ssl/cert.pem -v /somedir/certs/cacerts.pem:/etc/rancher/ssl/cacerts.pem rancher/rancher:v2.X.X
Hello everyone, I assume I have the same problem here with Rancher 2.3.1.
https://forums.rancher.com/t/rancher-container-keeps-restarting/18654
Unfortunately I do not have the tls certificate folder under /var/lib/rancher/management-state/tls.
However, I found certificates under the following path (sudo ls k3s/server/tls). Should I delete all of those certs, and would I have to delete *.crt, *.key and the temporary-certs folder?
By the way, I am running on RancherOS and there seems to be no timedatectl command available. How can I install it?
client-admin.crt client-auth-proxy.key client-controller.crt client-kube-apiserver.key client-kubelet.key request-header-ca.crt server-ca.key serving-kube-apiserver.key client-admin.key client-ca.crt client-controller.key client-kube-proxy.crt client-scheduler.crt request-header-ca.key service.key serving-kubelet.key client-auth-proxy.crt client-ca.key client-kube-apiserver.crt client-kube-proxy.key client-scheduler.key server-ca.crt serving-kube-apiserver.crt temporary-certs
Ok, solved the problem by deleting all files and the temporary-certs folder in the tls folder and then doing a server restart. Running containers needed to be restarted in order to get things up again.
As a feature request, I'd appreciate it if the Rancher API contained the cluster certificate expiry dates. Currently the API shows when the cluster was created (“created”) but no information about the Kubernetes(-internal) certificates. This would greatly help with adding expiry checks to monitoring (e.g. with the check_rancher2 monitoring plugin).
Same here! The following helped:
For anyone still running Rancher as a single node Docker container, I've just published a guide on my blog on how you can migrate your single node Docker installation to a (HA) Kubernetes / K3s cluster. Unlike Rancher's documentation, this guide covers every step of the way to get you migrated, along with the hoops you'll have to jump through because Rancher is running as a container on Docker.
This migration wasn't possible prior to v2.5.0, but it is officially supported since then (though, as mentioned in the guide, I would advise running a version prior to v2.5.8 if you still need to upgrade).
It also covers automatically restarting K3s (using a CRON job) every 14 days so your certificates will always be renewed in time. It won’t fix your current issue, but hopefully will help you avoid running into it in the future.
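(As an illustration only, since the exact job from the guide is not reproduced here: on a systemd-based host with the service named "k3s", a crontab entry restarting it roughly every 14 days could look like this.)
# m h dom mon dow  command
0 3 */14 * * systemctl restart k3s
(The */14 day-of-month step fires on the 1st, 15th and 29th, which is close enough to renew certificates well before the 90-day window closes.)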
OK, I have been trying to renew my certificate for a week now, but nothing worked. There was one little secret missing:
Rancher Version v2.5.9
Had to delete dynamic-cert.json, the k3s-serving secret in the kube-system namespace, and the serving-cert secret in the cattle-system namespace. Restart the container after this and done.
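(A rough command-line sketch of those three deletions, assuming a single-node Docker install with the container named "rancher"; the file and secret names are the ones mentioned above.)
sudo docker exec rancher sh -c "rm -f /var/lib/rancher/k3s/server/tls/dynamic-cert.json"
sudo docker exec rancher k3s kubectl -n kube-system delete secret k3s-serving
sudo docker exec rancher k3s kubectl -n cattle-system delete secret serving-cert
sudo docker restart rancher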
Beyond annoying. Look at how old this issue is.
After restoring the Rancher UI instance, I faced another issue: the cluster is stuck in the Updating status.
The Rancher logs:
Can anyone help?
In my case, we were running Rancher as a single node Docker container using a Let's Encrypt certificate (--acme-domain install). The container is hosted on an Ubuntu VM running in Azure. There was an inbound rule on the NSG of this VM to deny all HTTP traffic (port 80), which was blocking the certificate renewal. Deleted this rule and added a new one to allow traffic temporarily. After this, forcing a certificate regeneration by following the steps from @GameScripting solved it for me.
For those who use the Let's Encrypt certificate from Rancher, ensure that there is no firewall appliance or resource like an NSG blocking traffic on port 80/443 during the certificate renewal. Whitelisting the required IP addresses is an option. In my case, I just temporarily allowed all HTTP traffic to flow to the VM and later blocked it again after the renewal was completed.
The command below can be used to verify the status of the certificate.
openssl s_client -connect localhost:443 -showcerts </dev/null 2>&1 | openssl x509 -noout -startdate -enddate
Just putting this here so that it helps someone in the future.
Here is how my case was solved:
Step I: SSH to your Rancher server
rm -Rf /var/lib/rancher-volume/k3s/server/tls
# [rancher-volume] is the volume name I chose when setting up Rancher: docker run -d -p 20443:443 --name=rancher-master --restart=unless-stopped -v /var/lib/rancher-volume/:/var/lib/rancher
Step II: Remove the current rancher container
docker rm -f rancher-master
# [rancher-master] is the name of my rancher container
Step III: Re-create the rancher container
docker run -d -p 20443:443 --name=rancher-master --restart=unless-stopped -v /var/lib/rancher-volume/:/var/lib/rancher rancher:stable
# same version you deleted, or whichever version you need
Step IV:
sudo reboot
Wait for 5 minutes and it's back alive 👍
I also hit this issue on 2.3.1. There are no files under /var/lib/rancher/management-state/tls. Reading this issue points us to the k3s/server/tls directory. Should I delete all those certs, and would I have to delete *.crt, *.key and the temporary-certs folder? Or simply upgrade Rancher as suggested?
Not really clear.
I'm having the same issue as @amagura in #29475. I tried the same thing that @OneideLuizSchneider describes on Rancher v2.3.3; it still fails over the k3s certificates being bad before it even gets to refreshing them, I guess? Anything else I should delete in order to have it regenerate them?
Logs:
If I start Rancher after deleting the files, but without setting the clock back to 'now', it will start successfully, but won't regenerate them either.
Yes they’re definitely gone:
I'm encountering this issue on Rancher v2.3.6. Removing token-node.crt and localhost.crt does not resolve the issue for me.
Following @OneideLuizSchneider's advice and deleting the bad certs, updating the time and restarting the container fixed it for me. It was just a different set of certs in my case.
@ArgonV the certs in question seem to be generated for some sort of proxy internal to Rancher. If you dump out the certs with openssl you'll find CN=kubernetes and the SANs DNS:kubernetes.default.svc, DNS:kubernetes.default, DNS:kubernetes, DNS:localhost, IP Address:127.0.0.1, IP Address:10.43.0.1.
This drove me nuts all day; literally everything but Rancher was working fine and had certs valid until 2029.
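(For reference, a generic way to dump the subject, validity and SANs of one of these files, assuming a copy of the cert is available on the host, e.g. from the /var/lib/rancher backup:)
openssl x509 -in localhost.crt -noout -subject -dates
openssl x509 -in localhost.crt -noout -text | grep -A1 "Subject Alternative Name"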
Can confirm that stopping the container, deleting token-node.crt and localhost.crt, and starting the container back up caused the certificates to be regenerated.
I just got the "tls: bad certificate" error today; taking the clock back a couple of hours and deleting /var/lib/rancher/management-state/tls/localhost.crt and /var/lib/rancher/management-state/tls/token-node.crt, followed by a restart of the container, did the trick.
@justincarter I restored snapshots for the etcd and Rancher master VMs. Then I applied your solution and it worked like a charm! The cluster had been stuck rotating certificates.
I’m having the same issue.
These steps worked for me, running rancher version 2.3.11 on docker:
Thanks for all the above answers.
None of the solutions above really helped me.
In the end, removing the k3s/server/tls directory and upgrading Rancher from 2.5.6 to 2.5.8 solved it.
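(The exact upgrade commands were not included above; a sketch following Rancher's standard single-node Docker upgrade procedure, assuming the existing container is named "rancher" and its data is carried over via a "rancher-data" volume container:)
sudo docker exec rancher sh -c "mv /var/lib/rancher/k3s/server/tls /var/lib/rancher/k3s/server/tls.bak"
sudo docker stop rancher
sudo docker create --volumes-from rancher --name rancher-data rancher/rancher:v2.5.6
sudo docker pull rancher/rancher:v2.5.8
sudo docker run -d --volumes-from rancher-data --restart=unless-stopped -p 80:80 -p 443:443 rancher/rancher:v2.5.8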
The steps mentioned by @cepefernando worked! Thanks so much! https://github.com/rancher/rancher/issues/26984#issuecomment-818770519 This was with Rancher 2.4.5.
@cepefernando I also have the K3s subsystem. I removed the k3s certificate just via the Rancher UI after I set the time a few days back. Worked like a charm.
You are my hero, @dnauck. I just wasted 5 hours of debugging and trying everything possible, but only your solution worked!
Just to help all the other folks with old and busted clusters from real-world scenarios 😉: Given: Rancher 3.5.0 restarts continuously because of expired certificates.
My solution:
From now on, it works.
In our case the cluster stopped working because of the certificate, and we did the following to resolve it:
docker restart ranchermaster
docker exec -it ranchermaster sh -c "mv /var/lib/rancher/k3s/server/tls/ /var/lib/rancher/k3s/server/tls.bak"
docker restart ranchermaster
My /var/lib/rancher/management-state/tls folder had certificates that had been expired for 6+ months, so I assumed it was safe to delete them. After a backup, of course.
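(The backup was just a copy of that directory out of the container; a minimal sketch, assuming the container is named "rancher":)
sudo docker cp rancher:/var/lib/rancher/management-state/tls ./management-state-tls-backup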
Same happened with me. Thanks @justincarter, your fix helped me.
Thanks @justincarter! I ran into the exact same issue today and this helped me fix my Rancher installation as well. In my case, I didn't have the /etc/kubernetes/ssl/ directory or the /var/lib/rancher/management-state/certs/bundle.json file, but deleting the other 2 and restarting worked!