rancher: How to rotate cattle-webhook-tls certificate when it has expired?
Rancher Server Setup
- Rancher version: 2.5.8
- Installation option (Docker install/Helm Chart): Helm Chart
- If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): EKS
- Proxy/Cert Details: External
Information about the Cluster
- Kubernetes version: 1.20
- Cluster Type (Local/Downstream): Local
- If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider): N/A
Describe the bug
The cluster webhook certificate seems to have expired, and all RBAC operations are now blocked with the error message: Internal error occurred: failed calling webhook "rancherauth.cattle.io": Post "https://rancher-webhook.cattle-system.svc:443/v1/webhook/validation?timeout=10s": x509: certificate has expired or is not yet valid
The culprit seems to be the expired webhook TLS certificate stored in the cattle-webhook-tls secret.
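To confirm the expiry date, the certificate in the secret can be decoded locally. This is a minimal sketch, assuming the secret lives in the cattle-system namespace and stores the PEM certificate under the tls.crt key (the usual layout for TLS secrets, not confirmed in this thread); the function name is hypothetical.

```shell
# Print the "notAfter" date of the webhook serving certificate.
# Assumption: the PEM certificate is stored under the "tls.crt" key.
webhook_cert_enddate() {
  kubectl get secret cattle-webhook-tls -n cattle-system \
    -o jsonpath='{.data.tls\.crt}' \
    | base64 -d \
    | openssl x509 -noout -enddate
}

# Usage: webhook_cert_enddate
```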
To Reproduce
Install Rancher via Helm chart and wait for a year.
Result
The webhook certificate expires, and the Rancher documentation has no instructions on how to rotate the rancher-webhook certificate secret.
Expected Result
I searched a lot of places looking for instructions on how to rotate the certificate but was not able to find any.
SURE-3475 SURE-3737 SURE-3790 SURE-3528
About this issue
- State: closed
- Created 3 years ago
- Reactions: 5
- Comments: 31 (11 by maintainers)
Finally, I solved the problem by changing the rancher-webhook deployment image from rancher-webhook:v0.2.0 to rancher-webhook:v0.1.1. Once the cattle-webhook-tls secret had been created, I switched the image version back, and the certificate was renewed. Before doing any of this, you have to delete the cattle-webhook-tls secret.
@sevko21 try deleting your webhook configurations, letting your webhook pod relaunch, and then restoring the webhooks once the webhook is healthy.
Deleting the cattle-webhook-tls secret and then the rancher-webhook pod should do what you need. Once deleted, the rancher-webhook pod gets recreated and regenerates a new secret that should be good for another year.
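A minimal sketch of this workaround, using the secret name, namespace, and pod label that appear elsewhere in this thread. The wrapper function name is hypothetical, and taking a backup first (as suggested in other comments) is still advisable.

```shell
# Delete the expired TLS secret, then the webhook pod, so that the
# recreated pod generates a fresh certificate and secret.
rotate_webhook_cert() {
  kubectl delete secret cattle-webhook-tls -n cattle-system || return 1
  kubectl delete pod -n cattle-system -l app=rancher-webhook
}

# Usage: rotate_webhook_cert
```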
I was affected by this problem, too. The workaround worked for me.
The certificate rotation suggested on https://rancher.com/blog/2019/kubernetes-certificate-expiry-and-rotation-in-rancher-kubernetes-clusters did not help.
I ran into this on a brand new 2.6 cluster. I recreated the issue by deleting the cattle-webhook-tls secret and deleting the rancher-webhook pod. At that point, the logs were just repeating that it was trying to create a new one. Deleting the webhook and deleting the pod one more time fixed this. The webhook recreated both a new cert and the webhook config.
Everything appears to work that way, but I’d still do an etcd backup before proceeding.
@Kampfmoehre when you’re already on 2.6.0, you have to downgrade the pod’s image to 0.1.1 as @xvi0101 stated. You can do this using the UI if you want. If you only change the pod, you can kill the pod after successful cert creation; the deployment then recreates a 0.2.0 one. We had this on both 2.5.9 and 2.6.0 today: 2.5.9 works without downgrading, 2.6.0 needs the downgrade described above. By the way, we had this bug last year, too, and it seems to have been there since 2.2. I’m losing a bit of trust in Rancher.
Two things we can do in the interim:
Add a docs change indicating the workaround
This rancher/docs change would involve linking this issue, and including the following workaround:
Delete the secret upon startup if expired
The below code block contains a high-level, partial pseudo code solution during the webhook startup sequence. Perhaps, a CronJob could be added to check that a cert is about to expire and pre-emptively delete the secret and restart the pod. This could require a dynamiclistener change (related commit) depending on the solution.
Code area: https://github.com/rancher/webhook/blob/60475a4a89a7230381e27881c649acd1a1ee261b/pkg/server/server.go#L41-L45
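The CronJob idea mentioned above could be sketched as a small shell script. This is not the maintainers' pseudocode and not the dynamiclistener change; it is only an illustration, assuming the certificate is stored under the tls.crt key of the secret and using a 30-day threshold as an example value. Both function names are hypothetical.

```shell
# Succeeds (exit 0) when the PEM certificate on stdin expires within N days.
# `openssl x509 -checkend S` exits non-zero if the cert expires within S seconds.
# Caveat: unparsable input also makes openssl fail, so it counts as "expiring".
cert_expires_within_days() {
  days="${1:-30}"   # assumed threshold; adjust as needed
  ! openssl x509 -noout -checkend "$((days * 86400))" >/dev/null 2>&1
}

# Delete the secret and restart the webhook pod when the cert is close to expiry.
# Assumption: the certificate is stored under the "tls.crt" key of the secret.
renew_if_expiring() {
  cert="$(kubectl get secret cattle-webhook-tls -n cattle-system \
            -o jsonpath='{.data.tls\.crt}' | base64 -d)" || return 1
  [ -n "$cert" ] || return 1   # secret missing or empty: nothing to check
  if printf '%s\n' "$cert" | cert_expires_within_days 30; then
    kubectl delete secret cattle-webhook-tls -n cattle-system &&
      kubectl delete pod -n cattle-system -l app=rancher-webhook
  fi
}
```

Run on a schedule, renew_if_expiring would delete the secret only when the certificate is inside the threshold, which mirrors the manual workaround from this thread.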
@xvi0101 thank you, that worked!!!
@gregod-com I had a slightly different symptom: my rancher-webhook pod had been running the whole time, but it kept reporting the secret with the expired certificate as the active TLS secret and did not even attempt to renew it. I tried starting over with a fresh pod and got the same log. Then I backed up and removed the secret containing the expired certificate, cattle-webhook-tls, started the pod over again and bang, the secret was regenerated with a renewed certificate. Thanks a lot for your hints!
hey @petertang2012, please wait for someone from rancher to confirm this, but I could get webhook back up and running (and therefore recreating the cattle-webhook-tls) by deleting the mutatingwebhookconfiguration rancher.cattle.io like so:
kubectl delete mutatingwebhookconfiguration rancher.cattle.io
But please keep in mind that this was just a guess based on changes in the rancher-webhook helm chart (i.e. the delete hooks created 2 months ago) and I’m not sure if this works for you and even not sure if this doesn’t break anything, since I didn’t look further to find out what is happening here. Be careful. Cheers
Release note
New and existing rancher-webhook deployments will automatically renew their TLS certificate when it is within 30 days of expiring.
QA testing
Root cause
The webhook’s dynamiclistener did not check if webhook’s certificate is about to expire, because the certificate value had not been set properly on the dynamiclistener server listener.
What was fixed, or what changes have occurred
The dynamiclistener has had a commit that we cherry-picked into a new version of webhook which is to be used in a new version of Rancher.
Areas or cases that should be tested
What areas could experience regressions?
Testing steps
Use a new v2.5.12 cluster
Deploy a new v2.5.12 pre-release Rancher container. Find the secret called cattle-webhook-tls in the cattle-system namespace. Copy the certificate contents and decode them. You can use a site like https://certificatedecoder.dev. Observe that Expires On shows a date 10 years from today.
You need to test that the webhook pod keeps renewing the certificate if necessary (if it is 30 days or fewer until expiration). Since your current certificate will expire in 10 years, you would have to wait for 10 years minus 30 days to see it renewed before it is set to expire. Instead, delete the certificate secret (by hand or with kubectl delete pod -n cattle-system -l app=rancher-webhook).
Go to the rancher-webhook deployment in the cattle-system namespace. Add an environment variable to it called CATTLE_NEW_SIGNED_CERT_EXPIRATION_DAYS and set its value to 2. Save the deployment and observe that its pod is deleted and a new one is created in its place. If you see an error on save that a service already exists, ignore it, that’s something else; the deployment has been saved successfully.
Check that the certificate has been recreated. Decode its contents and observe that it is valid for 2 days. The webhook pod checks in on its certificate expiration every 6 hours. In 6 hours, you can observe that it has updated (not recreated) the certificate, and the cert’s Expires On date is extended by 2 days again (notice the slight time difference of a few seconds/minutes).
Because the certificate is now only valid for 2 days, as you specified, you will see the webhook pod try to update it, extend its validity period by 2 days. It will keep doing this because 2 days is fewer than 30 days, so webhook thinks it’s time to update it ahead of expiration.
To get a 10-year certificate, update the deployment again by removing the environment variable and save. Absence of that variable means the pod should make a cert that is valid for 10 years, the default.
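For those who prefer the command line to the UI, the environment variable changes in these steps can be sketched with kubectl set env. The deployment name, namespace, and variable name come from the steps above; the two function names are hypothetical.

```shell
# Set a short certificate lifetime (in days) on the rancher-webhook deployment.
# Changing the pod template makes the deployment roll out a new pod.
set_cert_expiration_days() {
  kubectl set env deployment/rancher-webhook -n cattle-system \
    "CATTLE_NEW_SIGNED_CERT_EXPIRATION_DAYS=$1"
}

# Remove the variable again; a trailing "-" tells `kubectl set env` to unset it.
unset_cert_expiration_days() {
  kubectl set env deployment/rancher-webhook -n cattle-system \
    CATTLE_NEW_SIGNED_CERT_EXPIRATION_DAYS-
}

# Usage:
#   set_cert_expiration_days 2
#   unset_cert_expiration_days
```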
Upgrade from v2.5.11 to v2.5.12
Now get a new v2.5.11 cluster and upgrade it to v2.5.12. The webhook app will be upgraded in a couple of minutes. Therefore, the rancher-webhook deployment will also be upgraded and its pod recreated. However, the certificate will remain the same, since it already exists, has not expired yet, and isn’t 30 days within expiration. You can verify this by checking that the cert contents have not changed at all. But the new pod will now watch it for expiration. And 30 days before expiration, it will renew the cert.
Check the image of the pod made by the rancher-webhook deployment, make sure it is v0.1.3-rc1 (the number after rc might be different).
Run the same routine with deleting the secret, changing the deployment’s CATTLE_NEW_SIGNED_CERT_EXPIRATION_DAYS variable to less than 30 and observe that in 6 hours, the secret will be updated (again, not recreated). Check its expiration date and ensure it’s 10 years in the future.
Since this is an important issue and needs to be solved, we’ll be implementing a partial fix via this child issue in the meantime to extend the certificate expiration to 10 years. It is a self-signed cert, so the security concerns are minimal. Ideally, we will fix this issue as well (likely via a weekly CronJob that checks if the cert will expire in <30d in advance or with a fix to dynamiclistener). Thank you all for your patience; I agree that this needs to be solved.
I wonder if someone from Rancher can clarify for me why the rancher-webhook workload kept using a secret with an expired certificate without even checking its validity period and taking appropriate action.