cert-manager: Vault Issuer does not retry signing CertificateRequests if the status is pending
Describe the bug:
When the Vault issuer sets a CertificateRequest’s status to Pending (due to a failure initializing the Vault client or a missing secret), that CR is never retried or put back onto the controller rate-limiting workqueue. Since the state is pending and not failed, there’s no retry logic to re-attempt the issuance process. This is especially relevant in certificate renewal contexts.
Example of a CertificateRequest stuck in Pending (retrieved from the cluster a week later):
Name: istiod-1-11-0-h8jh9
Namespace: istio-system
Labels: <none>
Annotations: cert-manager.io/certificate-name: istiod-1-11-0
             cert-manager.io/certificate-revision: 636
             cert-manager.io/private-key-secret-name: istiod-1-11-0-nrsp8
API Version: cert-manager.io/v1
Kind: CertificateRequest
Metadata:
  Creation Timestamp: 2021-10-18T21:52:03Z
  Generate Name: istiod-1-11-0-
  Generation: 1
  ...
Spec:
  Duration: 1h0m0s
  Groups:
    system:serviceaccounts
    system:serviceaccounts:cert-manager
    system:authenticated
  Issuer Ref:
    Group: cert-manager.io
    Kind: ClusterIssuer
    Name: istio-vault-ca
  Request: ${CR request}
  UID: ef8b5dce-01a4-487d-b96f-0e07a6596aac
  Username: system:serviceaccount:cert-manager:cert-manager
Status:
  Conditions:
    Last Transition Time: 2021-10-18T21:52:03Z
    Message: Certificate request has been approved by cert-manager.io
    Reason: cert-manager.io
    Status: True
    Type: Approved
    Last Transition Time: 2021-10-18T21:53:41Z
    Message: Failed to initialise vault client for signing: error reading Kubernetes service account token from vault-cluster-issuer-static-token: error calling Vault server: Error making API request.
      URL: POST ${MY-VAULT-INSTANCE}
      Code: 504. Raw Message:
      upstream request timeout
    Reason: Pending
    Status: False
    Type: Ready
Events: <none>
Expected behaviour:
The CertificateRequest should either be set to Failed, or the Pending CR should re-attempt issuance.
Steps to reproduce the bug:
Use the Vault CA to create a Certificate manifest with an expiry (a renewBefore value will help with testing). When the renewal time comes, make it fail for some reason (e.g. a network timeout). The certificate will stay pending with no retry.
Anything else we need to know?: I believe this bug happens because the Vault issuer doesn’t return an error when it encounters an error building the Vault client. If an error were returned, the controller would re-queue the resource to be processed again until no error was encountered (or the request failed). Is there a reason this isn’t the current behavior?
Environment details:
- Kubernetes version: 1.20
- Cloud-provider/provisioner: GKE
- cert-manager version: v1.4.1
- Install method: Helm
/kind bug
About this issue
- State: open
- Created 3 years ago
- Reactions: 2
- Comments: 31 (6 by maintainers)
Thanks for showing interest in fixing this @mmontes11
We haven’t had a chance to look into this issue yet.
I think the reason why this was originally set to Pending was that we were afraid to spam the Vault server. Having said that, I think that the approach you suggest should be safe enough:
This makes sense I think. It would mean that when requests to Vault fail with a 4xx Vault error, they get retried with exponential backoff, which should prevent us from spamming Vault 👍🏼
I think that in the case of a non-Vault-4xx error, we should be able to simply return an error in the controller to reconcile the same CertificateRequest with the default Kubernetes controller backoff. This is also how the ACME controllers currently work, see e.g. here.
Out of interest, what errors do you typically see?
(Entering exponential backoff when an issuance request to Vault fails with a 4xx error is a slight change in behaviour. Previously the CertificateRequest would have been pending, so if the reason for the failure got fixed, e.g. by modifying a Secret with Vault’s credentials, the CertificateRequest would have been retried immediately, whereas now we will back off for at least an hour (a failed CertificateRequest means exponential backoff). I think this is fine though, as we should have checked the Secret before attempting issuance, and some more advanced issuer health check could also be added.)
We are running into this issue and I would like to contribute a fix for it.
@irbekrm instead of trying to fix it for ALL retriable cases, can we just ignore the cases which are not clear as of now? That way we at least have some error handling and better reliability.
Please let me know what you think. I can raise a PR if that sounds ok.
@mmontes11 that is interesting, if you actually saw this error before and it was a vault connectivity issue as well, then we are in the same situation, and upgrading cert-manager won’t fix anything until this topic is properly handled.
Hello, we have faced the same situation, but with another error this time:
The secret actually existed in this case.
One thing to note here is that we are still on cert-manager 1.5.4 while we have upgraded to k8s 1.23, so it might be a version problem. We will upgrade to the latest cert-manager soon.