cert-manager: Vault Issuer does not retry signing CertificateRequests if the status is pending

Describe the bug:

When the Vault issuer sets a CertificateRequest’s status to Pending (due to a failure initializing the Vault client or a missing secret), that CR is never retried or put back onto the controller’s rate-limited workqueue. Because the state is Pending rather than Failed, there is no retry logic to re-attempt issuance. This is especially problematic for certificate renewals.

Example of a CertificateRequest stuck in Pending (retrieved from the cluster a week later):

Name:         istiod-1-11-0-h8jh9
Namespace:    istio-system
Labels:       <none>
Annotations:  cert-manager.io/certificate-name: istiod-1-11-0
              cert-manager.io/certificate-revision: 636
              cert-manager.io/private-key-secret-name: istiod-1-11-0-nrsp8
API Version:  cert-manager.io/v1
Kind:         CertificateRequest
Metadata:
  Creation Timestamp:  2021-10-18T21:52:03Z
  Generate Name:       istiod-1-11-0-
  Generation:          1
...
Spec:
  Duration:  1h0m0s
  Groups:
    system:serviceaccounts
    system:serviceaccounts:cert-manager
    system:authenticated
  Issuer Ref:
    Group:   cert-manager.io
    Kind:    ClusterIssuer
    Name:    istio-vault-ca
  Request:   ${CR request}
  UID:       ef8b5dce-01a4-487d-b96f-0e07a6596aac
  Username:  system:serviceaccount:cert-manager:cert-manager
Status:
  Conditions:
    Last Transition Time:  2021-10-18T21:52:03Z
    Message:               Certificate request has been approved by cert-manager.io
    Reason:                cert-manager.io
    Status:                True
    Type:                  Approved
    Last Transition Time:  2021-10-18T21:53:41Z
    Message:               Failed to initialise vault client for signing: error reading Kubernetes service account token from vault-cluster-issuer-static-token: error calling Vault server: Error making API request.

URL: POST ${MY-VAULT-INSTANCE}
Code: 504. Raw Message:

upstream request timeout
    Reason:  Pending
    Status:  False
    Type:    Ready
Events:      <none>

Expected behaviour: The CertificateRequest should either be set to Failed, or the Pending CR should re-attempt issuance.

Steps to reproduce the bug: Use the Vault CA to create a Certificate manifest with an expiry (a short renewBefore value will help with testing). When the renewal time comes, make the issuance fail for some reason (e.g. a network timeout). The Certificate will stay Pending with no retry.

Anything else we need to know?: I believe this bug happens because the Vault issuer doesn’t return an error when it fails to build the Vault client. If an error were returned, the controller would re-queue the resource and process it again until the error went away (or the request failed). Is there a reason this isn’t the current behavior?
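For illustration, here is a minimal, self-contained sketch of why returning the error matters. The names (toyReconcile, buildVaultClient) are hypothetical stand-ins, not cert-manager’s actual code: a controller sync that returns a non-nil error gets its item re-queued with rate-limited backoff, while one that only records Pending and returns nil never retries.

package main

import (
	"errors"
	"fmt"
)

var errVaultUnreachable = errors.New("error calling Vault server: upstream request timeout")

// buildVaultClient is a stub that fails in the way described in this issue.
func buildVaultClient() (interface{}, error) {
	return nil, errVaultUnreachable
}

// toyReconcile mimics a controller sync handler: returning a non-nil error
// tells the workqueue to re-queue the item with rate-limited (exponential)
// backoff; returning nil drops it until another watch event arrives.
func toyReconcile() error {
	if _, err := buildVaultClient(); err != nil {
		// Swallowing the error here (marking the CertificateRequest Pending
		// and returning nil) is what leaves it stuck. Propagating the error
		// instead triggers a retry.
		return fmt.Errorf("failed to initialise vault client for signing: %w", err)
	}
	return nil
}

func main() {
	if err := toyReconcile(); err != nil {
		fmt.Println("would re-queue:", err)
	}
}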

Environment details:

  • Kubernetes version: 1.20
  • Cloud-provider/provisioner: GKE
  • cert-manager version: v1.4.1
  • Install method: Helm

/kind bug

Most upvoted comments

Thanks for showing interest in fixing this @mmontes11

We haven’t had a chance to look into this issue yet.

I think the reason this was originally set to Pending was that we were afraid of spamming the Vault server. Having said that, I think the approach you suggest should be safe enough:

400–500 HTTP codes: handle resp.StatusCode in internal/vault by returning a custom error and handling it accordingly in the certificaterequest controller. We could then mark the CertificateRequest as Failed instead of Pending, and this will perform the retries we need.

This makes sense I think. It would mean that when requests to Vault fail with a 4xx error, they get retried with exponential backoff, which should prevent us from spamming Vault 👍🏼
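As a rough sketch of what that could look like (hypothetical names such as VaultRequestError and classifyVaultError; this is not cert-manager’s actual internal/vault API), a typed error carrying the Vault HTTP status lets the controller decide between Failed and a retry:

package main

import (
	"errors"
	"fmt"
)

// VaultRequestError carries the HTTP status code returned by Vault so the
// certificaterequest controller can tell client errors from transient ones.
type VaultRequestError struct {
	StatusCode int
	Msg        string
}

func (e *VaultRequestError) Error() string {
	return fmt.Sprintf("vault request failed with status %d: %s", e.StatusCode, e.Msg)
}

// classifyVaultError decides how the controller should react: mark the
// CertificateRequest Failed for 4xx responses, or return the error so the
// workqueue retries what is likely a transient failure.
func classifyVaultError(err error) (markFailed bool) {
	var vErr *VaultRequestError
	if errors.As(err, &vErr) && vErr.StatusCode >= 400 && vErr.StatusCode < 500 {
		return true
	}
	return false
}

func main() {
	err := fmt.Errorf("signing certificate: %w", &VaultRequestError{StatusCode: 403, Msg: "permission denied"})
	fmt.Println("mark Failed:", classifyVaultError(err)) // true -> set Failed; retries then back off
}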

Network errors: it’s hard to detect network errors accurately, so to mitigate this we could tune the Vault client configuration, for example by increasing MaxRetries.
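For reference, raising the retry count on the HashiCorp Vault Go client would look roughly like this; the address, timeout and retry count are placeholder values:

package main

import (
	"fmt"
	"log"
	"time"

	vault "github.com/hashicorp/vault/api"
)

func main() {
	cfg := vault.DefaultConfig()
	cfg.Address = "https://vault.example.com:8200" // placeholder address
	cfg.MaxRetries = 5                             // retries 5xx and connection errors (default is 2)
	cfg.Timeout = 30 * time.Second

	client, err := vault.NewClient(cfg)
	if err != nil {
		log.Fatalf("creating vault client: %v", err)
	}
	fmt.Println("client configured, max retries:", cfg.MaxRetries)
	_ = client
}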

I think that in the case of a non-4xx Vault error, we should be able to simply return an error in the controller to reconcile the same CertificateRequest with the default Kubernetes controller backoff. This is also how the ACME controllers currently work, see e.g. here.

Basically whenever cert-manager fails to request our multi cloud Vault,

Out of interest, what errors do you typically see?

(Entering exponential backoff when an issuance request to Vault fails with a 4xx error is a slight change in behaviour: previously the CertificateRequest would have remained Pending, so if the cause of the failure was fixed, e.g. by modifying the Secret holding Vault’s credentials, the CertificateRequest would have been retried immediately, whereas now we will back off for at least an hour (a failed CertificateRequest means exponential backoff). I think this is fine though, as we should have checked the Secret before attempting issuance, and a more advanced issuer health check could also be added.)

We are running into this issue and would like to contribute a fix.

@irbekrm instead of trying to fix it for ALL retriable cases, can we just ignore the cases that are not clear for now? That way we at least have some error handling and better reliability.

Please let me know what you think. I can raise a PR if that sounds ok.

@mmontes11 that is interesting. If you actually saw this error before and it was also a Vault connectivity issue, then we are in the same situation, and upgrading cert-manager won’t fix anything until this is properly handled.

Hello, we have faced the same situation, but with a different error this time:

Failed to initialise vault client for signing: error reading Kubernetes
        service account token from cert-manager-vault-issuer-token-2m8xd: secret
        "cert-manager-vault-issuer-token-2m8xd" not found

The secret actually existed in this case.

One thing to note here is that we are still on cert-manager 1.5.4 while we have upgraded to Kubernetes 1.23, so it might be a version problem; we will upgrade to the latest cert-manager soon.