cert-manager: Certificate sometimes fails to issue properly
Describe the bug:
We are using cert-manager + letsencrypt staging in our test pipeline. The pipeline only runs a few times a week, so it’s well under the limits of letsencrypt staging. Our certificate sometimes fails to issue properly. We know that our config is OK since it usually works fine. However in the last week we had two failures caused by cert-manager failing to issue the certificate. We did not encounter this problem before this week.
Our certificate has this spec:
spec:
dnsNames:
- '*.test-khhyml7mh2jkg732.loci.ubi.com'
issuerRef:
kind: ClusterIssuer
name: letsencrypt-test
secretName: wildcard-cert-tls
Our cluster issuer has this spec:
spec:
acme:
email: <redacted>@ubisoft.com
preferredChain: ""
privateKeySecretRef:
name: letsencrypt-test-issuer-account-key
server: https://acme-staging-v02.api.letsencrypt.org/directory
solvers:
- dns01:
cloudDNS:
project: bob-dbaas-dev
serviceAccountSecretRef:
key: credentials.json
name: cert-manager-google-secret
The cert manager logs are below:
I0120 16:14:44.476975 1 conditions.go:173] Setting lastTransitionTime for Certificate "wildcard-cert" condition "Issuing" to 2021-01-20 16:14:44.476965945 +0000 UTC m=+13.944761228
I0120 16:14:44.477078 1 conditions.go:173] Setting lastTransitionTime for Certificate "wildcard-cert" condition "Ready" to 2021-01-20 16:14:44.477064448 +0000 UTC m=+13.944859754
E0120 16:14:44.575829 1 controller.go:158] cert-manager/controller/CertificateTrigger "msg"="re-queuing item due to error processing" "error"="Operation cannot be fulfilled on certificates.cert-manager.io \"wildcard-cert\": the object has been modified; please apply your changes to the latest version and try again" "key"="mongo-operator/wildcard-cert"
I0120 16:14:44.575909 1 conditions.go:173] Setting lastTransitionTime for Certificate "wildcard-cert" condition "Issuing" to 2021-01-20 16:14:44.575903168 +0000 UTC m=+14.043698423
I0120 16:14:44.580373 1 setup.go:90] cert-manager/controller/clusterissuers "msg"="generating acme account private key" "related_resource_kind"="Secret" "related_resource_name"="letsencrypt-test-issuer-account-key" "related_resource_namespace"="cert-manager" "resource_kind"="ClusterIssuer" "resource_name"="letsencrypt-test" "resource_namespace"="" "resource_version"="v1"
I0120 16:14:44.813391 1 conditions.go:233] Setting lastTransitionTime for CertificateRequest "wildcard-cert-mjj2b" condition "Ready" to 2021-01-20 16:14:44.813382149 +0000 UTC m=+14.281177396
I0120 16:14:44.922381 1 setup.go:178] cert-manager/controller/clusterissuers "msg"="ACME server URL host and ACME private key registration host differ. Re-checking ACME account registration" "related_resource_kind"="Secret" "related_resource_name"="letsencrypt-test-issuer-account-key" "related_resource_namespace"="cert-manager" "resource_kind"="ClusterIssuer" "resource_name"="letsencrypt-test" "resource_namespace"="" "resource_version"="v1"
I0120 16:14:45.391446 1 setup.go:270] cert-manager/controller/clusterissuers "msg"="verified existing registration with ACME server" "related_resource_kind"="Secret" "related_resource_name"="letsencrypt-test-issuer-account-key" "related_resource_namespace"="cert-manager" "resource_kind"="ClusterIssuer" "resource_name"="letsencrypt-test" "resource_namespace"="" "resource_version"="v1"
I0120 16:14:45.391473 1 conditions.go:92] Setting lastTransitionTime for Issuer "letsencrypt-test" condition "Ready" to 2021-01-20 16:14:45.391467192 +0000 UTC m=+14.859262437
I0120 16:14:45.526098 1 setup.go:178] cert-manager/controller/clusterissuers "msg"="ACME server URL host and ACME private key registration host differ. Re-checking ACME account registration" "related_resource_kind"="Secret" "related_resource_name"="letsencrypt-test-issuer-account-key" "related_resource_namespace"="cert-manager" "resource_kind"="ClusterIssuer" "resource_name"="letsencrypt-test" "resource_namespace"="" "resource_version"="v1"
E0120 16:14:45.620396 1 controller.go:158] cert-manager/controller/orders "msg"="re-queuing item due to error processing" "error"="ACME client for issuer not initialised/available" "key"="mongo-operator/wildcard-cert-mjj2b-2407777396"
I0120 16:14:46.048213 1 setup.go:270] cert-manager/controller/clusterissuers "msg"="verified existing registration with ACME server" "related_resource_kind"="Secret" "related_resource_name"="letsencrypt-test-issuer-account-key" "related_resource_namespace"="cert-manager" "resource_kind"="ClusterIssuer" "resource_name"="letsencrypt-test" "resource_namespace"="" "resource_version"="v1"
I0120 16:14:49.922449 1 setup.go:178] cert-manager/controller/clusterissuers "msg"="ACME server URL host and ACME private key registration host differ. Re-checking ACME account registration" "related_resource_kind"="Secret" "related_resource_name"="letsencrypt-test-issuer-account-key" "related_resource_namespace"="cert-manager" "resource_kind"="ClusterIssuer" "resource_name"="letsencrypt-test" "resource_namespace"="" "resource_version"="v1"
I0120 16:14:50.388598 1 setup.go:270] cert-manager/controller/clusterissuers "msg"="verified existing registration with ACME server" "related_resource_kind"="Secret" "related_resource_name"="letsencrypt-test-issuer-account-key" "related_resource_namespace"="cert-manager" "resource_kind"="ClusterIssuer" "resource_name"="letsencrypt-test" "resource_namespace"="" "resource_version"="v1"
E0120 16:14:53.757052 1 sync.go:182] cert-manager/controller/challenges "msg"="propagation check failed" "error"="DNS record for \"test-khhyml7mh2jkg732.loci.ubi.com\" not yet propagated" "dnsName"="test-khhyml7mh2jkg732.loci.ubi.com" "resource_kind"="Challenge" "resource_name"="wildcard-cert-mjj2b-2407777396-2216626796" "resource_namespace"="mongo-operator" "resource_version"="v1" "type"="DNS-01"
E0120 16:14:53.791518 1 sync.go:182] cert-manager/controller/challenges "msg"="propagation check failed" "error"="DNS record for \"test-khhyml7mh2jkg732.loci.ubi.com\" not yet propagated" "dnsName"="test-khhyml7mh2jkg732.loci.ubi.com" "resource_kind"="Challenge" "resource_name"="wildcard-cert-mjj2b-2407777396-2216626796" "resource_namespace"="mongo-operator" "resource_version"="v1" "type"="DNS-01"
E0120 16:15:03.762095 1 sync.go:182] cert-manager/controller/challenges "msg"="propagation check failed" "error"="DNS record for \"test-khhyml7mh2jkg732.loci.ubi.com\" not yet propagated" "dnsName"="test-khhyml7mh2jkg732.loci.ubi.com" "resource_kind"="Challenge" "resource_name"="wildcard-cert-mjj2b-2407777396-2216626796" "resource_namespace"="mongo-operator" "resource_version"="v1" "type"="DNS-01"
E0120 16:15:13.767389 1 sync.go:182] cert-manager/controller/challenges "msg"="propagation check failed" "error"="DNS record for \"test-khhyml7mh2jkg732.loci.ubi.com\" not yet propagated" "dnsName"="test-khhyml7mh2jkg732.loci.ubi.com" "resource_kind"="Challenge" "resource_name"="wildcard-cert-mjj2b-2407777396-2216626796" "resource_namespace"="mongo-operator" "resource_version"="v1" "type"="DNS-01"
E0120 16:15:23.789744 1 sync.go:182] cert-manager/controller/challenges "msg"="propagation check failed" "error"="DNS record for \"test-khhyml7mh2jkg732.loci.ubi.com\" not yet propagated" "dnsName"="test-khhyml7mh2jkg732.loci.ubi.com" "resource_kind"="Challenge" "resource_name"="wildcard-cert-mjj2b-2407777396-2216626796" "resource_namespace"="mongo-operator" "resource_version"="v1" "type"="DNS-01"
E0120 16:15:33.794373 1 sync.go:182] cert-manager/controller/challenges "msg"="propagation check failed" "error"="DNS record for \"test-khhyml7mh2jkg732.loci.ubi.com\" not yet propagated" "dnsName"="test-khhyml7mh2jkg732.loci.ubi.com" "resource_kind"="Challenge" "resource_name"="wildcard-cert-mjj2b-2407777396-2216626796" "resource_namespace"="mongo-operator" "resource_version"="v1" "type"="DNS-01"
E0120 16:15:43.799491 1 sync.go:182] cert-manager/controller/challenges "msg"="propagation check failed" "error"="DNS record for \"test-khhyml7mh2jkg732.loci.ubi.com\" not yet propagated" "dnsName"="test-khhyml7mh2jkg732.loci.ubi.com" "resource_kind"="Challenge" "resource_name"="wildcard-cert-mjj2b-2407777396-2216626796" "resource_namespace"="mongo-operator" "resource_version"="v1" "type"="DNS-01"
E0120 16:15:53.835277 1 sync.go:182] cert-manager/controller/challenges "msg"="propagation check failed" "error"="DNS record for \"test-khhyml7mh2jkg732.loci.ubi.com\" not yet propagated" "dnsName"="test-khhyml7mh2jkg732.loci.ubi.com" "resource_kind"="Challenge" "resource_name"="wildcard-cert-mjj2b-2407777396-2216626796" "resource_namespace"="mongo-operator" "resource_version"="v1" "type"="DNS-01"
E0120 16:17:34.960129 1 sync.go:354] cert-manager/controller/challenges/acceptChallenge "msg"="error waiting for authorization" "error"="acme: authorization error for test-khhyml7mh2jkg732.loci.ubi.com: 400 urn:ietf:params:acme:error:dns: During secondary validation: DNS problem: NXDOMAIN looking up TXT for _acme-challenge.test-khhyml7mh2jkg732.loci.ubi.com - check that a DNS record exists for this domain" "dnsName"="test-khhyml7mh2jkg732.loci.ubi.com" "resource_kind"="Challenge" "resource_name"="wildcard-cert-mjj2b-2407777396-2216626796" "resource_namespace"="mongo-operator" "resource_version"="v1" "type"="DNS-01"
I0120 16:17:35.135532 1 conditions.go:162] Found status change for Certificate "wildcard-cert" condition "Issuing": "True" -> "False"; setting lastTransitionTime to 2021-01-20 16:17:35.135521875 +0000 UTC m=+184.603317157
E0120 16:17:35.174806 1 controller.go:158] cert-manager/controller/CertificateReadiness "msg"="re-queuing item due to error processing" "error"="Operation cannot be fulfilled on certificates.cert-manager.io \"wildcard-cert\": the object has been modified; please apply your changes to the latest version and try again" "key"="mongo-operator/wildcard-cert"
I0120 16:17:35.176107 1 trigger_controller.go:162] cert-manager/controller/CertificateTrigger "msg"="Not re-issuing certificate as an attempt has been made in the last hour" "key"="mongo-operator/wildcard-cert" "retry_after"="2021-01-20T17:17:35Z"
I0120 16:17:35.233041 1 trigger_controller.go:162] cert-manager/controller/CertificateTrigger "msg"="Not re-issuing certificate as an attempt has been made in the last hour" "key"="mongo-operator/wildcard-cert" "retry_after"="2021-01-20T17:17:35Z"
E0120 16:17:35.247545 1 controller.go:158] cert-manager/controller/CertificateKeyManager "msg"="re-queuing item due to error processing" "error"="Operation cannot be fulfilled on certificates.cert-manager.io \"wildcard-cert\": the object has been modified; please apply your changes to the latest version and try again" "key"="mongo-operator/wildcard-cert"
Also, status on certificate
- lastTransitionTime: "2021-01-20T17:17:35Z"
message: 'The certificate request has failed to complete and will be retried:
Failed to wait for order resource "wildcard-cert-mjj2b-2407777396" to become
ready: order is in "invalid" state: '
reason: Failed
status: "False"
type: Issuing
Expected behaviour:
Certificate should issue properly
Steps to reproduce the bug:
Setup a cluster issuer with letsencrypt staging, ask for a certificate.
Environment details:
- Kubernetes version: 1.16
- Cloud-provider/provisioner: GKE
- cert-manager version: 1.1.0
- Install method: helm
/kind bug
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 40
- Comments: 40 (2 by maintainers)
Upon continuing investigation, a couple more thoughts:
The Certificate and CertificateRequest objects show the error:
The certificate request has failed to complete and will be retried: Failed to wait for order resource "cert-api-xxx-com-ms2bc-3188650946" to become ready: order is in "invalid" state:The second part,
Failed to wait for order..., doesn’t appear to match actual logic. I interpret the wording as “the CertificateRequest screwed up and improperly didn’t wait for the Order to become ready”. However, the source indicates that “waiting” isn’t really the issue. In reality, the Order expressly failed and thus the CertificateRequest itself is now being marked as failed.The first part is also misleading as it appears that the Order being in an “invalid” state is a terminal condition for cert-manager and the CertificateRequest cannot be successfully retried (at least at present).
The DNS errors seem to be on the LetsEncrypt side. I don’t know the ACME protocol (or LE’s implementation) well enough to know if DNS errors during challenges can be retried or not. If so, then it would seem cert-manager needs to do so. If not, then cert-manager needs to delete the entire CertificateRequest (or Order) and then recreate/retry accordingly.
In the absence of such retry logic, it’d be quite helpful if cert-manager’s log and error messages were changed to indicate that a retry is not possible and that the CertificateRequest should be manually deleted.
A potentially improved error message, addressing all of the above:
The certificate request "cert-api-xxx-com-ms2bc" has failed to complete and should be deleted to retry: Order resource "cert-api-xxx-com-ms2bc-3188650946" is not ready: order state is "invalid":Additionally, as I don’t think this is being logged yet, it’d be super helpful to log this at Error level so it’s more easily discovered and manually corrected.For anyone else stumbling across this thread, I’ve found that recovery requires deleting the CertificateRequest then manually triggering a Cert renewal (or waiting for the next hourly cycle). Deleting the Order only is insufficient.
+1 found here too
Updating to cert-manager v1.5.0 fixed the issue for us and unstuck the failed renewals from the previous version 🎉
I believe this issue can be closed.
We are aiming to improve testing in this area over the next couple of months with our Google Summer of Code student. Hopefully this will surface bugs to us.
/priority important-soon
The same happens with Vault Issuer (related https://github.com/jetstack/cert-manager/issues/3941 ). It seems that after
CertificateRequesttransitions toFailedstate it never gets removed, which is why removing it manually fixes the issue.The
requestmanager-controlleronly removesCertificateRequestswhenCertificatespec changed, it doesn’t seem to check state at all. I think it should also remove failed requests since those will never get processed again.https://github.com/jetstack/cert-manager/blob/c7514d92624613395eb00da8594ffa5391a0b169/pkg/controller/certificates/requestmanager/requestmanager_controller.go#L189
+1 same issue here
Thanks for the great comment @hugoShaka !
There was indeed a bug in cert-manager where failed
CertificateRequests were not retried that should now be fixed in v1.5.0 release so hopefully that’ll help with some of the cases where renewals are not retried after one time failure.From my understanding #3501 is totally unrelated, the cert is issued, they are failing to use it, this is likely not even a certmanager issue but an ingress-controller/http-client one.
I’m not sure for #3896 but think it’s not related either: the controller refuses to treat the orders because they don’t belong to the certificate request, which is a bit weird, as if something had removed the ownership information. At least, there’s no such thing happening in @gmauleon 's examples.
The following message
is not the root cause, it’s something benign that happens when multiple controllers are editing the same resource at the same time. This is handled gracefully by re-queuing and the resource will eventually reach a stable state a few reconciliation cycles after (as long as the controllers are not fighting against each other, which is not the case here).
As far as I understand, this issue is about the operator not being able to do something automatically with a failed certificate request (clean it for example). What seems to happen is a flakiness at LE level (during the secondary challenge validation most of the time) causing at least a challenge of the
CertificateRequestto fail.One could expect the operator to retry at least once with another fresh request, a couple of days after. The risk of retrying to aggressively is to flood LE with challenges while something is really broken (and burn your quotas). It’s legitimate for the operator to let things as-is for a human operator to investigate if something goes sideways. I think this is why the controller does not removes
FailedCertificateRequeststoday. This is not really a bug and will likely require more time and design to address.In the meantime the easiest workaround is to look for
CertificateRequestsin aFailedstate and remove them while being careful to not retry too much if something is really broken. If you handle a few clusters this can easily be done by hand. This becomes an issue when you manage more clusters and will require a bit of development on your side. The simplest method would be to run a cron job deleting failedCertificateRequestson a daily basis./milestone Next
I keep having this problem over and over (we have over 60 clusters with cert-manager installed so it’s pretty easy statistically…). Anyone had the time to have a look at that?
Challenge invalid because of a DNS problem, certificate/order never recover from there.
I had some similar issues on multiple clusters with version
1.0.4, seems to appear randomly 😢 . Symptoms: some challenges are stuck in an invalid state that will never recover (i.e: 70% will be good and valid, the rest are stuck in invalid, got 12 hostnames including wildcards to validate on this particular certificate).It takes some tinkering to unlock the situation (delete invalid challenges or delete the whole certificate request + wait for the hour retry condition to kick in).
Here are the logs (the one I have between a pod restart and when I actually connected to the cluster after some custom alerts raised my attention).
FYI: Last time I had the problem I deleted the whole certificatrequest (which cleaned-up order and challenges), when it retried, it only issued some of the challenges (probably those that were invalid before???) before issuing the certificate successfully.
Seeing a lot of https://github.com/jetstack/cert-manager/issues/3516 problematic logs so just mentioning it here.
I can confirm the same behavior occurs in our CI runs with an uptick in failure rate in the last couple of days.
Notice
SERVFAIL looking up TXTinstead ofSERVFAIL looking up CAAas reported by @gmauleon in https://github.com/jetstack/cert-manager/issues/3594#issuecomment-773450922.