cert-manager: Certificate sometimes fails to issue properly

Describe the bug:

We are using cert-manager with Let's Encrypt staging in our test pipeline. The pipeline only runs a few times a week, so it is well under the Let's Encrypt staging rate limits. Our certificate sometimes fails to issue. We know our config is OK since it usually works fine; however, in the last week we had two failures caused by cert-manager failing to issue the certificate. We had not encountered this problem before this week.

Our certificate has this spec:

  spec:
    dnsNames:
    - '*.test-khhyml7mh2jkg732.loci.ubi.com'
    issuerRef:
      kind: ClusterIssuer
      name: letsencrypt-test
    secretName: wildcard-cert-tls

Our cluster issuer has this spec:

  spec:
    acme:
      email: <redacted>@ubisoft.com
      preferredChain: ""
      privateKeySecretRef:
        name: letsencrypt-test-issuer-account-key
      server: https://acme-staging-v02.api.letsencrypt.org/directory
      solvers:
      - dns01:
          cloudDNS:
            project: bob-dbaas-dev
            serviceAccountSecretRef:
              key: credentials.json
              name: cert-manager-google-secret

The cert-manager logs are below:

I0120 16:14:44.476975       1 conditions.go:173] Setting lastTransitionTime for Certificate "wildcard-cert" condition "Issuing" to 2021-01-20 16:14:44.476965945 +0000 UTC m=+13.944761228
I0120 16:14:44.477078       1 conditions.go:173] Setting lastTransitionTime for Certificate "wildcard-cert" condition "Ready" to 2021-01-20 16:14:44.477064448 +0000 UTC m=+13.944859754
E0120 16:14:44.575829       1 controller.go:158] cert-manager/controller/CertificateTrigger "msg"="re-queuing item due to error processing" "error"="Operation cannot be fulfilled on certificates.cert-manager.io \"wildcard-cert\": the object has been modified; please apply your changes to the latest version and try again" "key"="mongo-operator/wildcard-cert" 
I0120 16:14:44.575909       1 conditions.go:173] Setting lastTransitionTime for Certificate "wildcard-cert" condition "Issuing" to 2021-01-20 16:14:44.575903168 +0000 UTC m=+14.043698423
I0120 16:14:44.580373       1 setup.go:90] cert-manager/controller/clusterissuers "msg"="generating acme account private key" "related_resource_kind"="Secret" "related_resource_name"="letsencrypt-test-issuer-account-key" "related_resource_namespace"="cert-manager" "resource_kind"="ClusterIssuer" "resource_name"="letsencrypt-test" "resource_namespace"="" "resource_version"="v1" 
I0120 16:14:44.813391       1 conditions.go:233] Setting lastTransitionTime for CertificateRequest "wildcard-cert-mjj2b" condition "Ready" to 2021-01-20 16:14:44.813382149 +0000 UTC m=+14.281177396
I0120 16:14:44.922381       1 setup.go:178] cert-manager/controller/clusterissuers "msg"="ACME server URL host and ACME private key registration host differ. Re-checking ACME account registration" "related_resource_kind"="Secret" "related_resource_name"="letsencrypt-test-issuer-account-key" "related_resource_namespace"="cert-manager" "resource_kind"="ClusterIssuer" "resource_name"="letsencrypt-test" "resource_namespace"="" "resource_version"="v1" 
I0120 16:14:45.391446       1 setup.go:270] cert-manager/controller/clusterissuers "msg"="verified existing registration with ACME server" "related_resource_kind"="Secret" "related_resource_name"="letsencrypt-test-issuer-account-key" "related_resource_namespace"="cert-manager" "resource_kind"="ClusterIssuer" "resource_name"="letsencrypt-test" "resource_namespace"="" "resource_version"="v1" 
I0120 16:14:45.391473       1 conditions.go:92] Setting lastTransitionTime for Issuer "letsencrypt-test" condition "Ready" to 2021-01-20 16:14:45.391467192 +0000 UTC m=+14.859262437
I0120 16:14:45.526098       1 setup.go:178] cert-manager/controller/clusterissuers "msg"="ACME server URL host and ACME private key registration host differ. Re-checking ACME account registration" "related_resource_kind"="Secret" "related_resource_name"="letsencrypt-test-issuer-account-key" "related_resource_namespace"="cert-manager" "resource_kind"="ClusterIssuer" "resource_name"="letsencrypt-test" "resource_namespace"="" "resource_version"="v1" 
E0120 16:14:45.620396       1 controller.go:158] cert-manager/controller/orders "msg"="re-queuing item due to error processing" "error"="ACME client for issuer not initialised/available" "key"="mongo-operator/wildcard-cert-mjj2b-2407777396" 
I0120 16:14:46.048213       1 setup.go:270] cert-manager/controller/clusterissuers "msg"="verified existing registration with ACME server" "related_resource_kind"="Secret" "related_resource_name"="letsencrypt-test-issuer-account-key" "related_resource_namespace"="cert-manager" "resource_kind"="ClusterIssuer" "resource_name"="letsencrypt-test" "resource_namespace"="" "resource_version"="v1" 
I0120 16:14:49.922449       1 setup.go:178] cert-manager/controller/clusterissuers "msg"="ACME server URL host and ACME private key registration host differ. Re-checking ACME account registration" "related_resource_kind"="Secret" "related_resource_name"="letsencrypt-test-issuer-account-key" "related_resource_namespace"="cert-manager" "resource_kind"="ClusterIssuer" "resource_name"="letsencrypt-test" "resource_namespace"="" "resource_version"="v1" 
I0120 16:14:50.388598       1 setup.go:270] cert-manager/controller/clusterissuers "msg"="verified existing registration with ACME server" "related_resource_kind"="Secret" "related_resource_name"="letsencrypt-test-issuer-account-key" "related_resource_namespace"="cert-manager" "resource_kind"="ClusterIssuer" "resource_name"="letsencrypt-test" "resource_namespace"="" "resource_version"="v1" 
E0120 16:14:53.757052       1 sync.go:182] cert-manager/controller/challenges "msg"="propagation check failed" "error"="DNS record for \"test-khhyml7mh2jkg732.loci.ubi.com\" not yet propagated" "dnsName"="test-khhyml7mh2jkg732.loci.ubi.com" "resource_kind"="Challenge" "resource_name"="wildcard-cert-mjj2b-2407777396-2216626796" "resource_namespace"="mongo-operator" "resource_version"="v1" "type"="DNS-01" 
E0120 16:14:53.791518       1 sync.go:182] cert-manager/controller/challenges "msg"="propagation check failed" "error"="DNS record for \"test-khhyml7mh2jkg732.loci.ubi.com\" not yet propagated" "dnsName"="test-khhyml7mh2jkg732.loci.ubi.com" "resource_kind"="Challenge" "resource_name"="wildcard-cert-mjj2b-2407777396-2216626796" "resource_namespace"="mongo-operator" "resource_version"="v1" "type"="DNS-01" 
E0120 16:15:03.762095       1 sync.go:182] cert-manager/controller/challenges "msg"="propagation check failed" "error"="DNS record for \"test-khhyml7mh2jkg732.loci.ubi.com\" not yet propagated" "dnsName"="test-khhyml7mh2jkg732.loci.ubi.com" "resource_kind"="Challenge" "resource_name"="wildcard-cert-mjj2b-2407777396-2216626796" "resource_namespace"="mongo-operator" "resource_version"="v1" "type"="DNS-01" 
E0120 16:15:13.767389       1 sync.go:182] cert-manager/controller/challenges "msg"="propagation check failed" "error"="DNS record for \"test-khhyml7mh2jkg732.loci.ubi.com\" not yet propagated" "dnsName"="test-khhyml7mh2jkg732.loci.ubi.com" "resource_kind"="Challenge" "resource_name"="wildcard-cert-mjj2b-2407777396-2216626796" "resource_namespace"="mongo-operator" "resource_version"="v1" "type"="DNS-01" 
E0120 16:15:23.789744       1 sync.go:182] cert-manager/controller/challenges "msg"="propagation check failed" "error"="DNS record for \"test-khhyml7mh2jkg732.loci.ubi.com\" not yet propagated" "dnsName"="test-khhyml7mh2jkg732.loci.ubi.com" "resource_kind"="Challenge" "resource_name"="wildcard-cert-mjj2b-2407777396-2216626796" "resource_namespace"="mongo-operator" "resource_version"="v1" "type"="DNS-01" 
E0120 16:15:33.794373       1 sync.go:182] cert-manager/controller/challenges "msg"="propagation check failed" "error"="DNS record for \"test-khhyml7mh2jkg732.loci.ubi.com\" not yet propagated" "dnsName"="test-khhyml7mh2jkg732.loci.ubi.com" "resource_kind"="Challenge" "resource_name"="wildcard-cert-mjj2b-2407777396-2216626796" "resource_namespace"="mongo-operator" "resource_version"="v1" "type"="DNS-01" 
E0120 16:15:43.799491       1 sync.go:182] cert-manager/controller/challenges "msg"="propagation check failed" "error"="DNS record for \"test-khhyml7mh2jkg732.loci.ubi.com\" not yet propagated" "dnsName"="test-khhyml7mh2jkg732.loci.ubi.com" "resource_kind"="Challenge" "resource_name"="wildcard-cert-mjj2b-2407777396-2216626796" "resource_namespace"="mongo-operator" "resource_version"="v1" "type"="DNS-01" 
E0120 16:15:53.835277       1 sync.go:182] cert-manager/controller/challenges "msg"="propagation check failed" "error"="DNS record for \"test-khhyml7mh2jkg732.loci.ubi.com\" not yet propagated" "dnsName"="test-khhyml7mh2jkg732.loci.ubi.com" "resource_kind"="Challenge" "resource_name"="wildcard-cert-mjj2b-2407777396-2216626796" "resource_namespace"="mongo-operator" "resource_version"="v1" "type"="DNS-01" 
E0120 16:17:34.960129       1 sync.go:354] cert-manager/controller/challenges/acceptChallenge "msg"="error waiting for authorization" "error"="acme: authorization error for test-khhyml7mh2jkg732.loci.ubi.com: 400 urn:ietf:params:acme:error:dns: During secondary validation: DNS problem: NXDOMAIN looking up TXT for _acme-challenge.test-khhyml7mh2jkg732.loci.ubi.com - check that a DNS record exists for this domain" "dnsName"="test-khhyml7mh2jkg732.loci.ubi.com" "resource_kind"="Challenge" "resource_name"="wildcard-cert-mjj2b-2407777396-2216626796" "resource_namespace"="mongo-operator" "resource_version"="v1" "type"="DNS-01" 
I0120 16:17:35.135532       1 conditions.go:162] Found status change for Certificate "wildcard-cert" condition "Issuing": "True" -> "False"; setting lastTransitionTime to 2021-01-20 16:17:35.135521875 +0000 UTC m=+184.603317157
E0120 16:17:35.174806       1 controller.go:158] cert-manager/controller/CertificateReadiness "msg"="re-queuing item due to error processing" "error"="Operation cannot be fulfilled on certificates.cert-manager.io \"wildcard-cert\": the object has been modified; please apply your changes to the latest version and try again" "key"="mongo-operator/wildcard-cert" 
I0120 16:17:35.176107       1 trigger_controller.go:162] cert-manager/controller/CertificateTrigger "msg"="Not re-issuing certificate as an attempt has been made in the last hour" "key"="mongo-operator/wildcard-cert" "retry_after"="2021-01-20T17:17:35Z"
I0120 16:17:35.233041       1 trigger_controller.go:162] cert-manager/controller/CertificateTrigger "msg"="Not re-issuing certificate as an attempt has been made in the last hour" "key"="mongo-operator/wildcard-cert" "retry_after"="2021-01-20T17:17:35Z"
E0120 16:17:35.247545       1 controller.go:158] cert-manager/controller/CertificateKeyManager "msg"="re-queuing item due to error processing" "error"="Operation cannot be fulfilled on certificates.cert-manager.io \"wildcard-cert\": the object has been modified; please apply your changes to the latest version and try again" "key"="mongo-operator/wildcard-cert" 

Additionally, the status on the Certificate:

  - lastTransitionTime: "2021-01-20T17:17:35Z"
    message: 'The certificate request has failed to complete and will be retried:
      Failed to wait for order resource "wildcard-cert-mjj2b-2407777396" to become
      ready: order is in "invalid" state: '
    reason: Failed
    status: "False"
    type: Issuing

Expected behaviour:

The certificate should be issued successfully.

Steps to reproduce the bug:

Set up a ClusterIssuer with Let's Encrypt staging and request a certificate.

Environment details:

  • Kubernetes version: 1.16
  • Cloud-provider/provisioner: GKE
  • cert-manager version: 1.1.0
  • Install method: helm

/kind bug

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 40
  • Comments: 40 (2 by maintainers)

Most upvoted comments

Upon continuing investigation, a couple more thoughts:

The Certificate and CertificateRequest objects show the error:

  The certificate request has failed to complete and will be retried: Failed to wait for order resource "cert-api-xxx-com-ms2bc-3188650946" to become ready: order is in "invalid" state:

The second part, Failed to wait for order..., doesn’t appear to match the actual logic. I interpret the wording as “the CertificateRequest screwed up and improperly didn’t wait for the Order to become ready”. However, the source indicates that “waiting” isn’t really the issue: in reality, the Order expressly failed, and thus the CertificateRequest itself is now being marked as failed.

The first part is also misleading as it appears that the Order being in an “invalid” state is a terminal condition for cert-manager and the CertificateRequest cannot be successfully retried (at least at present).

The DNS errors seem to be on the LetsEncrypt side. I don’t know the ACME protocol (or LE’s implementation) well enough to know if DNS errors during challenges can be retried or not. If so, then it would seem cert-manager needs to do so. If not, then cert-manager needs to delete the entire CertificateRequest (or Order) and then recreate/retry accordingly.

In the absence of such retry logic, it’d be quite helpful if cert-manager’s log and error messages were changed to indicate that a retry is not possible and that the CertificateRequest should be manually deleted.

A potentially improved error message, addressing all of the above:

  The certificate request "cert-api-xxx-com-ms2bc" has failed to complete and should be deleted to retry: Order resource "cert-api-xxx-com-ms2bc-3188650946" is not ready: order state is "invalid"

Additionally, as I don’t think this is being logged yet, it’d be super helpful to log this at Error level so it’s more easily discovered and manually corrected.


For anyone else stumbling across this thread, I’ve found that recovery requires deleting the CertificateRequest and then manually triggering a certificate renewal (or waiting for the next hourly cycle). Deleting only the Order is insufficient.

kubectl delete certificaterequest <cert-req-name>
kubectl cert-manager renew <cert-name>    # requires the cert-manager kubectl plugin

+1 found here too

Updating to cert-manager v1.5.0 fixed the issue for us and unstuck the failed renewals from the previous version 🎉

I believe this issue can be closed.

We are aiming to improve testing in this area over the next couple of months with our Google Summer of Code student. Hopefully this will surface bugs to us.

/priority important-soon

The same happens with Vault Issuer (related https://github.com/jetstack/cert-manager/issues/3941 ). It seems that after CertificateRequest transitions to Failed state it never gets removed, which is why removing it manually fixes the issue.

The requestmanager-controller only removes CertificateRequests when the Certificate spec changes; it doesn’t seem to check their state at all. I think it should also remove failed requests, since those will never be processed again.

https://github.com/jetstack/cert-manager/blob/c7514d92624613395eb00da8594ffa5391a0b169/pkg/controller/certificates/requestmanager/requestmanager_controller.go#L189
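Until the controller handles this itself, the stuck requests can at least be found mechanically. The filter below is a sketch, not an official tool: it assumes permanently failed requests carry a Ready condition with reason Failed, and the inline sample JSON (with made-up names) stands in for the output of `kubectl get certificaterequest -A -o json`.

```shell
# Sample standing in for `kubectl get certificaterequest -A -o json`.
# Namespaces and names here are invented for illustration.
sample='{"items":[
  {"metadata":{"namespace":"mongo-operator","name":"wildcard-cert-mjj2b"},
   "status":{"conditions":[{"type":"Ready","status":"False","reason":"Failed"}]}},
  {"metadata":{"namespace":"default","name":"healthy-cert-x7k2p"},
   "status":{"conditions":[{"type":"Ready","status":"True","reason":"Issued"}]}}
]}'

# Select requests whose Ready condition carries reason=Failed, i.e. the ones
# the controller will never pick up again and that must be deleted by hand.
echo "$sample" | jq -r '
  .items[]
  | select(any(.status.conditions[]?; .type == "Ready" and .reason == "Failed"))
  | "\(.metadata.namespace)/\(.metadata.name)"'
# → mongo-operator/wildcard-cert-mjj2b
```

In a real cluster you would pipe `kubectl get certificaterequest -A -o json` into the same jq expression instead of the sample variable.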

+1 same issue here

Thanks for the great comment @hugoShaka !

As far as I understand, this issue is about the operator not being able to do something automatically with a failed certificate request (clean it for example). …One could expect the operator to retry at least once with another fresh request, a couple of days after.

There was indeed a bug in cert-manager where failed CertificateRequests were not retried; it should now be fixed in the v1.5.0 release, so hopefully that will help with some of the cases where renewals are not retried after a one-time failure.

From my understanding #3501 is totally unrelated: the cert is issued but they are failing to use it, so this is likely not even a cert-manager issue but an ingress-controller/HTTP-client one.

I’m not sure about #3896, but I think it’s not related either: the controller refuses to treat the orders because they don’t belong to the certificate request, which is a bit weird, as if something had removed the ownership information. At least, there’s no such thing happening in @gmauleon’s examples.

The following message

the object has been modified; please apply your changes to the latest version and try again

is not the root cause; it’s something benign that happens when multiple controllers edit the same resource at the same time. This is handled gracefully by re-queuing, and the resource will eventually reach a stable state a few reconciliation cycles later (as long as the controllers are not fighting each other, which is not the case here).

As far as I understand, this issue is about the operator not being able to do something automatically with a failed certificate request (clean it for example). What seems to happen is a flakiness at LE level (during the secondary challenge validation most of the time) causing at least a challenge of the CertificateRequest to fail.

One could expect the operator to retry at least once with another fresh request a couple of days later. The risk of retrying too aggressively is flooding LE with challenges while something is really broken (and burning your quotas). It’s legitimate for the operator to leave things as-is for a human operator to investigate if something goes sideways. I think this is why the controller does not remove Failed CertificateRequests today. This is not really a bug and will likely require more time and design to address.

In the meantime the easiest workaround is to look for CertificateRequests in a Failed state and remove them while being careful to not retry too much if something is really broken. If you handle a few clusters this can easily be done by hand. This becomes an issue when you manage more clusters and will require a bit of development on your side. The simplest method would be to run a cron job deleting failed CertificateRequests on a daily basis.
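The daily cron job mentioned above could be sketched as the manifest below. This is an illustration, not an official fix: the name, namespace, schedule, image, and service account are all placeholders; the pod needs RBAC that allows listing and deleting certificaterequests.cert-manager.io; the image must ship both kubectl and jq; and clusters older than 1.21 need apiVersion batch/v1beta1 instead of batch/v1.

```yaml
apiVersion: batch/v1              # batch/v1beta1 on clusters < 1.21
kind: CronJob
metadata:
  name: cleanup-failed-certificaterequests   # placeholder name
  namespace: cert-manager
spec:
  schedule: "0 3 * * *"           # once a day
  jobTemplate:
    spec:
      template:
        spec:
          # Placeholder service account; it needs RBAC allowing
          # list/delete on certificaterequests.cert-manager.io.
          serviceAccountName: certificaterequest-cleaner
          restartPolicy: OnFailure
          containers:
          - name: cleanup
            image: example.com/kubectl-jq:latest   # any image with kubectl + jq
            command:
            - /bin/sh
            - -c
            - |
              kubectl get certificaterequest -A -o json \
                | jq -r '.items[]
                    | select(any(.status.conditions[]?;
                        .type == "Ready" and .reason == "Failed"))
                    | "\(.metadata.namespace) \(.metadata.name)"' \
                | while read -r ns name; do
                    kubectl delete certificaterequest -n "$ns" "$name"
                  done
```

Deleting the failed request lets the next hourly trigger create a fresh one, as described in the workaround earlier in the thread.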

/milestone Next

I keep having this problem over and over (we have over 60 clusters with cert-manager installed so it’s pretty easy statistically…). Anyone had the time to have a look at that?

Challenge invalid because of a DNS problem, certificate/order never recover from there.

Reason:      Error accepting authorization: acme: authorization error for preflight.s2s.prod.xxx: 400 urn:ietf:params:acme:error:dns: During secondary validation: DNS problem: query timed out looking up TXT for _acme-challenge.preflight.s2s.prod.xxx

I had some similar issues on multiple clusters with version 1.0.4; it seems to appear randomly 😢. Symptoms: some challenges are stuck in an invalid state that will never recover (e.g. 70% will be good and valid while the rest are stuck in invalid; I have 12 hostnames, including wildcards, to validate on this particular certificate).

It takes some tinkering to unlock the situation (delete the invalid challenges, or delete the whole CertificateRequest and wait for the hourly retry condition to kick in).

Here are the logs (the ones I have between a pod restart and when I actually connected to the cluster after some custom alerts caught my attention).

I0119 17:34:45.983004       1 setup.go:178] cert-manager/controller/issuers "msg"="ACME server URL host and ACME private key registration host differ. Re-checking ACME account registration" "related_resource_kind"="Secret" "related_resource_name"="letsencrypt" "related_resource_namespace"="fleet-system" "resource_kind"="Issuer" "resource_name"="gateway-routing-dd963271" "resource_namespace"="fleet-system" "resource_version"="v1" 
E0119 17:34:45.987641       1 controller.go:158] cert-manager/controller/orders "msg"="re-queuing item  due to error processing" "error"="ACME client for issuer not initialised/available" "key"="fleet-system/gateway-routing-dd963271-t6ss2-3392905940" 
I0119 17:34:45.988254       1 trigger_controller.go:162] cert-manager/controller/CertificateTrigger "msg"="Not re-issuing certificate as an attempt has been made in the last hour" "key"="fleet-system/gateway-routing-dd963271" "retry_after"="2021-01-19T17:51:59Z"
I0119 17:34:46.987142       1 setup.go:270] cert-manager/controller/issuers "msg"="verified existing registration with ACME server" "related_resource_kind"="Secret" "related_resource_name"="letsencrypt" "related_resource_namespace"="fleet-system" "resource_kind"="Issuer" "resource_name"="gateway-routing-dd963271" "resource_namespace"="fleet-system" "resource_version"="v1" 
I0119 17:34:50.953838       1 setup.go:178] cert-manager/controller/issuers "msg"="ACME server URL host and ACME private key registration host differ. Re-checking ACME account registration" "related_resource_kind"="Secret" "related_resource_name"="letsencrypt" "related_resource_namespace"="fleet-system" "resource_kind"="Issuer" "resource_name"="gateway-routing-dd963271" "resource_namespace"="fleet-system" "resource_version"="v1" 
E0119 17:34:50.988402       1 controller.go:158] cert-manager/controller/orders "msg"="re-queuing item  due to error processing" "error"="ACME client for issuer not initialised/available" "key"="fleet-system/gateway-routing-dd963271-t6ss2-3392905940" 
I0119 17:34:51.520383       1 setup.go:270] cert-manager/controller/issuers "msg"="verified existing registration with ACME server" "related_resource_kind"="Secret" "related_resource_name"="letsencrypt" "related_resource_namespace"="fleet-system" "resource_kind"="Issuer" "resource_name"="gateway-routing-dd963271" "resource_namespace"="fleet-system" "resource_version"="v1" 
I0119 17:51:59.001471       1 conditions.go:162] Found status change for Certificate "gateway-routing-dd963271" condition "Issuing": "False" -> "True"; setting lastTransitionTime to 2021-01-19 17:51:59.001455514 +0000 UTC m=+1033.276325779
I0119 17:51:59.190463       1 conditions.go:162] Found status change for Certificate "gateway-routing-dd963271" condition "Issuing": "True" -> "False"; setting lastTransitionTime to 2021-01-19 17:51:59.190452521 +0000 UTC m=+1033.465322769
I0119 17:51:59.221353       1 trigger_controller.go:162] cert-manager/controller/CertificateTrigger "msg"="Not re-issuing certificate as an attempt has been made in the last hour" "key"="fleet-system/gateway-routing-dd963271" "retry_after"="2021-01-19T18:51:59Z"
E0119 17:51:59.251015       1 controller.go:158] cert-manager/controller/CertificateReadiness "msg"="re-queuing item  due to error processing" "error"="Operation cannot be fulfilled on certificates.cert-manager.io \"gateway-routing-dd963271\": the object has been modified; please apply your changes to the latest version and try again" "key"="fleet-system/gateway-routing-dd963271" 
I0119 17:51:59.322411       1 trigger_controller.go:162] cert-manager/controller/CertificateTrigger "msg"="Not re-issuing certificate as an attempt has been made in the last hour" "key"="fleet-system/gateway-routing-dd963271" "retry_after"="2021-01-19T18:51:59Z"
I0119 18:51:59.001656       1 conditions.go:162] Found status change for Certificate "gateway-routing-dd963271" condition "Issuing": "False" -> "True"; setting lastTransitionTime to 2021-01-19 18:51:59.001640479 +0000 UTC m=+4633.276510746
I0119 18:51:59.148195       1 conditions.go:162] Found status change for Certificate "gateway-routing-dd963271" condition "Issuing": "True" -> "False"; setting lastTransitionTime to 2021-01-19 18:51:59.14818407 +0000 UTC m=+4633.423054320
I0119 18:51:59.192546       1 trigger_controller.go:162] cert-manager/controller/CertificateTrigger "msg"="Not re-issuing certificate as an attempt has been made in the last hour" "key"="fleet-system/gateway-routing-dd963271" "retry_after"="2021-01-19T19:51:59Z"
E0119 18:51:59.218734       1 controller.go:158] cert-manager/controller/CertificateReadiness "msg"="re-queuing item  due to error processing" "error"="Operation cannot be fulfilled on certificates.cert-manager.io \"gateway-routing-dd963271\": the object has been modified; please apply your changes to the latest version and try again" "key"="fleet-system/gateway-routing-dd963271" 
I0119 18:51:59.272712       1 trigger_controller.go:162] cert-manager/controller/CertificateTrigger "msg"="Not re-issuing certificate as an attempt has been made in the last hour" "key"="fleet-system/gateway-routing-dd963271" "retry_after"="2021-01-19T19:51:59Z"
I0119 19:51:59.001682       1 conditions.go:162] Found status change for Certificate "gateway-routing-dd963271" condition "Issuing": "False" -> "True"; setting lastTransitionTime to 2021-01-19 19:51:59.00166827 +0000 UTC m=+8233.276538589
I0119 19:51:59.138926       1 conditions.go:162] Found status change for Certificate "gateway-routing-dd963271" condition "Issuing": "True" -> "False"; setting lastTransitionTime to 2021-01-19 19:51:59.138915438 +0000 UTC m=+8233.413785668
E0119 19:51:59.182501       1 controller.go:158] cert-manager/controller/CertificateKeyManager "msg"="re-queuing item  due to error processing" "error"="Operation cannot be fulfilled on certificates.cert-manager.io \"gateway-routing-dd963271\": the object has been modified; please apply your changes to the latest version and try again" "key"="fleet-system/gateway-routing-dd963271" 
I0119 19:51:59.198833       1 trigger_controller.go:162] cert-manager/controller/CertificateTrigger "msg"="Not re-issuing certificate as an attempt has been made in the last hour" "key"="fleet-system/gateway-routing-dd963271" "retry_after"="2021-01-19T20:51:59Z"
E0119 19:51:59.225983       1 controller.go:158] cert-manager/controller/CertificateReadiness "msg"="re-queuing item  due to error processing" "error"="Operation cannot be fulfilled on certificates.cert-manager.io \"gateway-routing-dd963271\": the object has been modified; please apply your changes to the latest version and try again" "key"="fleet-system/gateway-routing-dd963271" 
I0119 19:51:59.278316       1 trigger_controller.go:162] cert-manager/controller/CertificateTrigger "msg"="Not re-issuing certificate as an attempt has been made in the last hour" "key"="fleet-system/gateway-routing-dd963271" "retry_after"="2021-01-19T20:51:59Z"
I0119 20:51:59.001479       1 conditions.go:162] Found status change for Certificate "gateway-routing-dd963271" condition "Issuing": "False" -> "True"; setting lastTransitionTime to 2021-01-19 20:51:59.00146562 +0000 UTC m=+11833.276335885
I0119 20:51:59.136032       1 conditions.go:162] Found status change for Certificate "gateway-routing-dd963271" condition "Issuing": "True" -> "False"; setting lastTransitionTime to 2021-01-19 20:51:59.136022062 +0000 UTC m=+11833.410892301
E0119 20:51:59.188299       1 controller.go:158] cert-manager/controller/CertificateReadiness "msg"="re-queuing item  due to error processing" "error"="Operation cannot be fulfilled on certificates.cert-manager.io \"gateway-routing-dd963271\": the object has been modified; please apply your changes to the latest version and try again" "key"="fleet-system/gateway-routing-dd963271" 
I0119 20:51:59.191348       1 trigger_controller.go:162] cert-manager/controller/CertificateTrigger "msg"="Not re-issuing certificate as an attempt has been made in the last hour" "key"="fleet-system/gateway-routing-dd963271" "retry_after"="2021-01-19T21:51:59Z"
I0119 20:51:59.258718       1 trigger_controller.go:162] cert-manager/controller/CertificateTrigger "msg"="Not re-issuing certificate as an attempt has been made in the last hour" "key"="fleet-system/gateway-routing-dd963271" "retry_after"="2021-01-19T21:51:59Z"
I0119 21:51:59.002095       1 conditions.go:162] Found status change for Certificate "gateway-routing-dd963271" condition "Issuing": "False" -> "True"; setting lastTransitionTime to 2021-01-19 21:51:59.002082432 +0000 UTC m=+15433.276952675
I0119 21:51:59.214110       1 conditions.go:162] Found status change for Certificate "gateway-routing-dd963271" condition "Issuing": "True" -> "False"; setting lastTransitionTime to 2021-01-19 21:51:59.214100805 +0000 UTC m=+15433.488971044
E0119 21:51:59.284494       1 controller.go:158] cert-manager/controller/CertificateKeyManager "msg"="re-queuing item  due to error processing" "error"="Operation cannot be fulfilled on certificates.cert-manager.io \"gateway-routing-dd963271\": the object has been modified; please apply your changes to the latest version and try again" "key"="fleet-system/gateway-routing-dd963271" 
E0119 21:51:59.287904       1 controller.go:158] cert-manager/controller/CertificateIssuing "msg"="re-queuing item  due to error processing" "error"="Operation cannot be fulfilled on certificates.cert-manager.io \"gateway-routing-dd963271\": the object has been modified; please apply your changes to the latest version and try again" "key"="fleet-system/gateway-routing-dd963271" 
I0119 21:51:59.288498       1 conditions.go:162] Found status change for Certificate "gateway-routing-dd963271" condition "Issuing": "True" -> "False"; setting lastTransitionTime to 2021-01-19 21:51:59.288486701 +0000 UTC m=+15433.563356974
I0119 21:51:59.347098       1 trigger_controller.go:162] cert-manager/controller/CertificateTrigger "msg"="Not re-issuing certificate as an attempt has been made in the last hour" "key"="fleet-system/gateway-routing-dd963271" "retry_after"="2021-01-19T22:51:59Z"
I0119 21:51:59.409006       1 trigger_controller.go:162] cert-manager/controller/CertificateTrigger "msg"="Not re-issuing certificate as an attempt has been made in the last hour" "key"="fleet-system/gateway-routing-dd963271" "retry_after"="2021-01-19T22:51:59Z"
E0119 21:51:59.438028       1 controller.go:158] cert-manager/controller/CertificateKeyManager "msg"="re-queuing item  due to error processing" "error"="Operation cannot be fulfilled on certificates.cert-manager.io \"gateway-routing-dd963271\": the object has been modified; please apply your changes to the latest version and try again" "key"="fleet-system/gateway-routing-dd963271" 
I0119 22:51:59.001937       1 conditions.go:162] Found status change for Certificate "gateway-routing-dd963271" condition "Issuing": "False" -> "True"; setting lastTransitionTime to 2021-01-19 22:51:59.00192406 +0000 UTC m=+19033.276794310
I0119 22:51:59.118746       1 conditions.go:162] Found status change for Certificate "gateway-routing-dd963271" condition "Issuing": "True" -> "False"; setting lastTransitionTime to 2021-01-19 22:51:59.118736797 +0000 UTC m=+19033.393607035
I0119 22:51:59.324823       1 trigger_controller.go:162] cert-manager/controller/CertificateTrigger "msg"="Not re-issuing certificate as an attempt has been made in the last hour" "key"="fleet-system/gateway-routing-dd963271" "retry_after"="2021-01-19T23:51:59Z"
E0119 22:51:59.390958       1 controller.go:158] cert-manager/controller/CertificateReadiness "msg"="re-queuing item  due to error processing" "error"="Operation cannot be fulfilled on certificates.cert-manager.io \"gateway-routing-dd963271\": the object has been modified; please apply your changes to the latest version and try again" "key"="fleet-system/gateway-routing-dd963271"

FYI: last time I had the problem I deleted the whole CertificateRequest (which cleaned up the Order and Challenges). When it retried, it only re-issued some of the challenges (probably those that were invalid before?) before issuing the certificate successfully.

Seeing a lot of https://github.com/jetstack/cert-manager/issues/3516 problematic logs so just mentioning it here.

I can confirm the same behavior occurs in our CI runs with an uptick in failure rate in the last couple of days.

Name:         https-cert-9hnvz-1446266605-1609522933
Namespace:    ingress
Labels:       <none>
Annotations:  <none>
API Version:  acme.cert-manager.io/v1
Kind:         Challenge
Metadata:
  Creation Timestamp:  2021-02-05T12:44:26Z
  Finalizers:
    finalizer.acme.cert-manager.io
  Generation:  1
     ...
    Manager:    controller
    Operation:  Update
    Time:       2021-02-05T12:48:39Z
  Owner References:
    API Version:           acme.cert-manager.io/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Order
    Name:                  https-cert-9hnvz-1446266605
    UID:                   534be502-c782-4e30-bbe1-79a5787fa395
  Resource Version:        5649
  Self Link:               /apis/acme.cert-manager.io/v1/namespaces/ingress/challenges/https-cert-9hnvz-1446266605-1609522933
  UID:                     c471fb8a-e362-4755-bd45-242634255463
Spec:
  Authorization URL:  https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/203517275
  Dns Name:           default.bk-1690-8db-a.opstrace.io
  Issuer Ref:
    Kind:  Issuer
    Name:  letsencrypt-staging
  Key:     XXX
  Solver:
    dns01:
      route53:
        Region:  us-west-2
        Role:    XXX
        Secret Access Key Secret Ref:
          Name:  
    Selector:
  Token:     XXX
  Type:      DNS-01
  URL:       https://acme-staging-v02.api.letsencrypt.org/acme/chall-v3/203517275/5JupMg
  Wildcard:  false
Status:
  Presented:   false
  Processing:  false
  Reason:      Error accepting authorization: acme: authorization error for default.bk-1690-8db-a.opstrace.io: 400 urn:ietf:params:acme:error:dns: During secondary validation: DNS problem: SERVFAIL looking up TXT for _acme-challenge.default.bk-1690-8db-a.opstrace.io - the domain's nameservers may be malfunctioning
  State:       invalid
Events:
  Type     Reason     Age   From          Message
  ----     ------     ----  ----          -------
  Normal   Started    57m   cert-manager  Challenge scheduled for processing
  Normal   Presented  57m   cert-manager  Presented challenge using DNS-01 challenge mechanism
  Warning  Failed     55m   cert-manager  Accepting challenge authorization failed: acme: authorization error for default.bk-1690-8db-a.opstrace.io: 400 urn:ietf:params:acme:error:dns: During secondary validation: DNS problem: SERVFAIL looking up TXT for _acme-challenge.default.bk-1690-8db-a.opstrace.io - the domain's nameservers may be malfunctioning

Notice that this is "SERVFAIL looking up TXT" rather than the "SERVFAIL looking up CAA" reported by @gmauleon in https://github.com/jetstack/cert-manager/issues/3594#issuecomment-773450922.
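Since the error says "During secondary validation", it may be worth checking whether the `_acme-challenge` TXT record is consistently resolvable from outside when this happens. A rough sketch (the hostname is taken from the challenge above; replace it with yours):

```shell
# Query the challenge TXT record against a public resolver; a SERVFAIL
# or empty answer here points at the zone's nameservers rather than at
# cert-manager itself.
dig +short TXT _acme-challenge.default.bk-1690-8db-a.opstrace.io @8.8.8.8

# Secondary-validation failures are often caused by inconsistent
# delegation, so also check the NS records for the zone.
dig +short NS bk-1690-8db-a.opstrace.io
```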

Name:         https-cert
Namespace:    ingress
Labels:       <none>
Annotations:  opstrace: owned
API Version:  cert-manager.io/v1
Kind:         Certificate
Metadata:
  Creation Timestamp:  2021-02-05T12:44:07Z
  Generation:          1
   ...
    Manager:         controller
    Operation:       Update
    Time:            2021-02-05T12:46:21Z
  Resource Version:  5040
  Self Link:         /apis/cert-manager.io/v1/namespaces/ingress/certificates/https-cert
  UID:               bfd8c512-a7c2-4ecf-9244-d4b489f3ca2d
Spec:
  Dns Names:
    bk-1690-8db-a.opstrace.io
    system.bk-1690-8db-a.opstrace.io
    *.system.bk-1690-8db-a.opstrace.io
    default.bk-1690-8db-a.opstrace.io
    *.default.bk-1690-8db-a.opstrace.io
  Issuer Ref:
    Kind:       Issuer
    Name:       letsencrypt-staging
  Secret Name:  https-cert
Status:
  Conditions:
    Last Transition Time:  2021-02-05T12:44:24Z
    Message:               Issuing certificate as Secret does not contain a private key
    Reason:                MissingData
    Status:                False
    Type:                  Ready
    Last Transition Time:  2021-02-05T12:46:21Z
    Message:               The certificate request has failed to complete and will be retried: Failed to wait for order resource "https-cert-9hnvz-1446266605" to become ready: order is in "invalid" state: 
    Reason:                Failed
    Status:                False
    Type:                  Issuing
  Last Failure Time:       2021-02-05T12:46:21Z
Events:
  Type     Reason     Age   From          Message
  ----     ------     ----  ----          -------
  Normal   Issuing    59m   cert-manager  Issuing certificate as Secret does not contain a private key
  Normal   Generated  59m   cert-manager  Stored new private key in temporary Secret resource "https-cert-gqf4b"
  Normal   Requested  59m   cert-manager  Created new CertificateRequest resource "https-cert-9hnvz"
  Warning  Failed     57m   cert-manager  The certificate request has failed to complete and will be retried: Failed to wait for order resource "https-cert-9hnvz-1446266605" to become ready: order is in "invalid" state: