cert-manager: Controller can't handle hitting request rate limits of zerossl ACME API

Describe the bug:

We’ve been using cert-manager with zerossl as ACME provider using http01 challenges very successfully for several months now. However, since a couple of weeks ago, zerossl must have changed their ACME API: they have introduced a quite strict request rate limit. Whenever we issue a new certificate containing 3 or more domains using the http01 challenge, we run into 429 responses from their API, which completely breaks the certificate issuance flow. Note: the problem does not occur when issuing a cert containing <=2 domains.

Expected behaviour: The controller should respect 429 responses and try again later. In my case, retrying 2-3 seconds later would already solve the issue.

Steps to reproduce the bug: This is the certificate resource:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  annotations:
    service: tls-cert
  labels:
    service: tls-cert
  name: tls-cert
spec:
  dnsNames:
  - xxx
  - xxx
  - xxx
  - xxx
  - xxx
  - xxx
  - xxx
  - xxx
  issuerRef:
    group: cert-manager.io
    kind: ClusterIssuer
    name: zerossl
  secretName: tls-cert
  usages:
  - digital signature
  - key encipherment

And this is the ClusterIssuer resource:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: zerossl
spec:
  acme:
    externalAccountBinding:
      keyID: xxxxx
      keySecretRef:
        key: eab-hmac-key
        name: zerossl
    privateKeySecretRef:
      name: zerossl-account
    server: https://acme.zerossl.com/v2/DV90
    solvers:
    - http01:
        ingress:
          class: nginx

After applying the certificate to the cluster, the corresponding CertificateRequest, Order, and Challenge resources are created as expected. However, during processing of the challenges, the ACME client hits the request limit of the zerossl API:

# failed challenge status:
status:
  presented: false
  processing: false
  reason: 'Failed to retrieve Order resource: 429 : 429 Too Many Requests'
  state: errored

Once the first challenge fails, the error state is propagated to the Order and Certificate resource:

# Order status:
status:
  authorizations:
    ....
  failureTime: "2023-03-16T10:26:15Z"
  finalizeURL: https://acme.zerossl.com/v2/DV90/order/xxxxx/finalize
  reason: "Failed to retrieve Order resource: 429 : <html>\r\n<head><title>429 Too
    Many Requests</title></head>\r\n<body>\r\n<center><h1>429 Too Many Requests</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n"
  state: errored
  url: https://acme.zerossl.com/v2/DV90/order/xxxxx

# Certificate status:
status:
  conditions:
  - lastTransitionTime: "2023-03-16T10:26:08Z"
    message: Issuing certificate as Secret does not exist
    observedGeneration: 1
    reason: DoesNotExist
    status: "False"
    type: Ready
  - lastTransitionTime: "2023-03-16T10:26:15Z"
    message: "The certificate request has failed to complete and will be retried:
      Failed to wait for order resource \"tls-cert-twhmq-1698200363\" to become ready:
      order is in \"errored\" state: Failed to retrieve Order resource: 429 : <html>\r\n<head><title>429
      Too Many Requests</title></head>\r\n<body>\r\n<center><h1>429 Too Many Requests</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n"
    observedGeneration: 1
    reason: Failed
    status: "False"
    type: Issuing
  failedIssuanceAttempts: 1
  lastFailureTime: "2023-03-16T10:26:15Z"

Anything else we need to know?:

It seems that for every challenge, the order is retrieved from the ACME API. The more domains in the certificate, the more challenges are spawned, and thus the more requests are made to fetch the order object.

I see two technical issues here:

  • on receiving a 429 response, the controller should retry after a delay instead of giving up immediately (a rough sketch of this follows below)
  • in order to ease the pressure on the ACME API, the order response should be cached
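
To illustrate the first point, here is a rough sketch of backing off and retrying the order fetch on 429. It assumes the controller talks to the ACME server via golang.org/x/crypto/acme and uses a made-up helper name; this is not cert-manager's actual code.

package acmeutil // hypothetical package name, for illustration only

import (
	"context"
	"errors"
	"net/http"
	"time"

	"golang.org/x/crypto/acme"
)

// getOrderWithRetry fetches an ACME order and, instead of failing the
// Challenge/Order immediately, backs off and retries a few times when the
// server answers 429 Too Many Requests.
func getOrderWithRetry(ctx context.Context, client *acme.Client, orderURL string) (*acme.Order, error) {
	backoff := 2 * time.Second
	var lastErr error
	for attempt := 0; attempt < 5; attempt++ {
		order, err := client.GetOrder(ctx, orderURL)
		if err == nil {
			return order, nil
		}
		var acmeErr *acme.Error
		if !errors.As(err, &acmeErr) || acmeErr.StatusCode != http.StatusTooManyRequests {
			return nil, err // only 429s are worth retrying here
		}
		lastErr = err
		select {
		case <-time.After(backoff):
			backoff *= 2 // 2s, 4s, 8s, ...
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	return nil, lastErr
}

The upstream acme package also exposes a Client.RetryBackoff hook, which might be another place to add this behaviour, though I haven't checked how cert-manager configures its ACME client.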

I have informed the technical support of zerossl about this issue. Their suggestion was to throttle the requests and/or implement a retry.

Environment details:

  • Kubernetes version: 1.24
  • Cloud-provider/provisioner: GKE
  • cert-manager version: 1.11.0
  • Install method: helm 1.11.0

/kind bug

Most upvoted comments

@baszalmstra, here is an example of how it worked for me:

  1. Followed this guide to create EAB secret details through gcloud CLI locally: https://cloud.google.com/certificate-manager/docs/public-ca-tutorial
  2. Created secret resource:
apiVersion: v1
kind: Secret
metadata:
  name: gcp-cm-eabsecret
data:
  secret: {{ .Values.gcpCmEabsecret | b64enc }}
  3. Created ClusterIssuer resource:
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: gcp-cm
spec:
  acme:
    # Google Certificate Manager Public CA ACME server
    server: https://dv.acme-v02.api.pki.goog/directory
    email: <your_email>

    # name of a secret used to store the ACME account private key
    privateKeySecretRef:
      name: gcp-cm

    # for each cert-manager new EAB credentials are required
    externalAccountBinding:
      keyID: <your_key_id>
      keySecretRef:
        name: gcp-cm-eabsecret
        key: secret
      keyAlgorithm: HS256

    # ACME DNS-01 provider configurations to verify domain
    solvers:
      - dns01:
          cloudDNS:
            project: <your_project_id>

Update from zerossl support: They are looking deeper into this. It might be a technical issue on their side, after all.

@sgsollie That’s pretty cool! Would you be able to share how you set that up? What does your issuer configuration look like?

I’ve just published an article showing how we set up cert-manager to use Google’s Public CA. https://www.uffizzi.com/blog/ditching-zerossl-for-google-public-certificate-authority-for-ssl-certificates-via-cert-manager-and-acme-protocol

cc @baszalmstra

I’ve been writing back and forth with the zerossl support. Unfortunately, it looks like they are not interested in understanding the problem or in helping me with the issue.

They keep telling me to take a look at their documentation. The only related information given there is not specific, so it doesn’t help at all:

Configure your scripts and clients to use our free of charge ACME API in a meaningful way. We want to provide a reliable and stable service to all our customers, malicious users can be limited or even blocked.

I’ve been specifically asking for more information about rate limiting.

The gist of their answer, from oldest to newest:

Regarding this, there are no rate limits for your account per se but there are some limits we have to adhere to on our end, to prevent flooding / too many requests. In this case, I’d advise staggering your requests over a longer period of time as well as retrying these if they keep falling when they fail after 15 minutes or so.

[…] as the limit is for the whole endpoint, you may get these [429 responses] from time to time when our service is under particularly heavy load. Retrying is the only way to go for now. I am discussing the matter with our developers so I’ll let you know if we have any news from our side.

Some news on the matter - we are investigating it more closely with our developers. It seems that there might be a problem we were not aware of on our side as well.

We have looked into the issue with our developers. The HTTP 429 code is a response from our infrastructure, indicating that our ACME endpoint is receiving more requests than we can currently process. The limit for the whole endpoint is variable and we adjust it periodically, to keep up with demand. As the ACME API is provided free of charge - some users unfortunately abuse the endpoint in order to issue huge amounts of certificates, which unfortunately has an effect on legitimate users such as you. Abusive users are of course blocked to free up capacity. In practical terms, the way to deal with this is to retry later. If you are having issues over an extended period of time (Few days or a week) and have retried it lots of times without success, please let us know.

At this point, I have given up on the zerossl support. I don’t think they will fix the issue on their side. It’s a shame though: zerossl is otherwise a perfect match for cert-manager. I wanted it to be the go-to provider whenever the cert rate limits of letsencrypt don’t suffice. In case anyone knows a viable alternative to zerossl, please let me know.

I’m wondering whether using DNS challenges instead of HTTP challenges would help. Does anyone know if using DNS challenges would send fewer requests to the order endpoint?

I suspect that zerossl is using an overall rate limiter, because I have sometimes observed 429 Too Many Requests even when there were only 1-2 certificate requests. If true, it isn’t fair to everyone, because some users can request a lot of certificates while others are unable to request any.

Just adding a data point: we migrated away from ZeroSSL for this very reason. We’re on Google Cloud, so we’re now using the GCP Public CA with the ACME issuer and have had no problems since: https://cloud.google.com/certificate-manager/docs/public-ca

Thank you for looking into the issue.

I didn’t know the --max-concurrent-challenges flag existed. This sounds like it could really help.

I have deployed cert-manager with --max-concurrent-challenges=1. The first 6 challenges succeed. However, the last two fail due to 429.

At least this time, the order is not in the failed state but still pending. This makes it easier to ‘reset’ the challenges. I’ve written a workaround script which sets the status of the challenges back to pending; cert-manager then picks up processing the challenges again. In case someone has use for this:

#!/bin/bash
set -eu

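# Collect the name/namespace of every Challenge that errored with a 429 response.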
failed_challenges=$(
    kubectl get challenge \
        --all-namespaces \
        -o json \
        | jq '.items
            | map(select(.status.state == "errored" and (.status.reason | contains("429"))))
            | map({ name: .metadata.name, namespace: .metadata.namespace })
            | .[]
            ' -c
    )

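# Patch each one back to 'pending' so cert-manager picks up processing again.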
for challenge in $failed_challenges; do
    echo fixing up invalid challenge: $challenge
    name=$(echo "$challenge" | jq '.name' -r)
    namespace=$(echo "$challenge" | jq '.namespace' -r)
    kubectl patch challenge \
        -n "$namespace" \
        --type merge \
        --subresource status \
        --patch 'status: {state: pending}' \
        $name
done

whether some of the calls could be reduced

We might be able to cut down the number of HTTP01ChallengeResponse calls to be the same as the number of required authorizations (basically the number of DNS names) for the success path. I’ll give that a go; that’s not guaranteed to make it work with ZeroSSL though.

Thank you for your response. I am well aware of the letsencrypt rate limits. They are unsuitable for our use case, hence we moved to zerossl. In their basic plan (which we use), the number of certificates is not limited in any way.

In my experience this should mean something along the lines of ‘too many certificate requests (have been created)’, not ‘the existing order has been requested too many times’.

I have reached out to the zerossl support, they have confirmed that a) our account is by no means limited in terms of certificates b) they have implemented general rate limiting on their API.

Therefore I am 100% sure the error message is not about the number of certificates in general, but about the number of requests being sent per second. FYI, as mentioned earlier, I have no problem issuing new certs with <=2 domains, even directly after hitting the rate limit with a certificate that has >2 domains.

I have played around with the failed resources and manually reset their .status.state field to pending. Directly afterwards, the challenge completed successfully.

I have invested some time in building a workaround using shell-operator to reset the .status.state fields automatically whenever a challenge has errored and the message contains 429. While this works most of the time, monkey-patching cert-manager’s resources seems like bad practice, and I failed to get it working smoothly since cert-manager should obviously be the only controller altering the resource state. However, I conclude from this experiment that my assumption is right: if cert-manager would simply retry fetching the order resource, issue the requests in a slightly staggered fashion, or use a cached response, the problem would be solved. The new request rate limit in the zerossl API seems to be set up in a way that blocks spikes of requests to the same resource over a short period of time (i.e., a couple of seconds).