cert-manager: Controller fails to process new certs when there are a large number of pending ones

This issue is a duplicate of https://github.com/jetstack/cert-manager/issues/3772. I thought it would be more appropriate to present this as a bug, since my concern is not to adjust the control loop size but rather to prevent any certificate from getting stuck in the issuing process.

Describe the bug: In one of my clusters, I have about 60 challenge objects that remain in the pending state because their DNS records are (knowingly) incorrect. The corresponding cm-acme-http-solver- pods and ingresses are created and left hanging. Once the number of such challenge objects grows large enough (I don’t have an exact threshold), newly created certificates run into the following error.

E0315 12:33:57.042284       1 controller.go:158] cert-manager/controller/CertificateKeyManager "msg"="re-queuing item due to error processing" "error"="Operation cannot be fulfilled on certificates.cert-manager.io \"INGRESS_NAME-tls\": the object has been modified; please apply your changes to the latest version and try again" "key"="NAMESPACE/INGRESS_NAME-tls" 

The issue is immediately resolved when I remove just a few of the ingresses causing pending challenges.
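For reference, something along these lines can be used to spot the stuck challenges and apply that workaround; the ingress name and namespace below are placeholders, not values from the report:

# List challenges in all namespaces and look for those stuck in the pending state
kubectl get challenges --all-namespaces

# Remove one of the ingresses whose host has a broken DNS record
kubectl delete ingress INGRESS_NAME -n NAMESPACE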

Expected behaviour: Certificates that would normally be issued should not get stuck simply because of the sheer number of pending challenges.

Steps to reproduce the bug:

  1. Create enough ingress objects whose hosts point at incorrect DNS records so that their corresponding challenge objects remain pending (an example manifest is sketched after these steps).
  2. Create a normal ingress.
  3. Watch it fail to be processed by the control loop.
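To illustrate step 1, a minimal sketch of such an ingress follows. The hostname, ClusterIssuer name, ingress class, backend service, and namespace are placeholders rather than values from the original report, and the apiVersion assumes a pre-1.19 cluster like the one reported:

cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: broken-dns-example
  namespace: default
  annotations:
    kubernetes.io/ingress.class: nginx                # assumes an nginx ingress controller
    cert-manager.io/cluster-issuer: letsencrypt-prod  # hypothetical ClusterIssuer name
spec:
  tls:
  - hosts:
    - no-such-record.example.com                      # host without a valid DNS record
    secretName: broken-dns-example-tls
  rules:
  - host: no-such-record.example.com
    http:
      paths:
      - path: /
        backend:
          serviceName: some-service                   # placeholder backend service
          servicePort: 80
EOF

Creating several dozen of these (with distinct hostnames and secret names) reproduces the pile-up of pending challenges described above.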

Environment details:

  • Kubernetes version: 1.18
  • cert-manager version: 1.1.0
  • Install method: helm chart v1.1.0

/kind bug

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments: 27 (7 by maintainers)

Most upvoted comments

I think we could look into expiring challenges that still haven’t succeeded after a certain period of time.

A short-term solution would be to document this potential pitfall.

@maelvls Any news regarding this issue?

I’ll take a look tomorrow morning.

/assign

Update 10 Feb 2022: I haven’t been able to investigate yet.