cert-manager: High memory and CPU consumption in cert-manager-cainjector

Describe the bug: After upgrading cert-manager from version 1.11.0 to version 1.12.2, we noticed very high memory and CPU consumption in the cert-manager-cainjector pod that keeps rising slowly until it reaches the node limit. After restarting the deployment, memory and CPU drop, but they slowly start to rise again. You can see the details in the graph below:

Historic data of the deployment: (resource-usage graph attached)

Current Memory and CPU:

cert-manager-xxxxx-xxxxx              1356m   1398Mi
cert-manager-cainjector-xxxxx-xxxxx   9767m   31189Mi
cert-manager-webhook-xxxxx-xxxxx      1m      13Mi
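
For reference, a per-pod snapshot like the one above can be collected with kubectl top (a minimal sketch; it assumes metrics-server is available and that the release runs in the cert-manager namespace):

      # CPU/memory usage per pod in the cert-manager namespace
      kubectl top pods -n cert-manager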

Expected behaviour: Memory consumption of the cert-manager-cainjector pod should be no more than ~350MB and CPU no more than 0.002 cores.

Steps to reproduce the bug: Upgrade from version 1.11.0 to version 1.12.2.

Anything else we need to know?: We use the default values from the Helm chart found here: https://artifacthub.io/packages/helm/cert-manager/cert-manager
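
For completeness, a minimal upgrade sketch with chart defaults (assuming the standard jetstack Helm repository and a release named cert-manager in the cert-manager namespace; adjust for a helmfile-driven install):

      # Add/refresh the chart repository
      helm repo add jetstack https://charts.jetstack.io
      helm repo update
      # Upgrade the existing release from v1.11.0 to v1.12.2 using default values
      helm upgrade cert-manager jetstack/cert-manager \
        --namespace cert-manager \
        --version v1.12.2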

Environment details (Production):

  • Kubernetes version: 1.26.5-gke.1200
  • Cloud-provider/provisioner: GKE
  • cert-manager version: v1.12.2
  • Install method: helm/helmfile

/kind bug

About this issue

  • State: closed
  • Created a year ago
  • Comments: 24 (10 by maintainers)

Most upvoted comments

Same here, no sharp ramps with v1.12.3.

We deployed v1.12.3 and it seems to be much better, thanks for the fix!

(Screenshots from 2023-07-26 attached.)

Good results here too on AWS EKS v1.24.16 with a few thousand certificates. We re-upgraded from 1.11.4 (after having downgraded from 1.12.2 back to 1.11.4).

Thanks for the fix!

Updating from v1.12.0 to v1.12.3 fixed the CPU/Memory usage spike for me on Digital Ocean Kubernetes v1.27.4-do.0

@zeeZ do you have memory profiles too?

I let it run for about an hour, until it ramped up to consuming about one CPU core, with the following args; here's the result:

      --leader-election-namespace=cert-manager
      --enable-profiling=true
      --profiler-address=:8081
      --leader-elect=false
      --enable-certificates-data-source=false
      --enable-customresourcedefinitions-injectable=false
      --enable-apiservices-injectable=false

Attached profiles and logs:

  • cainjector.log
  • pprof.cainjector.goroutine.001.pb.gz
  • pprof.cainjector.samples.cpu.001.pb.gz
  • pprof.cainjector.threadcreate.001.pb.gz
  • pprof.cainjector.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz

The secret total in the logs seems a bit wild. There are only 294 secrets with a total of 577 keys in the cluster.
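
(For anyone wanting to reproduce the capture above: with --profiler-address=:8081 set as in those args, the standard Go pprof endpoints should be reachable on that port. A rough sketch, assuming the cert-manager namespace and the default deployment name:)

      # Forward the profiler port from the cainjector deployment
      kubectl port-forward -n cert-manager deploy/cert-manager-cainjector 8081:8081

      # Grab a 30-second CPU profile and a heap snapshot via the standard pprof paths
      go tool pprof "http://localhost:8081/debug/pprof/profile?seconds=30"
      go tool pprof "http://localhost:8081/debug/pprof/heap"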

Based on @zeeZ's pprof dumps, this might be related to the logging issue that was reported here: #6104

No JSON logging, but klog is definitely up there. I'll see if I can run the linked fix in the cluster tomorrow.

I have four different clusters, and only one of them shows this behavior. They’re all running version 1.12.2 of the helm chart with identical configuration.

Luckily the one with the CPU ramp is a dev cluster, so I have access to the profiler; I just don't know how to use it 🙃

This is a 30 minute old cainjector consuming one cpu core: pprof.cainjector.samples.cpu.pb.gz

Edit: after 3 hours, up to 500% cpu usage: pprof.cainjector.samples.cpu_3h.pb.gz
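
(Side note for anyone who wants to inspect the attached dumps: they can be opened locally with go tool pprof, for example in the interactive web UI. A sketch, using the file names as attached above:)

      # Interactive flame-graph / top view of the attached CPU profile
      go tool pprof -http=:8080 pprof.cainjector.samples.cpu.pb.gz

      # Or a quick text summary of the hottest functions
      go tool pprof -top pprof.cainjector.samples.cpu.pb.gz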

@lboix For starters we kept it as is without rolling back, and we rollout-restarted the cert-manager-cainjector deployment periodically. However, depending on the size of the cluster, the leak sometimes grew quickly. So we have now rolled back to v1.11.4, and it seems that there is no leak in that version.
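
(For reference, the periodic restart workaround mentioned above is just the standard rollout restart; a sketch, assuming the cert-manager namespace:)

      # Restart the cainjector deployment to reclaim memory/CPU until a fixed version is deployed
      kubectl rollout restart deployment cert-manager-cainjector -n cert-manager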