cert-manager: Cert-manager causes API server panic on clusters with more than 20000 secrets.

📢 UPDATE 2023-11-02: Please read the memory scalability section of the Best Practice documentation, which now explains how to configure cainjector to watch Secret resources only in the cert-manager namespace. This should resolve some of the problems described in this issue.

Describe the bug: On clusters with more than 20000 secrets this becomes a problem. The query that cert-manager issues is not optimal: /api/v1/secrets?limit=500&resourceVersion=0

resourceVersion=0 makes the API server serve the list from its watch cache, so limit=500 is not taken into account and all secrets are always returned in one response. This makes cert-manager unscalable for large deployments, since Secrets are used for far more than certificates.
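
For context, here is a minimal sketch (assuming a standard client-go clientset, which is what cert-manager's informers are built on) of the list call behind that URL. The key point is that a list with resourceVersion=0 is answered from the API server's watch cache, which ignores the limit and returns the full list at once:

```go
// Minimal sketch of the cluster-wide Secret list that an informer issues on
// startup. Assumes in-cluster credentials; this is not cert-manager's actual code.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// Equivalent to GET /api/v1/secrets?limit=500&resourceVersion=0.
	// With ResourceVersion "0" the list is served from the watch cache,
	// so the Limit is not honoured and every Secret comes back at once.
	secrets, err := clientset.CoreV1().Secrets("").List(context.TODO(), metav1.ListOptions{
		Limit:           500,
		ResourceVersion: "0",
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("received %d secrets in a single response\n", len(secrets.Items))
}
```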

As mentioned in kubernetes/kubernetes#56278 and https://kubernetes.io/docs/reference/using-api/api-concepts/

I suggest removing resourceVersion=0 from the query, which should make it much faster.
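
For illustration, a hedged sketch of the paginated alternative with client-go: when resourceVersion is left unset, the API server honours limit and returns a continue token, so the secrets can be fetched 500 at a time:

```go
// Sketch of paginated listing via continue tokens; not cert-manager's code.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func listSecretsPaged(clientset kubernetes.Interface) error {
	opts := metav1.ListOptions{Limit: 500} // no ResourceVersion, so pagination is honoured
	for {
		page, err := clientset.CoreV1().Secrets("").List(context.TODO(), opts)
		if err != nil {
			return err
		}
		fmt.Printf("got a page of %d secrets\n", len(page.Items))
		if page.Continue == "" {
			return nil // last page reached
		}
		opts.Continue = page.Continue
	}
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	if err := listSecretsPaged(kubernetes.NewForConfigOrDie(cfg)); err != nil {
		panic(err)
	}
}
```

In the Kubernetes versions mentioned here, paginated lists are served from etcd rather than the watch cache, so this trades one huge response for several bounded ones.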

Furthermore, cert-manager retries those queries without waiting for the previous ones to complete, so they pile up and cause significant load and even crashes on the API server. cert-manager effectively DDoSes the API server.
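
As a general mitigation sketch (a plain client-go knob, not something this issue confirms cert-manager exposes), the client-side rate limiter can be tightened so retried LISTs cannot flood the API server:

```go
// Hedged sketch: bound the client's request rate with QPS/Burst so retries
// queue up client-side instead of piling onto the API server.
package main

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cfg.QPS = 5    // sustained client-side requests per second
	cfg.Burst = 10 // short-term burst ceiling
	clientset := kubernetes.NewForConfigOrDie(cfg)
	_ = clientset // use this rate-limited client for all subsequent API calls
}
```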

We’re hitting the same issue with:

  • quay.io/jetstack/cert-manager-cainjector:v0.11.0
  • quay.io/jetstack/cert-manager-controller:v0.11.0
  • quay.io/jetstack/cert-manager-cainjector:v1.1.0
  • quay.io/jetstack/cert-manager-controller:v1.1.0

Logs from the API server:


E0115 18:27:27.893242       1 runtime.go:78] Observed a panic: &errors.errorString{s:"killing connection/stream because serving request timed out and response had been started"} (killing connection/stream because serving request timed out and response had been started)
goroutine 79221267 [running]:
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic(0x3b1fda0, 0xc0001c6650)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0xc0feb65c90, 0x1, 0x1)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x82
panic(0x3b1fda0, 0xc0001c6650)
        /usr/local/go/src/runtime/panic.go:679 +0x1b2
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*baseTimeoutWriter).timeout(0xc08dc08740, 0xc09ea59b80)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:257 +0x1cf
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP(0xc019eb1960, 0x4edf040, 0xc0a1206af0, 0xc07749d900)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:141 +0x310
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.WithWaitGroup.func1(0x4edf040, 0xc0a1206af0, 0xc07749d800)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/waitgroup.go:47 +0x10f
net/http.HandlerFunc.ServeHTTP(0xc0434bf3e0, 0x4edf040, 0xc0a1206af0, 0xc07749d800)
        /usr/local/go/src/net/http/server.go:2007 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithRequestInfo.func1(0x4edf040, 0xc0a1206af0, 0xc07749d600)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters/requestinfo.go:39 +0x274
net/http.HandlerFunc.ServeHTTP(0xc0434bf470, 0x4edf040, 0xc0a1206af0, 0xc07749d600)
        /usr/local/go/src/net/http/server.go:2007 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithCacheControl.func1(0x4edf040, 0xc0a1206af0, 0xc07749d600)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters/cachecontrol.go:31 +0xa8
net/http.HandlerFunc.ServeHTTP(0xc019eb1a20, 0x4edf040, 0xc0a1206af0, 0xc07749d600)
        /usr/local/go/src/net/http/server.go:2007 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.WithLogging.func1(0x4ed2980, 0xc0c4b6b240, 0xc07749d500)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:89 +0x2ca
net/http.HandlerFunc.ServeHTTP(0xc019eb1a40, 0x4ed2980, 0xc0c4b6b240, 0xc07749d500)
        /usr/local/go/src/net/http/server.go:2007 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.withPanicRecovery.func1(0x4ed2980, 0xc0c4b6b240, 0xc07749d500)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/wrap.go:51 +0x13e
net/http.HandlerFunc.ServeHTTP(0xc019eb1a60, 0x4ed2980, 0xc0c4b6b240, 0xc07749d500)
        /usr/local/go/src/net/http/server.go:2007 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server.(*APIServerHandler).ServeHTTP(0xc0434bf4a0, 0x4ed2980, 0xc0c4b6b240, 0xc07749d500)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/handler.go:189 +0x51
net/http.serverHandler.ServeHTTP(0xc009896a80, 0x4ed2980, 0xc0c4b6b240, 0xc07749d500)
        /usr/local/go/src/net/http/server.go:2802 +0xa4
net/http.initNPNRequest.ServeHTTP(0x4eeb300, 0xc06df08a50, 0xc07a0df180, 0xc009896a80, 0x4ed2980, 0xc0c4b6b240, 0xc07749d500)
        /usr/local/go/src/net/http/server.go:3366 +0x8d
k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*serverConn).runHandler(0xc094106480, 0xc0c4b6b240, 0xc07749d500, 0xc08dc08340)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/server.go:2149 +0x9f
created by k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*serverConn).processHeaders
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/server.go:1883 +0x4eb
E0115 18:27:27.893364       1 wrap.go:39] apiserver panic'd on GET /api/v1/secrets?limit=500&resourceVersion=0
I0115 18:27:27.893567       1 log.go:172] http2: panic serving 10.148.0.16:53202: killing connection/stream because serving request timed out and response had been started
goroutine 79221267 [running]:
k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*serverConn).runHandler.func1(0xc0c4b6b240, 0xc0feb65f67, 0xc094106480)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/server.go:2142 +0x16b
panic(0x3b1fda0, 0xc0001c6650)
        /usr/local/go/src/runtime/panic.go:679 +0x1b2
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0xc0feb65c90, 0x1, 0x1)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x105
panic(0x3b1fda0, 0xc0001c6650)
        /usr/local/go/src/runtime/panic.go:679 +0x1b2
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*baseTimeoutWriter).timeout(0xc08dc08740, 0xc09ea59b80)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:257 +0x1cf
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*time

Logs from ETCD:

...
Dec 11 09:21:08 ow-prod-k8s-master01 etcd[6830]: 2020-12-11 08:21:08.106948 W | etcdserver: failed to send out heartbeat on time (exceeded the 250ms timeout for 5.348150525s)
Dec 11 09:21:08 ow-prod-k8s-master01 etcd[6830]: 2020-12-11 08:21:08.106954 W | etcdserver: server is likely overloaded
after that, slow but eventually successful requests ...
Dec 11 09:23:26 ow-prod-k8s-master01 etcd[6830]: 2020-12-11 08:23:26.433315 W | etcdserver: read-only range request "key:\"/registry/persistentvolumes/pvc-f31decea-7a39-4d11-bbbf-8eb45f433239\" " with result "range_response_count:1 size:1017" took too long (13.750148565s) to execute

Logs from cert-manager:

E0203 15:18:34.063192       1 wrap.go:39] apiserver panic'd on GET /api/v1/secrets?limit=500&resourceVersion=0

E0203 15:18:33.969252       1 reflector.go:123] external/io_k8s_client_go/tools/cache/reflector.go:96: Failed to list *v1.Secret: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 37511; INTERNAL_ERROR

Expected behaviour: cert-manager should not issue heavy queries that list all secrets across all namespaces, but should instead work per namespace.
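
As an illustration of the per-namespace approach with client-go (the namespace name below is only an example), a shared informer factory scoped to one namespace lists and watches Secrets there instead of cluster-wide:

```go
// Sketch of a namespace-scoped Secret informer; the "cert-manager" namespace
// is illustrative, not a statement about how cert-manager is configured.
package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// Only Secrets in the chosen namespace are listed and watched.
	factory := informers.NewSharedInformerFactoryWithOptions(
		clientset,
		10*time.Hour, // resync period
		informers.WithNamespace("cert-manager"),
	)
	secretInformer := factory.Core().V1().Secrets().Informer()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	_ = secretInformer // event handlers would be registered on secretInformer
}
```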

Steps to reproduce the bug:

Generate 15000 secrets (they do not need to be TLS certificates; any Secret will do), then watch the API server load and the cert-manager logs.
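
For convenience, a hedged reproduction helper in Go (the namespace and naming are illustrative, and the target namespace is assumed to already exist):

```go
// Create 15000 dummy Opaque Secrets so the cluster-wide Secret list becomes
// expensive. Assumes a kubeconfig at the default location and an existing
// "secret-load-test" namespace; both are illustrative choices.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	for i := 0; i < 15000; i++ {
		secret := &corev1.Secret{
			ObjectMeta: metav1.ObjectMeta{
				Name:      fmt.Sprintf("load-test-%05d", i),
				Namespace: "secret-load-test",
			},
			Type: corev1.SecretTypeOpaque,
			Data: map[string][]byte{"payload": []byte("x")},
		}
		if _, err := clientset.CoreV1().Secrets("secret-load-test").Create(
			context.TODO(), secret, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
	}
}
```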

Anything else we need to know?:

Environment details:

  • Kubernetes version: Kubernetes v1.16.13
  • Cloud-provider/provisioner: Vanilla K8s
  • cert-manager version: v1.1.0
  • Install method: helm (with CRDs applied before that)

/kind bug

About this issue

  • Original URL
  • State: open
  • Created 3 years ago
  • Reactions: 11
  • Comments: 28 (10 by maintainers)

Most upvoted comments

On a cluster where we’ve only installed the CRDs and there are no certificates actually managed by cert-manager, the controller still makes a call for all secrets; on that cluster we have about 130k secrets. Here’s the log from the controller:

W0528 09:54:32.599392       1 client_config.go:608] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0528 09:54:32.600405       1 controller.go:171] cert-manager/controller/build-context "msg"="configured acme dns01 nameservers" "nameservers"=["192.168.0.2:53"]
I0528 09:54:32.600928       1 controller.go:72] cert-manager/controller "msg"="enabled controllers: [certificaterequests-approver certificaterequests-issuer-acme certificaterequests-issuer-ca certificaterequests-issuer-selfsigned certificaterequests-issuer-vault certificaterequests-issuer-venafi certificates-issuing certificates-key-manager certificates-metrics certificates-readiness certificates-request-manager certificates-revision-manager certificates-trigger challenges clusterissuers ingress-shim issuers orders]"
I0528 09:54:32.601253       1 controller.go:131] cert-manager/controller "msg"="starting leader election"
I0528 09:54:32.601454       1 metrics.go:166] cert-manager/controller/build-context/metrics "msg"="listening for connections on" "address"={"IP":"::","Port":9402,"Zone":""}
I0528 09:54:32.601794       1 leaderelection.go:243] attempting to acquire leader lease  kube-system/cert-manager-controller...
I0528 09:55:37.453032       1 leaderelection.go:253] successfully acquired lease kube-system/cert-manager-controller
I0528 09:55:37.453538       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="certificaterequests-issuer-ca"
I0528 09:55:37.453606       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="certificates-key-manager"
I0528 09:55:37.453656       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="certificates-revision-manager"
I0528 09:55:37.453668       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="orders"
I0528 09:55:37.453695       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="ingress-shim"
I0528 09:55:37.453763       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="certificates-request-manager"
I0528 09:55:37.453774       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="certificaterequests-approver"
I0528 09:55:37.453878       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="clusterissuers"
I0528 09:55:37.453894       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="certificaterequests-issuer-acme"
I0528 09:55:37.453918       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="certificaterequests-issuer-vault"
I0528 09:55:37.453935       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="certificates-metrics"
I0528 09:55:37.453980       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="certificates-readiness"
I0528 09:55:37.454010       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="issuers"
I0528 09:55:37.454038       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="certificaterequests-issuer-selfsigned"
I0528 09:55:37.454097       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="certificaterequests-issuer-venafi"
I0528 09:55:37.454115       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="certificates-issuing"
I0528 09:55:37.454163       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="certificates-trigger"
I0528 09:55:37.454975       1 reflector.go:207] Starting reflector *v1.Secret (5m0s) from external/io_k8s_client_go/tools/cache/reflector.go:156
I0528 09:56:46.968845       1 trace.go:205] Trace[1041222873]: "Reflector ListAndWatch" name:external/io_k8s_client_go/tools/cache/reflector.go:156 (28-May-2021 09:55:37.454) (total time: 69513ms):
Trace[1041222873]: ---"Objects listed" 69132ms (09:56:00.587)
Trace[1041222873]: [1m9.513751159s] [1m9.513751159s] END
I0528 09:56:47.655791       1 reflector.go:207] Starting reflector *v1beta1.Ingress (10h0m0s) from external/io_k8s_client_go/tools/cache/reflector.go:156
I0528 09:56:47.655818       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="challenges"
I0528 09:56:47.655860       1 reflector.go:207] Starting reflector *v1.Certificate (10h0m0s) from external/io_k8s_client_go/tools/cache/reflector.go:156
I0528 09:56:47.655790       1 reflector.go:207] Starting reflector *v1.ClusterIssuer (10h0m0s) from external/io_k8s_client_go/tools/cache/reflector.go:156
I0528 09:56:47.655918       1 reflector.go:207] Starting reflector *v1.Service (10h0m0s) from external/io_k8s_client_go/tools/cache/reflector.go:156
I0528 09:56:47.655935       1 reflector.go:207] Starting reflector *v1.Secret (10h0m0s) from external/io_k8s_client_go/tools/cache/reflector.go:156
I0528 09:56:47.655955       1 reflector.go:207] Starting reflector *v1.Challenge (10h0m0s) from external/io_k8s_client_go/tools/cache/reflector.go:156
I0528 09:56:47.655971       1 reflector.go:207] Starting reflector *v1.Pod (10h0m0s) from external/io_k8s_client_go/tools/cache/reflector.go:156
I0528 09:56:47.655860       1 reflector.go:207] Starting reflector *v1.Order (10h0m0s) from external/io_k8s_client_go/tools/cache/reflector.go:156
I0528 09:56:47.656192       1 reflector.go:207] Starting reflector *v1.CertificateRequest (10h0m0s) from external/io_k8s_client_go/tools/cache/reflector.go:156
I0528 09:56:47.656238       1 reflector.go:207] Starting reflector *v1.Issuer (10h0m0s) from external/io_k8s_client_go/tools/cache/reflector.go:156
W0528 09:56:47.661623       1 warnings.go:67] networking.k8s.io/v1beta1 Ingress is deprecated in v1.19+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress
W0528 09:56:47.665090       1 warnings.go:67] networking.k8s.io/v1beta1 Ingress is deprecated in v1.19+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress
I0528 09:58:04.063808       1 trace.go:205] Trace[1899623133]: "Reflector ListAndWatch" name:external/io_k8s_client_go/tools/cache/reflector.go:156 (28-May-2021 09:56:47.655) (total time: 76407ms):
Trace[1899623133]: ---"Objects listed" 76019ms (09:58:00.674)
Trace[1899623133]: [1m16.407768851s] [1m16.407768851s] END
W0528 10:02:23.667984       1 warnings.go:67] networking.k8s.io/v1beta1 Ingress is deprecated in v1.19+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress

Later edit: I want to be clear that in my case the API server doesn’t crash; however, the fact that cert-manager makes this call at all is problematic. We’re using k8s 1.19 and cert-manager 1.3.1.

@wallrj No, that is not applicable in our case, as we use cainjector to inject things for webhooks in multiple namespaces.

⚠️️ This optimization is only appropriate if cainjector is being used exclusively for the cert-manager webhook. It is not appropriate if cainjector is also being used to manage the TLS certificates for webhooks of other software. For example, some Kubebuilder-derived projects may depend on cainjector to inject TLS certificates for their webhooks.