gatekeeper: [Performance issue] validatingwebhookconfiguration too slow
I am having a performance issue with Gatekeeper.
Notes:
- All constraints deployed are set to enforcementAction: dryrun
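For readers unfamiliar with the setting, a constraint in dry-run mode looks roughly like the sketch below. The kind K8sRequiredLabels, the constraint name, and the parameters are illustrative assumptions (they presuppose a matching ConstraintTemplate is installed) and are not the constraints deployed on this cluster.

```yaml
# Hypothetical constraint, for illustration only.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels          # assumes a ConstraintTemplate of this kind exists
metadata:
  name: ns-must-have-owner       # made-up name
spec:
  enforcementAction: dryrun      # violations are reported rather than rejected
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Namespace"]
  parameters:
    labels: ["owner"]            # made-up parameter value
```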
Observations:
- We have a daemonset running on a 50(ish)-node K8s cluster that holds a leadership election every 1 minute. I am seeing this sort of error as soon as I deploy Gatekeeper: "Failed to update lock: Timeout: request did not complete within allowed duration". Removal of Gatekeeper and the accompanying validatingwebhookconfiguration resolves this issue.
- I am seeing errors from our Spinnaker deployment pipelines, such as this: Deploy failed: Error from server (Timeout): error when applying patch: {"meta... .. .. .. plicas":'\x01']]} for: "STDIN": Timeout: request did not complete within allowed duration. Again, removal of Gatekeeper resolves this issue.
- Resource usage by Gatekeeper is minimal (only using about half of its requests):
NAME CPU(cores) MEMORY(bytes)
gatekeeper-controller-manager-0 47m 132Mi
I do not really want to remove any resource types from the validatingwebhookconfiguration - I want everything to be validated, and as such whitelisting certain objects at that level is not ideal. I would much rather understand if there is any way in which the current model can be optimised so as not to take as long as it does.
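For context, the webhook registration has two settings that bound how long the API server waits on Gatekeeper without removing any resource types from the rules: timeoutSeconds and failurePolicy. The sketch below is not taken from this cluster. The object name, service name, namespace, path, and the 3-second value are assumptions, and clusters older than Kubernetes 1.16 would use the v1beta1 version of this API instead.

```yaml
# Sketch under stated assumptions; names and values are illustrative.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: gatekeeper-validating-webhook-configuration   # assumed name
webhooks:
  - name: validation.gatekeeper.sh                     # assumed webhook name
    admissionReviewVersions: ["v1", "v1beta1"]
    sideEffects: None
    clientConfig:
      service:
        name: gatekeeper-webhook-service               # assumed service
        namespace: gatekeeper-system                   # assumed namespace
        path: /v1/admit                                # assumed path
    rules:
      - apiGroups: ["*"]           # everything is still sent for validation
        apiVersions: ["*"]
        operations: ["CREATE", "UPDATE"]
        resources: ["*"]
    timeoutSeconds: 3              # assumed value; the API allows 1 to 30
    failurePolicy: Ignore          # a timed-out call admits the request
```

With every constraint in dryrun nothing is being denied anyway, so failurePolicy: Ignore would not weaken enforcement in this setup, although slow calls would still add up to the timeout to each request.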
- Have other people observed this issue? If so, what has been done to mitigate it?
- Is it possible that all the constraints being in enforcementAction: dryrun is causing the gatekeeper-controller-manager to queue up requests, and therefore causing timeouts at the k8s apiserver for resource creates/updates?
- Are there any obvious things that I am doing wrong? I can provide as much information as needed, let me know!
Thank you in advance
About this issue
- State: closed
- Created 5 years ago
- Comments: 16 (12 by maintainers)
I have the same problem on my cluster. As shown in the figure below, the admission webhook latency was about 2 seconds at the 99th percentile. Most requests come from Node status updates.
When Gatekeeper is disabled, the latency is within 10 ms.
My cluster is …