gatekeeper: [Performance issue] validatingwebhookconfiguration too slow

I am having a performance issue with Gatekeeper.

Notes:

  • All constraints deployed are set to enforcementAction: dryrun

Observations:

  • A DaemonSet runs on a 50(ish)-node K8s cluster and holds a leader election every minute. As soon as I deploy Gatekeeper, I see errors like this: Failed to update lock: Timeout: request did not complete within allowed duration. Removing Gatekeeper and the accompanying validatingwebhookconfiguration resolves the issue.

  • I am seeing errors from our Spinnaker deployment pipelines, such as this: Deploy failed: Error from server (Timeout): error when applying patch: {"meta... .. .. .. plicas":'\x01']]} for: "STDIN": Timeout: request did not complete within allowed duration – again, removal of Gatekeeper resolves this issue.

  • Resource usage by Gatekeeper is minimal (it uses only about half of its resource requests):

NAME                              CPU(cores)   MEMORY(bytes)
gatekeeper-controller-manager-0   47m          132Mi

I do not really want to remove any resource types from the validatingwebhookconfiguration - I want everything to be validated, and as such whitelisting certain objects at that level is not ideal. I would much rather understand if there is any way in which the current model can be optimised so as not to take as long as it does.

  • Have other people observed this issue? If so, what has been done to mitigate?

  • Is it possible that all the constraints being in enforcementAction: dryrun is causing the gatekeeper-controller-manager to queue up requests and therefore causing timeouts at the k8s apiserver for resource create/updates…?

  • Are there any obvious things that I am doing wrong? I can provide as much information as needed - let me know!!

Thank you in advance

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 16 (12 by maintainers)

Most upvoted comments

I have the same problem on my cluster. As shown in the figure below, the admission webhook latency was about 2 seconds at the 99th percentile. Most requests come from Node status updates.

[Figure: gatekeeper_metrics]

When Gatekeeper is disabled, the latency stays within 10 ms.
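For readers comparing these numbers, here is a minimal sketch of how a 99th-percentile figure behaves, using the nearest-rank method. The latency samples are hypothetical and are not taken from this cluster; the metrics backend behind the figure may compute percentiles differently (for example from histogram buckets).

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Ceiling of p * n / 100 using integer arithmetic only.
    rank = -(-p * len(ordered) // 100)
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds: 98 fast requests and 2 slow ones.
latencies_ms = [8] * 98 + [2000] * 2
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(p50, p99)
```

With these made-up samples the median stays low while the 99th percentile is dominated by the two slow requests, which is why a small share of slow admission calls can still show up as multi-second tail latency.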

My cluster is …

  • My cluster has about 30 nodes.
  • Only one constraint template.
package networkPolicyOrder
operations = {"CREATE", "UPDATE"}

violation[{"msg": msg, "details": {"order": order}}] {
  operations[input.review.operation]
  matched := {ns | ns := input.parameters.systemNamespaces[i]; ns == input.review.namespace}
  count(matched) == 0
  order := input.review.object.spec.order
  order <= input.parameters.limitOrder
  msg := sprintf("cannot create/update non-system NetworkPolicy with order <= %v", [input.parameters.limitOrder])
}
  • Gatekeeper was deployed with the default YAML. The controller-manager has a limited CPU resource:
resources:
  limits:
    cpu: 100m
    memory: 512Mi
  requests:
    cpu: 100m
    memory: 256Mi
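
To make clear how little work the single Rego rule above does per request, here is a rough Python sketch of the same checks. It is only an illustration of the rule's logic, not how Gatekeeper evaluates policies; the function name and the example values are hypothetical.

```python
OPERATIONS = {"CREATE", "UPDATE"}

def violations(review, parameters):
    """Return violation dicts, mirroring the checks in the Rego rule."""
    # Only CREATE and UPDATE requests are checked.
    if review.get("operation") not in OPERATIONS:
        return []
    # Requests in a listed system namespace are exempt.
    if review.get("namespace") in parameters["systemNamespaces"]:
        return []
    # In Rego a missing spec.order leaves the rule undefined: no violation.
    order = review.get("object", {}).get("spec", {}).get("order")
    if order is None:
        return []
    limit = parameters["limitOrder"]
    if order <= limit:
        msg = "cannot create/update non-system NetworkPolicy with order <= %s" % limit
        return [{"msg": msg, "details": {"order": order}}]
    return []

# Hypothetical parameters and admission review, for illustration only.
example_params = {"systemNamespaces": ["kube-system"], "limitOrder": 100}
example_review = {
    "operation": "CREATE",
    "namespace": "default",
    "object": {"spec": {"order": 50}},
}
print(violations(example_review, example_params))
```

The rule is a handful of set lookups and one comparison, so the policy logic itself looks unlikely to account for seconds of latency.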