gatekeeper: eval_cancel_error: caller cancelled query execution
What steps did you take and what happened: In a cluster with a large number of pods (10k+) and 40 constraint templates, we are seeing the following error in the webhook pods. The webhook is running with 3 replicas, and all constraints are marked as dryrun.
{"level":"error","ts":1596050289.4440393,"logger":"webhook","msg":"error executing query","hookType":"validation","error":"admission.k8s.gatekeeper.sh: eval_cancel_error: caller cancelled query execution\n","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/open-policy-agent/gatekeeper/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/open-policy-agent/gatekeeper/pkg/webhook.(*validationHandler).Handle\n\t/go/src/github.com/open-policy-agent/gatekeeper/pkg/webhook/policy.go:198\nsigs.k8s.io/controller-runtime/pkg/webhook/admission.(*Webhook).Handle\n\t/go/src/github.com/open-policy-agent/gatekeeper/vendor/sigs.k8s.io/controller-runtime/pkg/webhook/admission/webhook.go:135\nsigs.k8s.io/controller-runtime/pkg/webhook/admission.(*Webhook).ServeHTTP\n\t/go/src/github.com/open-policy-agent/gatekeeper/vendor/sigs.k8s.io/controller-runtime/pkg/webhook/admission/http.go:87\nsigs.k8s.io/controller-runtime/pkg/webhook.instrumentedHook.func1\n\t/go/src/github.com/open-policy-agent/gatekeeper/vendor/sigs.k8s.io/controller-runtime/pkg/webhook/server.go:129\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2036\nnet/http.(*ServeMux).ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2416\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2831\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:1919"}
Environment:
- Gatekeeper version: v3.1.0-beta.11
- Kubernetes version (use kubectl version): v1.16.10
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 34 (28 by maintainers)
To narrow this down, it would be useful to know how much time is spent evaluating the policy versus everything else that happens in the controller. The context can be cancelled for a variety of reasons, so it could just be a coincidence that the evaluator is the one noticing the cancellation. For example, if a deadline was set and the request was then blocked or queued, the deadline could have expired before policy evaluation even started.
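To make that distinction concrete, here is a minimal sketch (not Gatekeeper's actual handler code) of the two failure modes: the deadline expiring before evaluation starts versus during it. The `handle` and `evaluate` functions below are hypothetical stand-ins for the webhook handler and the constraint-framework review call.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// evaluate stands in for the policy evaluation call; like the OPA
// evaluator, it aborts as soon as the context is cancelled.
func evaluate(ctx context.Context) error {
	select {
	case <-time.After(500 * time.Millisecond): // pretend evaluation takes 500ms
		return nil
	case <-ctx.Done():
		return fmt.Errorf("eval_cancel_error: %w", ctx.Err())
	}
}

// handle stands in for the webhook request handler.
func handle(ctx context.Context) {
	// Case 1: the request sat blocked or queued and the deadline is
	// already gone before evaluation ever runs.
	if err := ctx.Err(); err != nil {
		fmt.Println("cancelled before eval started:", err)
		return
	}

	// Case 2: the deadline expires while evaluation is running, which
	// points at slow evaluation (or a CPU-starved node).
	start := time.Now()
	err := evaluate(ctx)
	fmt.Printf("eval took %s, err=%v\n", time.Since(start), err)
	if errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("deadline expired during evaluation")
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	handle(ctx)
}
```

Timing the evaluation call this way in the controller would show which of the two cases is actually happening.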
If the nodes have 8 cores and the controller is only using 0.5-1.0 cores, where are the rest of the CPU resources being spent? It would be helpful to know what the resource utilization looks like for the nodes where the controller is running; if the CPU on those nodes is maxed out, that would explain the cancellations.
In the meantime, I'll do some benchmarking to see what numbers we could expect given this number of templates.
UPDATE:
I’ve done some benchmarking and published the results here: https://github.com/tsandall/template-benchmark
@sozercan I’d be curious to see if the baseline numbers are close to what you’re seeing in the cluster. The changes I made to the target rego could potentially be upstreamed (they reduce latency by ~4x). It’s possible there is more room for optimization in the target rego. I’ll hold off until we know (1) what the latency looks like in the controller in your test environment and (2) how heavily utilized the nodes are.
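For reference, a rough sketch of the kind of Go benchmark that can be used for this sort of measurement, using OPA's rego package. The policy module, query, and input below are placeholders for illustration only; they are not the actual target rego or the templates from the linked benchmark repo.

```go
package templatebench

import (
	"context"
	"testing"

	"github.com/open-policy-agent/opa/rego"
)

// module is a placeholder policy standing in for a compiled constraint
// template; the real benchmark evaluates the Gatekeeper target rego.
const module = `
package example

violation[{"msg": msg}] {
	input.review.object.metadata.labels["owner"] == ""
	msg := "owner label must not be empty"
}
`

func BenchmarkTemplateEval(b *testing.B) {
	ctx := context.Background()

	// Prepare the query once, outside the timed loop, mirroring how the
	// webhook reuses compiled policies across requests.
	pq, err := rego.New(
		rego.Query("data.example.violation"),
		rego.Module("example.rego", module),
	).PrepareForEval(ctx)
	if err != nil {
		b.Fatal(err)
	}

	// Placeholder admission review input.
	input := map[string]interface{}{
		"review": map[string]interface{}{
			"object": map[string]interface{}{
				"metadata": map[string]interface{}{
					"labels": map[string]interface{}{"owner": ""},
				},
			},
		},
	}

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := pq.Eval(ctx, rego.EvalInput(input)); err != nil {
			b.Fatal(err)
		}
	}
}
```

Running it with `go test -bench .` gives a per-evaluation latency that can be compared against what the controller reports in the cluster.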