gatekeeper: eval_cancel_error: caller cancelled query execution
What steps did you take and what happened: In a cluster with a large number of pods (10k+) and 40 constraint templates, we are seeing the following error in the webhook pods. The webhook is running with 3 replicas, and all constraints are marked as dryrun.
{"level":"error","ts":1596050289.4440393,"logger":"webhook","msg":"error executing query","hookType":"validation","error":"admission.k8s.gatekeeper.sh: eval_cancel_error: caller cancelled query execution\n","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/open-policy-agent/gatekeeper/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/open-policy-agent/gatekeeper/pkg/webhook.(*validationHandler).Handle\n\t/go/src/github.com/open-policy-agent/gatekeeper/pkg/webhook/policy.go:198\nsigs.k8s.io/controller-runtime/pkg/webhook/admission.(*Webhook).Handle\n\t/go/src/github.com/open-policy-agent/gatekeeper/vendor/sigs.k8s.io/controller-runtime/pkg/webhook/admission/webhook.go:135\nsigs.k8s.io/controller-runtime/pkg/webhook/admission.(*Webhook).ServeHTTP\n\t/go/src/github.com/open-policy-agent/gatekeeper/vendor/sigs.k8s.io/controller-runtime/pkg/webhook/admission/http.go:87\nsigs.k8s.io/controller-runtime/pkg/webhook.instrumentedHook.func1\n\t/go/src/github.com/open-policy-agent/gatekeeper/vendor/sigs.k8s.io/controller-runtime/pkg/webhook/server.go:129\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2036\nnet/http.(*ServeMux).ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2416\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2831\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:1919"}
Environment:
- Gatekeeper version: v3.1.0-beta.11
- Kubernetes version (use kubectl version): v1.16.10
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 34 (28 by maintainers)
To narrow this down, it would be useful to know how much time is spent evaluating the policy versus everything else that happens in the controller. The context can be cancelled for a variety of reasons, so it could just be a coincidence that the evaluator is the one noticing the cancellation. For example, if a deadline was set and the request was then blocked or queued, the deadline could have expired before policy evaluation even started.
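To make that distinction concrete, here is a minimal sketch (not Gatekeeper's actual handler code) of the two failure modes: the deadline expiring before evaluation starts versus during it. The `handle` and `evaluate` functions below are hypothetical stand-ins for the webhook handler and the constraint-framework review call.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// evaluate stands in for the policy evaluation call; like the OPA
// evaluator, it aborts as soon as the context is cancelled.
func evaluate(ctx context.Context) error {
	select {
	case <-time.After(500 * time.Millisecond): // pretend evaluation takes 500ms
		return nil
	case <-ctx.Done():
		return fmt.Errorf("eval_cancel_error: %w", ctx.Err())
	}
}

// handle stands in for the webhook request handler.
func handle(ctx context.Context) {
	// Case 1: the request sat blocked or queued and the deadline is
	// already gone before evaluation ever runs.
	if err := ctx.Err(); err != nil {
		fmt.Println("cancelled before eval started:", err)
		return
	}

	// Case 2: the deadline expires while evaluation is running, which
	// points at slow evaluation (or a CPU-starved node).
	start := time.Now()
	err := evaluate(ctx)
	fmt.Printf("eval took %s, err=%v\n", time.Since(start), err)
	if errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("deadline expired during evaluation")
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	handle(ctx)
}
```

Timing the evaluation call this way in the controller would show which of the two cases is actually happening.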
If the nodes have 8 cores and the controller is only using 0.5-1.0 cores, where are the rest of the CPU resources being spent? It would be helpful to know what the resource utilization looks like for the nodes where the controller is running; if the CPU on those nodes is maxed out, that would explain the cancellations.
In the meantime, I'll do some benchmarking to see what numbers we could expect given this number of templates.
UPDATE:
I’ve done some benchmarking and published the results here: https://github.com/tsandall/template-benchmark
@sozercan I’d be curious to see if the baseline numbers are close to what you’re seeing in the cluster. The changes I made to the target rego could potentially be upstreamed (they reduce latency by ~4x). It’s possible there is more room for optimization in the target rego. I’ll hold off until we know (1) what the latency looks like in the controller in your test environment and (2) how heavily utilized the nodes are.
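For reference, a rough sketch of the kind of Go benchmark that can be used for this sort of measurement, using OPA's rego package. The policy module, query, and input below are placeholders for illustration only; they are not the actual target rego or the templates from the linked benchmark repo.

```go
package templatebench

import (
	"context"
	"testing"

	"github.com/open-policy-agent/opa/rego"
)

// module is a placeholder policy standing in for a compiled constraint
// template; the real benchmark evaluates the Gatekeeper target rego.
const module = `
package example

violation[{"msg": msg}] {
	input.review.object.metadata.labels["owner"] == ""
	msg := "owner label must not be empty"
}
`

func BenchmarkTemplateEval(b *testing.B) {
	ctx := context.Background()

	// Prepare the query once, outside the timed loop, mirroring how the
	// webhook reuses compiled policies across requests.
	pq, err := rego.New(
		rego.Query("data.example.violation"),
		rego.Module("example.rego", module),
	).PrepareForEval(ctx)
	if err != nil {
		b.Fatal(err)
	}

	// Placeholder admission review input.
	input := map[string]interface{}{
		"review": map[string]interface{}{
			"object": map[string]interface{}{
				"metadata": map[string]interface{}{
					"labels": map[string]interface{}{"owner": ""},
				},
			},
		},
	}

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := pq.Eval(ctx, rego.EvalInput(input)); err != nil {
			b.Fatal(err)
		}
	}
}
```

Running it with `go test -bench .` gives a per-evaluation latency that can be compared against what the controller reports in the cluster.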