kyverno: [Bug] Webhook Controller Endless Loop

Kyverno Version

1.9.0

Kubernetes Version

1.25.x

Kubernetes Platform

AKS

Kyverno Rule Type

Other

Description

We currently have a problem where the webhook controller recreates all mutating and validating webhook configurations in an endless loop. We have Kyverno installed in version 1.9.1 (we also tested 1.9.0 with the same results).

It looks like the same problem as the following issues:

which are all closed. After investigating further and raising the log level to `--v=4`, we have seen that the webhook-controller is the misbehaving component. Interestingly, there are no errors in the logs about updating or creating the webhooks, so if I understand the functionality correctly, the controller should stop updating the webhooks after the first successful update.

As an example, I have filtered the logs for webhook-controller events:

kubectl logs -n kyverno aks-infra-kyverno-7f8bd6764b-6rgkt -f | grep webhook-controller
Defaulted container "kyverno" out of: kyverno, kyverno-pre (init)
I0310 15:34:04.892045       1 controller.go:28] setup/leader/controllers "msg"="starting controller" "name"="webhook-controller" "workers"=2
I0310 15:34:04.892061       1 run.go:19] webhook-controller "msg"="starting ..."
I0310 15:34:04.892089       1 run.go:39] webhook-controller/routine "msg"="starting routine" "id"=0
I0310 15:34:04.892477       1 run.go:30] webhook-controller/worker "msg"="starting worker" "id"=0
I0310 15:34:04.892512       1 run.go:94] webhook-controller/worker "msg"="reconciling ..." "id"=0 "key"="kyverno-policy-mutating-webhook-cfg" "name"="kyverno-policy-mutating-webhook-cfg" "namespace"=""
I0310 15:34:04.892696       1 run.go:30] webhook-controller/worker "msg"="starting worker" "id"=1
I0310 15:34:04.892727       1 run.go:94] webhook-controller/worker "msg"="reconciling ..." "id"=1 "key"="kyverno-resource-mutating-webhook-cfg" "name"="kyverno-resource-mutating-webhook-cfg" "namespace"=""
I0310 15:34:04.912706       1 run.go:96] webhook-controller/worker "msg"="done" "duration"="23.402µs" "id"=1 "key"="kyverno-resource-mutating-webhook-cfg" "name"="kyverno-resource-mutating-webhook-cfg" "namespace"=""
I0310 15:34:04.912737       1 run.go:94] webhook-controller/worker "msg"="reconciling ..." "id"=1 "key"="kyverno-verify-mutating-webhook-cfg" "name"="kyverno-verify-mutating-webhook-cfg" "namespace"=""
I0310 15:34:04.914684       1 run.go:96] webhook-controller/worker "msg"="done" "duration"="35.003µs" "id"=0 "key"="kyverno-policy-mutating-webhook-cfg" "name"="kyverno-policy-mutating-webhook-cfg" "namespace"=""
I0310 15:34:04.914711       1 run.go:94] webhook-controller/worker "msg"="reconciling ..." "id"=0 "key"="aks-node-mutating-webhook" "name"="aks-node-mutating-webhook" "namespace"=""
I0310 15:34:04.914725       1 run.go:96] webhook-controller/worker "msg"="done" "duration"="13.801µs" "id"=0 "key"="aks-node-mutating-webhook" "name"="aks-node-mutating-webhook" "namespace"=""
I0310 15:34:04.914742       1 run.go:94] webhook-controller/worker "msg"="reconciling ..." "id"=0 "key"="aks-webhook-admission-controller" "name"="aks-webhook-admission-controller" "namespace"=""
I0310 15:34:04.914761       1 run.go:96] webhook-controller/worker "msg"="done" "duration"="15.902µs" "id"=0 "key"="aks-webhook-admission-controller" "name"="aks-webhook-admission-controller" "namespace"=""
I0310 15:34:04.914781       1 run.go:94] webhook-controller/worker "msg"="reconciling ..." "id"=0 "key"="cert-manager-webhook" "name"="cert-manager-webhook" "namespace"=""
I0310 15:34:04.914806       1 run.go:96] webhook-controller/worker "msg"="done" "duration"="21.701µs" "id"=0 "key"="cert-manager-webhook" "name"="cert-manager-webhook" "namespace"=""
I0310 15:34:04.914826       1 run.go:94] webhook-controller/worker "msg"="reconciling ..." "id"=0 "key"="ingress-nginx-admission" "name"="ingress-nginx-admission" "namespace"=""
I0310 15:34:04.914850       1 run.go:96] webhook-controller/worker "msg"="done" "duration"="22.402µs" "id"=0 "key"="ingress-nginx-admission" "name"="ingress-nginx-admission" "namespace"=""
I0310 15:34:04.914871       1 run.go:94] webhook-controller/worker "msg"="reconciling ..." "id"=0 "key"="ingress-nginx-public-admission" "name"="ingress-nginx-public-admission" "namespace"=""
I0310 15:34:04.914889       1 run.go:96] webhook-controller/worker "msg"="done" "duration"="22.802µs" "id"=0 "key"="ingress-nginx-public-admission" "name"="ingress-nginx-public-admission" "namespace"=""
I0310 15:34:04.914904       1 run.go:94] webhook-controller/worker "msg"="reconciling ..." "id"=0 "key"="kyverno-policy-validating-webhook-cfg" "name"="kyverno-policy-validating-webhook-cfg" "namespace"=""
I0310 15:34:04.925285       1 run.go:96] webhook-controller/worker "msg"="done" "duration"="16.101µs" "id"=1 "key"="kyverno-verify-mutating-webhook-cfg" "name"="kyverno-verify-mutating-webhook-cfg" "namespace"=""
I0310 15:34:04.925315       1 run.go:94] webhook-controller/worker "msg"="reconciling ..." "id"=1 "key"="kyverno-resource-validating-webhook-cfg" "name"="kyverno-resource-validating-webhook-cfg" "namespace"=""
I0310 15:34:04.931332       1 run.go:96] webhook-controller/worker "msg"="done" "duration"="17.502µs" "id"=0 "key"="kyverno-policy-validating-webhook-cfg" "name"="kyverno-policy-validating-webhook-cfg" "namespace"=""
I0310 15:34:04.931373       1 run.go:94] webhook-controller/worker "msg"="reconciling ..." "id"=0 "key"="aks-node-validating-webhook" "name"="aks-node-validating-webhook" "namespace"=""
I0310 15:34:04.931396       1 run.go:96] webhook-controller/worker "msg"="done" "duration"="24.202µs" "id"=0 
[ . . . ]

As a follow-up effect, we get client-side throttling messages from the API server:

I0310 15:37:58.508759       1 request.go:614] Waited for 73.498116ms due to client-side throttling, not priority and fairness, request: PUT:https://192.168.0.1:443/apis/admissionregistration.k8s.io/v1/mutatingwebhookconfigurations/kyverno-policy-mutating-webhook-cfg

To count the updates performed by the webhook-controller, we watched the webhook configuration's `metadata.generation`. After around 5 minutes, it had already reached 2846 updates:

$ kubectl get validatingwebhookconfigurations.admissionregistration.k8s.io kyverno-cleanup-validating-webhook-cfg
NAME                                     WEBHOOKS   AGE
kyverno-cleanup-validating-webhook-cfg   1          5m4s

$ kubectl get validatingwebhookconfigurations.admissionregistration.k8s.io kyverno-cleanup-validating-webhook-cfg -o yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  creationTimestamp: "2023-03-10T14:49:19Z"
  generation: 2846
  labels:
    webhook.kyverno.io/managed-by: kyverno
  name: kyverno-cleanup-validating-webhook-cfg
  resourceVersion: "245823835"
  uid: 06eb8596-4d13-4264-a651-c8fabef64cdd
[ . . . ]
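For reference, the update rate can be estimated by sampling `metadata.generation` twice. This is a minimal sketch, assuming `kubectl` is configured against the affected cluster and using the same webhook configuration name as above:

```shell
# Sketch: estimate the webhook-controller's update rate by sampling
# metadata.generation twice, 60 seconds apart.
cfg=kyverno-cleanup-validating-webhook-cfg
g1=$(kubectl get validatingwebhookconfiguration "$cfg" -o jsonpath='{.metadata.generation}')
sleep 60
g2=$(kubectl get validatingwebhookconfiguration "$cfg" -o jsonpath='{.metadata.generation}')
echo "updates in the last minute: $((g2 - g1))"
```

On a healthy cluster this should print 0 after the initial reconcile; in our case it climbs continuously.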

We also tried deleting all ClusterPolicies and Policies, but that did not change the endless-loop behavior.

As a last hint, we also see this behavior on other AKS clusters with different Kubernetes versions (1.24 and 1.25).

Maybe someone has an idea of what we can check or try in order to restore the correct behavior.

Thanks in advance, Dennis

PS: The full logs can be found on pastebin: https://pastebin.com/MYpxAKe4

Steps to reproduce

  1. Install Kyverno 1.9.x on an AKS-managed Kubernetes cluster with 3 replicas (Helm chart version 2.7.1)
  2. Watch the webhook configurations being updated in an endless loop

Expected behavior

After the webhook configurations are first created correctly, the webhook-controller should stop updating them.

Screenshots

No response

Kyverno logs

No response

Slack discussion

No response

Troubleshooting

  • I have read and followed the documentation AND the troubleshooting guide.
  • I have searched other issues in this repository and mine is not recorded.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 43 (25 by maintainers)

Most upvoted comments

Glad I found this issue, as we at @swisspost face the same problem on our AKS clusters.

As @dhemeier posted in the comments above, I also found it in the AKS FAQ: https://learn.microsoft.com/en-us/azure/aks/faq#can-admission-controller-webhooks-impact-kube-system-and-internal-aks-namespaces

Can admission controller webhooks impact kube-system and internal AKS namespaces?

To protect the stability of the system and prevent custom admission controllers from impacting internal services in the kube-system namespace, AKS has an Admissions Enforcer, which automatically excludes kube-system and AKS internal namespaces. This service ensures the custom admission controllers don't affect the services running in kube-system.

If you have a critical use case for deploying something on kube-system (not recommended) in support of your custom admission webhook, you may add the following label or annotation so that Admissions Enforcer ignores it.

Label: "admissions.enforcer/disabled": "true" or Annotation: "admissions.enforcer/disabled": true
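As a sketch of what that looks like on a Kyverno-managed webhook configuration (note that Kyverno may overwrite manual edits on its next reconcile, which is why the proper fix is for Kyverno to set this itself, as the 1.9.3 backport does):

```yaml
# Label from the AKS FAQ that tells the Admissions Enforcer to
# leave this webhook configuration alone.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: kyverno-resource-mutating-webhook-cfg
  labels:
    admissions.enforcer/disabled: "true"
```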

-> We would also appreciate a backported fix in 1.9.x

1.9.3 is available with this backport: https://github.com/kyverno/kyverno/releases/tag/v1.9.3

Fun fact: I found an admission webhook at generation 45,678,532 😄