kyverno: [BUG] Validation webhook fails and stops resource interactions every time kyverno Helmrelease is down

Software version numbers

  • Kubernetes version: 1.21.2
  • Kubernetes platform (if applicable; ex., EKS, GKE, OpenShift): AKS
  • Kyverno version: v1.5.1

Describe the bug

If, for whatever reason, the Kyverno release is down (in my case, an OOM error in only one of the environments) while the Kyverno deployment and HelmRelease still exist, the validating webhook kyverno-resource-validating-webhook-cfg blocks all interaction with Kubernetes resources.

This should not be the behaviour, especially if the kyverno-policies validationFailureAction is in audit mode.

To Reproduce

Steps to reproduce the behavior:

  1. Install the Kyverno and Kyverno-policies Helm releases:
       helm install kyverno kyverno/kyverno --namespace kyverno --create-namespace
       helm install kyverno-policies kyverno/kyverno-policies --namespace kyverno
  2. Make the Kyverno pod(s) unavailable without deleting the validatingwebhookconfigurations, to simulate an error (e.g. edit the pod and change the image URL):
       k edit pod <kyverno-pod> -n kyverno

Expected behavior

The webhook shouldn't block creating pods, deleting pods, and so on, especially when the policy is only in audit mode. Audit-mode policies should not be able to affect other namespaces when Kyverno itself is failing.

Screenshots

Additional context

After steps 1 and 2, I cannot even delete a pod:

  k delete pod <pod-1>

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 3
  • Comments: 22 (10 by maintainers)

Most upvoted comments

You can run Kyverno with multiple replicas to prevent a single point of failure. If all Kyverno instances fail for some reason, we'll need to troubleshoot that separately.
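As a rough sketch of the suggestion above, the replica count could be raised through the Helm values. This assumes the kyverno chart exposes a `replicaCount` value; check the values.yaml of the chart version you are running before relying on it.

```yaml
# values.yaml for the kyverno chart (sketch; `replicaCount` is assumed
# to be the value name used by your chart version)
replicaCount: 3   # more than one replica, so a single failing pod does not take the webhook down
```

This only helps when some replicas stay healthy. It does not cover the case where every Kyverno pod is unavailable at once.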

With the default policies from the kyverno-policies chart, running multiple Kyverno replicas doesn't prevent API server requests from being rejected when, for example, the deployment is deleted. All Kyverno pods are set to TERMINATING and go to the NOT-READY state. All endpoints of the kyverno-svc service are removed, so the API server has no endpoint left to validate requests against. This blocks further updates to any pod, including the Kyverno pods themselves, which then can't actually terminate. I'm not even sure the processes inside the containers are sent a shutdown signal. Because the processes don't shut down, the ValidatingWebhookConfiguration object is not auto-deleted, and everything stays locked until someone deletes the ValidatingWebhookConfiguration manually.
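To make the mechanism concrete, here is an illustrative sketch of a validating webhook registration of the kind that causes this lock-up. The object name and the kyverno-svc service come from this thread. The webhook name, path, rules, and timeout are assumptions for illustration, not the exact object Kyverno generates, and the CA bundle is omitted.

```yaml
# Illustrative only: Kyverno creates and manages this object itself,
# and the real webhook names, paths, and rules vary by version.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: kyverno-resource-validating-webhook-cfg
webhooks:
- name: validate.kyverno.svc            # assumed webhook name
  clientConfig:
    service:
      name: kyverno-svc                 # service whose endpoints disappear when all pods are not ready
      namespace: kyverno
      path: /validate                   # assumed path
  failurePolicy: Fail                   # with Fail, requests are rejected when the webhook cannot be reached
  rules:
  - apiGroups: ["*"]
    apiVersions: ["*"]
    operations: ["CREATE", "UPDATE", "DELETE"]
    resources: ["*/*"]
  sideEffects: NoneOnDryRun
  admissionReviewVersions: ["v1"]
  timeoutSeconds: 10
```

With `failurePolicy: Fail` and no ready endpoint behind the service, every matching request is denied, which is why even pod deletion is blocked until the configuration is removed or the failure policy is changed.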

If you've specified --autoUpdateWebhooks=false, you can additionally configure a namespace selector to exclude Kyverno's own namespace. That should prevent the situation described by @demikl from happening.

Example values for helm:

extraArgs:
- --autoUpdateWebhooks=false

config:
  webhooks:
  - namespaceSelector:
      matchExpressions:
      - key: kubernetes.io/metadata.name
        operator: NotIn
        values:
        - kyverno
        - kube-system

This is currently not possible to do without setting --autoUpdateWebhooks=false, but there’s a ticket for that: #2320

@admincasper - for the original issue described at the top, you need to set spec.failurePolicy to Ignore in the policy so that admission requests pass if Kyverno is not responding.
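A minimal sketch of what that could look like on one policy. The policy name is taken from the sample policies in this thread; the rule body is abbreviated and should be treated as an assumption rather than the exact chart content.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged-containers
spec:
  validationFailureAction: audit   # report violations, do not block
  failurePolicy: Ignore            # let requests through when Kyverno does not respond
  background: true
  rules:
  - name: privileged-containers    # abbreviated rule, for illustration only
    match:
      resources:
        kinds:
        - Pod
    validate:
      message: "Privileged mode is disallowed."
      pattern:
        spec:
          containers:
          - =(securityContext):
              =(privileged): "false"
```

The trade-off is that with Ignore, policies are silently skipped while Kyverno is down, which is usually acceptable for audit-mode policies but may not be for enforcing ones.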

In my case, I installed the same sample policies with the Ignore failure policy and edited the Kyverno container to use an invalid image. I was still able to create and delete pods. Tested against v1.5.4.

~ k get cpol -o wide
NAME                             BACKGROUND   ACTION   FAILURE POLICY   READY
disallow-add-capabilities        true         audit    Ignore           true
disallow-host-namespaces         true         audit    Ignore           true
disallow-host-path               true         audit    Ignore           true
disallow-host-ports              true         audit    Ignore           true
disallow-privileged-containers   true         audit    Ignore           true
disallow-selinux                 true         audit    Ignore           true
require-default-proc-mount       true         audit    Ignore           true
restrict-apparmor-profiles       true         audit    Ignore           true
restrict-sysctls                 true         audit    Ignore           true
~ k get -n kyverno pod -w
NAME                       READY   STATUS             RESTARTS   AGE
kyverno-565957c8d5-kt6n2   0/1     ImagePullBackOff   0          5m30s
~ k run nginx --image=nginx:latest
pod/nginx created

~ k delete pod nginx
pod "nginx" deleted