kyverno: Error from server (InternalError): Internal error occurred: failed calling webhook "validate.kyverno.svc-fail": Post "https://kyverno-svc.kyverno.svc:443/validate?timeout=10s": context deadline exceeded

We are using Kyverno 1.5.2 in our Kubernetes cluster (GKE, v1.20.10). We get this error while deleting or creating pods.

Please let us know how we can resolve this issue. We have already enabled port 9443 in the firewall.

Error from server (InternalError): Internal error occurred: failed calling webhook "validate.kyverno.svc-fail": Post "https://kyverno-svc.kyverno.svc:443/validate?timeout=10s": context deadline exceeded
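A quick way to confirm whether the API server has anything to talk to is to check the webhook pods and service endpoints (a minimal sketch, assuming the default kyverno namespace and service name):

# Are the Kyverno pods Running and Ready?
kubectl -n kyverno get pods

# Does the webhook service have any ready endpoints behind it?
kubectl -n kyverno get endpoints kyverno-svc

A "context deadline exceeded" here generally means the API server cannot reach the pods behind kyverno-svc: either no pods are ready, or (on private GKE clusters) a firewall rule is blocking master-to-node traffic on the webhook port.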

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 25 (11 by maintainers)

Most upvoted comments

Here are a few updates for this issue:

  1. Warning FailedCreate 6m59s (x49 over 8h) replicaset-controller Error creating: Internal error occurred: failed calling webhook "mutate.kyverno.svc-fail": Post "https://kyverno-svc.kyverno.svc:443/mutate?timeout=10s": no endpoints available for service "kyverno-svc"

There was a recent fix for the endpoint issue https://github.com/kyverno/kyverno/pull/2902.

  2. If multiple nodes get killed and there’s no guarantee of Kyverno running at all times, it’s recommended to scale Kyverno down to zero replicas so that all of its webhook configurations are garbage collected (see the command sketch after this list).

  3. If there’s no control over the node shutdown process, the webhook namespaceSelector can now be configured to exclude namespaces. This feature is available with the image 1.6-dev-latest; details at https://github.com/kyverno/kyverno/pull/2953.
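For the scale-down in item 2, the manual steps amount to something like the following (a sketch, assuming the default kyverno namespace and Deployment name; adjust for your install):

# Scale Kyverno to zero so it cleans up its webhook configurations on shutdown
kubectl -n kyverno scale deployment kyverno --replicas=0

# Verify the webhook configurations are gone before scaling back up
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations | grep kyverno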

Another workaround is to set the failurePolicy to Ignore, which lets admission requests pass through when the webhook cannot be reached.
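Per policy, that can be expressed in the policy spec (a sketch; recent Kyverno releases expose this as spec.failurePolicy, and the policy name and rule here are illustrative):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: example-policy
spec:
  # Fail open: if the Kyverno webhook is unreachable, admit the request
  # instead of rejecting it. The trade-off is that this policy is not
  # enforced while Kyverno is down.
  failurePolicy: Ignore
  rules:
  - name: require-team-label
    match:
      resources:
        kinds:
        - Pod
    validate:
      message: "Pods must have a team label."
      pattern:
        metadata:
          labels:
            team: "?*"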

Closing for now, please let me know if the above solutions do not solve your use case.

We have the same issue. If all Kyverno pods (3 replicas of kyverno-1.5.2) are down (for example, after liveness probe failures caused by k8s API timeouts), but kyverno-resource-validating-webhook-cfg still exists and validate.kyverno.svc-fail contains secrets in its resources list, then Kyverno cannot recover without manual intervention (deleting the webhook), and the following error appears in the log:

E0107 13:26:55.939049 1 certmanager.go:92] CertManager "msg"="initialization error" "error"="failed to write CA cert to secret: Internal error occurred: failed calling webhook \"validate.kyverno.svc-fail\": Post \"https://kyverno-svc.system-policy.svc:443/validate?timeout=10s\": context deadline exceeded"

We’re having similar issues. Our dev clusters run entirely on preemptible nodes, so GCP may kill all nodes near-simultaneously, ignoring things like PodDisruptionBudgets. Even with anti-affinity we still find all Kyverno replicas down, at which point the cluster is wedged.

Under the new autoUpdateWebhooks behaviour, there’s no namespaceSelector in the generated webhook configurations, so once Kyverno is down, if you have any policies that act on Pods, it can’t come back without human intervention.
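For reference, the exclusion mechanism from https://github.com/kyverno/kyverno/pull/2953 is configured through Kyverno's ConfigMap (a sketch against the 1.6 pre-release; the exact key and value format may differ in your version, and the kubernetes.io/metadata.name namespace label assumed here requires Kubernetes 1.21+):

apiVersion: v1
kind: ConfigMap
metadata:
  name: kyverno
  namespace: kyverno
data:
  # Add a namespaceSelector to the generated webhooks so that Kyverno's own
  # namespace is excluded; its pods can then start even while the webhook
  # service has no endpoints.
  webhooks: '[{"namespaceSelector": {"matchExpressions": [{"key": "kubernetes.io/metadata.name", "operator": "NotIn", "values": ["kyverno"]}]}}]'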

@chipzoller We have preemptible VMs enabled in our dev clusters; when nodes go down, the Kyverno pods get evicted/shut down. This creates a dependency in the cluster: the Kyverno pod must be running before pods in other namespaces can come up. I have added exclusions to policies in the cluster, like the one below, but with no luck.

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-safe-to-evict
  annotations:
    pod-policies.kyverno.io/autogen-controllers: none
spec:
  rules:
  - name: "add-safe-to-evict-to-pods"
    match:
      resources:
        kinds:
        - Pod
    exclude:
      resources:
        namespaces:
        - kube-system
        - istio-system
    mutate:
      patchStrategicMerge:
        metadata:
          annotations:
            cluster-autoscaler.kubernetes.io/safe-to-evict: "true"

If the Kyverno pods are not running, all of our cluster components fail to come up, even the Istio services.

As a workaround, I need to delete the validatingwebhookconfigurations and mutatingwebhookconfigurations below (see the commands sketched after the listing); then everything comes up.

sachin@Sachins-MacBook-Pro  % kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations
NAME                                                                                                                          WEBHOOKS   AGE
validatingwebhookconfiguration.admissionregistration.k8s.io/kyverno-policy-validating-webhook-cfg                             1          4h11m
validatingwebhookconfiguration.admissionregistration.k8s.io/kyverno-resource-validating-webhook-cfg                           2          4h11m


NAME                                                                                              WEBHOOKS   AGE
mutatingwebhookconfiguration.admissionregistration.k8s.io/kyverno-policy-mutating-webhook-cfg     1          4h11m
mutatingwebhookconfiguration.admissionregistration.k8s.io/kyverno-resource-mutating-webhook-cfg   2          4h11m
mutatingwebhookconfiguration.admissionregistration.k8s.io/kyverno-verify-mutating-webhook-cfg     1          4h11m
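Concretely, that cleanup amounts to something like this (resource names taken from the listing above; Kyverno recreates these webhook configurations when it starts up again):

kubectl delete validatingwebhookconfiguration kyverno-resource-validating-webhook-cfg
kubectl delete mutatingwebhookconfiguration kyverno-resource-mutating-webhook-cfg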

This is not the expected behavior. Please let me know how to mitigate this issue.