kyverno: Error from server (InternalError): Internal error occurred: failed calling webhook "validate.kyverno.svc-fail": Post "https://kyverno-svc.kyverno.svc:443/validate?timeout=10s": context deadline exceeded
We are using Kyverno 1.5.2 in our Kubernetes cluster (a GKE cluster, v1.20.10) and are getting this error while deleting or creating pods.
Please let us know how we can resolve this issue. We already enabled port 9443 in the firewall.
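For context, on a private GKE cluster the control plane must be allowed to reach the webhook pods on their container port. A sketch of the kind of firewall rule involved; NETWORK, MASTER_CIDR, and NODE_TAG are placeholders, not values from this thread:

```sh
# Hypothetical GCP firewall rule allowing the GKE control plane to reach the
# Kyverno webhook pods on container port 9443. Replace NETWORK, MASTER_CIDR,
# and NODE_TAG with your cluster's VPC network, master CIDR, and node tag.
gcloud compute firewall-rules create allow-kyverno-webhook \
  --network=NETWORK \
  --source-ranges=MASTER_CIDR \
  --target-tags=NODE_TAG \
  --allow=tcp:9443
```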
Error from server (InternalError): Internal error occurred: failed calling webhook "validate.kyverno.svc-fail": Post "https://kyverno-svc.kyverno.svc:443/validate?timeout=10s": context deadline exceeded
About this issue
- State: closed
- Created 3 years ago
- Comments: 25 (11 by maintainers)
Here are a few updates for this issue:
There was a recent fix for the endpoint issue https://github.com/kyverno/kyverno/pull/2902.
If multiple nodes get killed and there's no guarantee that Kyverno will be running at all times, it's recommended to scale Kyverno down to zero replicas in order to garbage collect all of its webhook configurations.
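A minimal sketch, assuming the default `kyverno` Deployment in the `kyverno` namespace:

```sh
# Scale Kyverno to zero replicas; on a graceful shutdown Kyverno removes
# (garbage collects) the webhook configurations it registered.
kubectl scale deployment kyverno -n kyverno --replicas=0

# Scale back up once the cluster is stable again.
kubectl scale deployment kyverno -n kyverno --replicas=3
```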
If there's no control over the node shutdown process, the webhook `namespaceSelector` can now be configured to exclude namespaces. This feature is available with the image `1.6-dev-latest`; details at https://github.com/kyverno/kyverno/pull/2953, and a sketch follows.
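A sketch of that configuration, assuming the mechanism from PR 2953 is a `webhooks` key in the default Kyverno ConfigMap; verify the exact key and format against the PR, and note the selector shown is illustrative:

```sh
# Assumed 1.6-dev mechanism: a "webhooks" key in the Kyverno ConfigMap that
# sets a namespaceSelector on the managed webhook configurations. This example
# excludes the kyverno namespace itself so the webhook never blocks Kyverno's
# own pods from starting. The kubernetes.io/metadata.name label requires
# Kubernetes 1.21+; on older clusters use a namespace label you manage.
kubectl patch configmap kyverno -n kyverno --type merge -p '{
  "data": {
    "webhooks": "[{\"namespaceSelector\":{\"matchExpressions\":[{\"key\":\"kubernetes.io/metadata.name\",\"operator\":\"NotIn\",\"values\":[\"kyverno\"]}]}}]"
  }
}'
```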
Another workaround is to set `failurePolicy` to `Ignore` so that errors are ignored and admission requests pass through.
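A minimal sketch, assuming the `spec.failurePolicy` field available on Kyverno 1.5+ policies; the policy name and rule below are hypothetical:

```sh
# Illustrative policy with spec.failurePolicy: Ignore, which fails open so
# admission requests pass through when the webhook is unreachable.
kubectl apply -f - <<'EOF'
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-team-label      # hypothetical policy
spec:
  failurePolicy: Ignore         # fail open if the webhook cannot be reached
  validationFailureAction: audit
  rules:
    - name: check-team-label
      match:
        resources:
          kinds:
            - Pod
      validate:
        message: "The label 'team' is required."
        pattern:
          metadata:
            labels:
              team: "?*"
EOF
```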
Closing for now, please let me know if the above solutions do not solve your use case.
We have the same issue. If all Kyverno pods (3 replicas of kyverno-1.5.2) are down (for example, due to liveness probe failures caused by Kubernetes API timeouts), but `kyverno-resource-validating-webhook-cfg` exists and `validate.kyverno.svc-fail` contains secrets in its resources list, then Kyverno cannot recover without manual intervention (deleting the webhook), and the following error appears in the log:
E0107 13:26:55.939049 1 certmanager.go:92] CertManager "msg"="initialization error" "error"="failed to write CA cert to secret: Internal error occurred: failed calling webhook \"validate.kyverno.svc-fail\": Post \"https://kyverno-svc.system-policy.svc:443/validate?timeout=10s\": context deadline exceeded"
We're having similar issues: for our dev clusters, which are entirely preemptible, GCP may kill all nodes near-simultaneously, ignoring things like PodDisruptionBudgets. Even with anti-affinity we're still finding all Kyverno replicas down, at which point the cluster is wedged.
Under the new `autoUpdateWebhooks` behaviour, there's no `namespaceSelector` in the generated webhook configurations, so once Kyverno is down, if you have any policies that act on Pods, it can't come back without human intervention.
@chipzoller We have preemptible VMs enabled in our dev clusters; when nodes go down, the Kyverno pods get evicted/shut down. I see there is a dependency in the cluster: the Kyverno pod should be running while pods in other namespaces are coming up. I have added exclusions to my policies (along the lines of the sketch below) but no luck. If the Kyverno pod is not running, all of our cluster components fail to come up, even the Istio services.
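The poster's actual snippet was not preserved in this thread; the following is a hypothetical reconstruction of that kind of per-rule namespace exclusion in a Kyverno 1.5-era policy, with illustrative names and namespaces:

```sh
# Hypothetical example of a rule-level namespace exclusion; the policy name,
# rule, and namespace list are illustrative, not from the original post.
kubectl apply -f - <<'EOF'
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: example-policy
spec:
  validationFailureAction: enforce
  rules:
    - name: example-rule
      match:
        resources:
          kinds:
            - Pod
      exclude:
        resources:
          namespaces:           # namespaces exempted from this rule
            - kube-system
            - istio-system
            - kyverno
      validate:
        message: "Pods must carry an app label."
        pattern:
          metadata:
            labels:
              app: "?*"
EOF
```

Note that a rule-level `exclude` only exempts resources from the policy rule; it does not change the generated webhook configuration itself, which is consistent with the earlier observation that the webhooks carry no `namespaceSelector` and so the cluster can still wedge while Kyverno is down.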
As a workaround I need to delete the `validatingwebhookconfiguration` and `mutatingwebhookconfiguration` resources, as sketched below.
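A sketch of that manual cleanup; the validating webhook name appears earlier in this thread, while the mutating webhook name is an assumption based on Kyverno's defaults, so verify both first:

```sh
# List the webhook configurations to confirm the exact names in your cluster.
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations

# Remove Kyverno's webhook configurations so admission requests stop being
# routed to the unavailable webhook; Kyverno recreates them when it restarts.
kubectl delete validatingwebhookconfiguration kyverno-resource-validating-webhook-cfg
kubectl delete mutatingwebhookconfiguration kyverno-resource-mutating-webhook-cfg
```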
Then everything comes up. This is not the expected behavior; please let me know how to mitigate this issue.