kubewarden-controller: Investigate "webhook not ready" failures when installing cert-manager from the Helm chart
With the cert-manager helm chart:
$ helm repo add jetstack https://charts.jetstack.io && helm repo update
$ helm install --wait cert-manager jetstack/cert-manager --create-namespace -n cert-manager --set installCRDs=true
$ helm repo add kubewarden https://charts.kubewarden.io && helm repo update
$ helm install --create-namespace -n kubewarden kubewarden-crds kubewarden/kubewarden-crds
$ helm install --wait -n kubewarden kubewarden-controller kubewarden/kubewarden-controller
Every time I run those commands on a fresh k3d cluster, the kubewarden-controller post-install hook that creates the default policy-server fails:
Error: failed post-install: warning: Hook post-install kubewarden-controller/templates/policyserver-default.yaml failed: Internal error occurred: failed calling webhook "mpolicyserver.kb.io": Post "https://kubewarden-controller-webhook-service.kubewarden.svc:443/mutate-policies-kubewarden-io-v1alpha2-policyserver?timeout=10s": dial tcp 10.43.7.43:443: connect: connection refused
or
Error: failed post-install: warning: Hook post-install kubewarden-controller/templates/policyserver-default.yaml failed: Internal error occurred: failed calling webhook "mpolicyserver.kb.io": Post "https://kubewarden-controller-webhook-service.kubewarden.svc:443/mutate-policies-kubewarden-io-v1alpha2-policyserver?timeout=10s": context deadline exceeded
This doesn’t happen to me when installing cert-manager via the static manifest:
$ kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.5.4/cert-manager.yaml
I suppose it’s just a timing issue on my system, but it’s incredibly annoying. I looked at the manager readiness probes and at the state of the resources after deploying our kubewarden-controller chart, and they seem fine. I don’t see yet why this happens.
About this issue
- State: closed
- Created 3 years ago
- Comments: 16 (16 by maintainers)
Thanks a lot @jvanz, this is really helpful! Also, I hit this problem once in my environment when running the e2e tests several times, so I can also reproduce the issue to some extent 😃
Great investigation work. Kudos!
I think I know what’s going on… Kubernetes Services, by default, use the iptables proxy mode. This means that when a process tries to access a ClusterIP, iptables rules redirect the packets to the right pod. These rules are written by kube-proxy, which watches the control plane for new Services and Endpoints.
My hypothesis is that kube-proxy is not fast enough to detect the changes and rewrite the iptables rules. Thus, when the kube-apiserver tries to reach the webhook Service, there is no route to the pod yet, causing the error. This also explains why, after a very short time, we can reach the service.
To validate this idea I retested using the script from https://github.com/kubewarden/kubewarden-controller/issues/110#issuecomment-1035524285 and changed the line where the minikube cluster is started so that kube-proxy runs in `userspace` mode.
With the proxy mode set to `userspace`, the ClusterIP now points to kube-proxy, which then proxies the packets to the pod. It works! Again, at least my script is not failing anymore.
Considering that we already have two workarounds for this, I believe we can close this issue. Furthermore, if we want to improve the user experience, we could add some mechanics to detect this behaviour during the Helm installation, as mentioned in https://github.com/kubewarden/kubewarden-controller/issues/110#issuecomment-1035524285
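For illustration only, here is a minimal sketch (purely hypothetical, not something the chart ships) of the kind of check such a pre-install step could run from inside the cluster: keep dialing the webhook Service until the port stops refusing connections, which is exactly the window the kube-proxy hypothesis predicts.

```go
// Hypothetical probe, intended to run inside the cluster (e.g. from a hook Job).
// It dials the webhook Service address taken from the error message above and
// reports how long it takes until the connection is no longer refused.
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

func main() {
	const addr = "kubewarden-controller-webhook-service.kubewarden.svc:443"
	deadline := time.Now().Add(2 * time.Minute)

	start := time.Now()
	for time.Now().Before(deadline) {
		conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
		if err == nil {
			conn.Close()
			fmt.Printf("webhook service reachable after %s\n", time.Since(start))
			return
		}
		fmt.Printf("not reachable yet (%s): %v\n", time.Since(start), err)
		time.Sleep(500 * time.Millisecond)
	}
	fmt.Fprintln(os.Stderr, "gave up waiting for the webhook service")
	os.Exit(1)
}
```

Note that a plain TCP dial only rules out the "connection refused" case; catching the "context deadline exceeded" variant would need a full TLS handshake or an HTTP request against the webhook path.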
@viccuad feel free to reopen if there is something missing.
I didn’t have the time to check that. I found some issues with @viccuad on this environment when using k9s: the Docker containers had connectivity, but couldn’t resolve names. We changed the `resolv.conf` on one of the containers to point to 8.8.8.8 and it could resolve fine. At that point the containers started to appear as `Running`.
I don’t have any comments about https://github.com/kubewarden/kubewarden-controller/issues/110#issuecomment-1035524285, to be honest.
My opinion is that it’s a really hard problem to solve, because we don’t know what is wrong and it’s not easy to reproduce and track.
There is also the possibility that different problems share the same symptom: a myriad of problems can all end in a single `connection refused` error. Is something wrong on the client? On the server? Somewhere in between? What exactly, where, how…? You always see the same `connection refused` error, but literally a myriad of reasons can produce it. In my opinion there is not enough information and evidence to tackle this problem properly.
I vote not to tackle it, given that the impact is extremely small. If the situation changes, its priority will rise and we will be in a better position to debug, identify, and fix it.
Closing the card as discussed in the planning meeting, as its scope is already achieved. Work continues in https://github.com/kubewarden/kubewarden-controller/issues/142.
Good catch @jvanz 👏
I think we should implement a readiness probe
I’m taking a look at the `controller-runtime` and `kubewarden-controller` code to understand when we mark the controller as ready. It turns out the controller always reports itself as ready: we are using the `Ping` checker, which always returns success. Furthermore, `controller-runtime` starts the health probes in a goroutine before the goroutine that triggers the manager runnables. Thus, if Helm is fast enough, it can run the `post-install` hook before the controller is really ready.
I’m not 100% sure my rationale is right, because I don’t have much experience with Kubernetes controllers, but I think we can improve our health check so that the controller only reports ready after the policy server controller starts running. However, I could not find an easy way to do that with the data available in the manager.
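As a hedged sketch (this uses the plain `controller-runtime` API, not the actual kubewarden-controller wiring), the readiness endpoint could be tied to the webhook server instead of the always-successful `Ping` checker; `StartedChecker()` only succeeds once the webhook server is accepting TLS connections:

```go
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/healthz"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		HealthProbeBindAddress: ":8081",
	})
	if err != nil {
		os.Exit(1)
	}

	// Liveness can stay on the trivial Ping checker: it always returns nil.
	if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
		os.Exit(1)
	}

	// Readiness backed by healthz.Ping reports "ready" immediately, which is the
	// behaviour described above. The webhook server's StartedChecker() instead
	// fails until the server is actually accepting TLS connections, so /readyz
	// only succeeds once the webhooks can serve requests.
	if err := mgr.AddReadyzCheck("readyz", mgr.GetWebhookServer().StartedChecker()); err != nil {
		os.Exit(1)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```

Assuming the Deployment’s readinessProbe points at `/readyz`, the pod would only become Ready, and `helm install --wait` would only proceed, after the webhook endpoint is actually serving.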
I’m ruling out this being a Helm issue, as mentioned before by @viccuad, simply because I could not reproduce it and the Helm bug mentioned should be fixed in the latest Helm versions.