metallb: Webhook issues: InternalError (failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io")
This is an umbrella issue to try to provide troubleshooting information and to collect all the webhook related issues:
https://github.com/metallb/metallb/issues/1563 https://github.com/metallb/metallb/issues/1547 https://github.com/metallb/metallb/issues/1540
A very good guide that also applies to metallb is https://hackmd.io/@maelvls/debug-cert-manager-webhook. Please note that the service name / webhook name might be slightly different when consuming the helm charts or the manifests.
Given a webhook failure, one must check
if the metallb controller is running and the endpoints of the service are healthy
kubectl get endpoints -n metallb-system
NAME ENDPOINTS AGE
webhook-service 10.244.2.2:9443 4h32m
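To confirm the controller itself is running, you can also check the pods and the deployment rollout (the deployment is named controller with the raw manifests; the Helm chart prefixes the release name):
kubectl get pods -n metallb-system
kubectl -n metallb-system rollout status deployment controller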
If the caBundle is generated and the configuration is patched properly
To get the caBundle used by the webhooks:
kubectl get validatingwebhookconfiguration metallb-webhook-configuration -ojsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d
To get the caBundle from the secret:
kubectl -n metallb-system get secret webhook-server-cert -ojsonpath='{.data.ca\.crt}' | base64 -d
The caBundle in the webhook configuration and in the secret must match, and the raw version you get from kubectl -n metallb-system get secret webhook-server-cert -ojsonpath='{.data.ca\.crt}'
must be different from the default dummy one that can be found here: https://github.com/metallb/metallb/blob/93755e13238b0dd9f51f96c2271d5c3792df1ed0/config/crd/crd-conversion-patch.yaml#L15
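A quick way to compare the two values (assuming the default object names created by the manifests; adjust for the Helm chart):
diff \
  <(kubectl get validatingwebhookconfiguration metallb-webhook-configuration -ojsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d) \
  <(kubectl -n metallb-system get secret webhook-server-cert -ojsonpath='{.data.ca\.crt}' | base64 -d) \
  && echo "caBundle matches"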
Test if the service is reachable from the apiserver node
Find the webhook service cluster ip:
kubectl get service -n metallb-system webhook-service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
webhook-service ClusterIP 10.96.50.216 <none> 443/TCP 4h15m
Fetch the caBundle:
kubectl -n metallb-system get secret webhook-server-cert -ojsonpath='{.data.ca\.crt}' | base64 -d > caBundle.pem
Move the caBundle.pem file to a node, and from the node try to curl the service, providing the resolution from the service fqdn to the service’s clusterIP (in this case, 10.96.50.216):
curl --cacert ./caBundle.pem --resolve webhook-service.metallb-system.svc:443:10.96.50.216 https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool
The expected result is the webhook complaining about missing content:
{"response":{"uid":"","allowed":false,"status":{"metadata":{},"message":"contentType=, expected application/json","code":400}}}
But that will guarantee the certificate is valid.
In case the connection times out:
The instructions at https://hackmd.io/@maelvls/debug-cert-manager-webhook#Error-2-io-timeout can be followed.
Use tcpdump on port 443 to see if the traffic from the apiserver is directed to the endpoint (the controller pod’s ip in this case).
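For example, on the node hosting the controller pod (the pod IP and target port here are taken from the endpoint shown earlier, 10.244.2.2:9443; the service port is 443 before kube-proxy translation):
tcpdump -i any -nn 'host 10.244.2.2 and (port 443 or port 9443)'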
How to disable the webhook
A very quick workaround is to disable the webhook by setting its failurePolicy to Ignore (one way to patch this is sketched below).
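A sketch of that patch, assuming the default configuration name; this changes only the first webhook entry, so repeat it (or kubectl edit) for the remaining entries:
kubectl patch validatingwebhookconfiguration metallb-webhook-configuration \
  --type=json \
  -p '[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'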
To resolve this issue, you can try the following steps: Verify that the webhook service is deployed and running in the metallb-system namespace. You can use the following command to check the status of the webhook service:
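For example (an assumption, since the webhook is served by the controller pod; the commenter's exact command is not shown):
kubectl get pods -n metallb-system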
Look for a pod with a name similar to metallb-controller-xxxxx to confirm if it’s running. If the webhook service is not running, you may need to redeploy it. You can do this by deleting the existing webhook resources and letting them be recreated. Use the following command:
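The exact command is not shown; one hedged possibility, based on the delete-and-recreate workaround mentioned later in this thread (resource names assume the stock manifests):
# delete the cert secret and the controller pod; the controller regenerates both on restart
kubectl -n metallb-system delete secret webhook-server-cert
kubectl -n metallb-system delete pod -l component=controller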
If you get any error, you may need to change the end of the command above to this:
After all of this, you can apply metallb-adrpool.yaml and metallb-12.yaml.
This is the content of my YAML files, metallb-adrpool.yaml and metallb-12.yaml:
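A minimal sketch of what such a pair of files usually contains (the pool name, address range, and advertisement name here are placeholders, not the commenter's actual values):
# metallb-adrpool.yaml (hypothetical content)
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool
  namespace: metallb-system
spec:
  addresses:
  - 192.168.1.240-192.168.1.250
# metallb-12.yaml (hypothetical content)
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default-l2
  namespace: metallb-system
spec:
  ipAddressPools:
  - default-pool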
My secret and webhook don’t have any values for caBundle, and I get this error. It’s a clean deployment to a new cluster.
UPDATE: I deleted the controller pod and both the secret and the webhook certs were recreated.
UPDATE 2: A few hours later both the secret and the webhook had no value for the caBundle again…
I suspect this is due to using ArgoCD via helm and kustomize. What is the best way to exclude the syncing of these resources from ArgoCD when rendering via those tools? Any help would be appreciated.
This is what fixed it for me!
Add nodeName to the pod spec: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodename
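A sketch of that change in the controller Deployment's pod template (the node name is a placeholder):
# excerpt from the controller Deployment
spec:
  template:
    spec:
      nodeName: control-plane-node-1   # placeholder: pin the controller pod to a specific node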
Hi, my observation is that it’s a bit random whether this occurs or not. Another observation is that, after hitting the issue, if I retry after some minutes the webhook often succeeds. This suggests that there is some race involved.
Pure speculation, but could the API server initially cache an invalid cert before the caBundle injection, with the waiting causing that cache to expire and the newly injected certificate to be loaded?
I fixed this issue by scheduling the controller pod to the master node. Follow these steps to force pod deployment to the master node: https://stackoverflow.com/questions/41999756/how-to-force-pods-deployments-to-master-nodes
I had the same issue in minikube with a 1.25 cluster during LB externalIP allocation. After the metallb deployment via manifest completed, I ran all the checks described in the post and got the valid result:
{"response":{"uid":"","allowed":false,"status":{"metadata":{},"message":"contentType=, expected application/json","code":400}}}
As Paul said above in https://github.com/metallb/metallb/issues/1597#issuecomment-1271516693, I deleted the controller pod and the certificate problem was gone.
Looks like some sort of race condition.
Then I hit a problem with my old IP pools manifest. Even though the logs were only warning about deprecation:
W1206 22:18:45.072864 1 warnings.go:70] metallb.io v1beta1 AddressPool is deprecated, consider using IPAddressPool
I decided to replace the old AddressPool manifest with the new IPAddressPool one (sketched below), which solved my problem.
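Roughly, the replacement looks like this (names and addresses are placeholders, not the original manifests):
# old (deprecated) resource
apiVersion: metallb.io/v1beta1
kind: AddressPool
metadata:
  name: my-pool
  namespace: metallb-system
spec:
  protocol: layer2
  addresses:
  - 192.168.49.100-192.168.49.110
# new resource replacing it
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: my-pool
  namespace: metallb-system
spec:
  addresses:
  - 192.168.49.100-192.168.49.110
Note that with IPAddressPool the layer-2 announcement itself moves to a separate L2Advertisement resource.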
We started hitting this issue in our CI after migrating to v0.13.5. We’re using a kind cluster, v1.21; the code that worked before now fails. I noticed that in kind, speaker has CreateContainerConfigError initially. The workaround that seems to be working so far for us is waiting until all pods (speaker & controller) are ready before applying our metal_lb_addrpool.yaml (a sketch of the wait follows below). Posting as FYI, in case anyone is looking for a workaround in kind.
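A sketch of that wait (the app=metallb label matches both the speaker and controller pods in the stock manifests; the timeout is arbitrary):
kubectl wait pod --namespace metallb-system \
  --for=condition=Ready \
  --selector=app=metallb \
  --timeout=120s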
For me it also worked. But why?
Hello! I do not know if this will help anyone, but I had this same problem, and it was because the firewall was enabled and blocked some ports. By disabling the firewall on all nodes I managed to solve it; I am now looking at which ports to add as exceptions in the rules.
I’ve been automating the deployment of a k3s cluster with Ansible, and these are the templates. The initial deployment of a fresh cluster always fails at the create-IP-pool task with this error, but it doesn’t fail on the second and following reruns of the playbooks. I didn’t specify a chart version, so it should be the latest. Hope this provides more info for figuring out what the issue might be. Generally it has been random, and I implemented a 30-second pause in case I actually needed to wait a bit (a retry sketch follows below). Sometimes it works, other times it doesn’t.
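A hedged sketch of replacing the fixed pause with a retry loop, assuming the pool is applied with the kubernetes.core.k8s module (the file path and timings are placeholders):
- name: Create MetalLB IP address pool, retrying while the webhook comes up
  kubernetes.core.k8s:
    state: present
    src: files/ip-pool.yaml          # placeholder path to the pool manifest
  register: pool_result
  retries: 10                        # retry instead of a fixed 30-second sleep
  delay: 6
  until: pool_result is not failed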
Same problem here. I found I had a proxy config; when I deleted it, everything was OK.
I managed to fix my “Service Unavailable” error by modifying /etc/kubernetes/manifests/kube-apiserver.yaml, adding .svc to the no_proxy env variable (roughly as sketched below).
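An excerpt of what that change can look like in the static pod manifest (only the env entry matters here; the other values are placeholders):
# /etc/kubernetes/manifests/kube-apiserver.yaml (excerpt)
spec:
  containers:
  - name: kube-apiserver
    env:
    - name: no_proxy
      # adding .svc lets webhook calls to cluster-internal services bypass the proxy
      value: ".svc,.svc.cluster.local,localhost,127.0.0.1"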
The warning is given because the controllers are watching the resources, and they are marked as deprecated. Nothing to worry about.
It’s in the first post. Also, if you are using helm to deploy, you can use this value, available as of the latest release: https://github.com/metallb/metallb/blob/main/charts/metallb/values.yaml#L332