metallb: Webhook issues: InternalError (failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io")

This is an umbrella issue to provide troubleshooting information and to collect all the webhook-related issues:

  • https://github.com/metallb/metallb/issues/1563
  • https://github.com/metallb/metallb/issues/1547
  • https://github.com/metallb/metallb/issues/1540

A very good guide that also applies to MetalLB is https://hackmd.io/@maelvls/debug-cert-manager-webhook. Please note that the service name / webhook name might be slightly different when consuming the Helm charts or the raw manifests.

Given a webhook failure, one must check:

If the MetalLB controller is running and the endpoints of the service are healthy

kubectl get endpoints -n metallb-system
NAME              ENDPOINTS         AGE
webhook-service   10.244.2.2:9443   4h32m
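The webhook is served by the controller pod, so also check that it is Running and Ready (a sketch, assuming the label selectors used by the stock manifests):

kubectl get pods -n metallb-system -l app=metallb,component=controller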

If the caBundle is generated and the configuration is patched properly

To get the caBundle used by the webhooks:

kubectl get validatingwebhookconfiguration metallb-webhook-configuration -ojsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d

To get the caBundle from the secret:

kubectl -n metallb-system get secret webhook-server-cert -ojsonpath='{.data.ca\.crt}' | base64 -d

The caBundle in the webhook configuration and the one in the secret must match. In addition, the raw value returned by kubectl -n metallb-system get secret webhook-server-cert -ojsonpath='{.data.ca\.crt}' must be different from the default dummy one that can be found here: https://github.com/metallb/metallb/blob/93755e13238b0dd9f51f96c2271d5c3792df1ed0/config/crd/crd-conversion-patch.yaml#L15
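A quick way to compare the two, assuming bash and the resource names used above (no diff output means they match):

# Compare the caBundle in the webhook configuration with the one in the secret:
diff \
  <(kubectl get validatingwebhookconfiguration metallb-webhook-configuration \
      -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d) \
  <(kubectl -n metallb-system get secret webhook-server-cert \
      -o jsonpath='{.data.ca\.crt}' | base64 -d) \
  && echo "caBundle matches"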

Test if the service is reachable from the apiserver node

Find the webhook service cluster ip:

kubectl get service -n metallb-system  webhook-service
NAME              TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
webhook-service   ClusterIP   10.96.50.216   <none>        443/TCP   4h15m

Fetch the caBundle:

kubectl -n metallb-system get secret webhook-server-cert -ojsonpath='{.data.ca\.crt}' | base64 -d > caBundle.pem

Move the caBundle.pem file to a node, and from the node try to curl the service, mapping the service FQDN to the service's ClusterIP (in this case, 10.96.50.216):

curl --cacert ./caBundle.pem --resolve webhook-service.metallb-system.svc:443:10.96.50.216 https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool

The expected result is the webhook complaining about missing content:

{"response":{"uid":"","allowed":false,"status":{"metadata":{},"message":"contentType=, expected application/json","code":400}}}

That response nevertheless confirms the certificate is valid and the webhook is reachable.
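You can also verify the served certificate against the bundle directly (a sketch using the ClusterIP from above; look for "Verify return code: 0 (ok)" in the output):

openssl s_client -connect 10.96.50.216:443 \
  -servername webhook-service.metallb-system.svc \
  -CAfile ./caBundle.pem </dev/null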

In case the connection times out:

Follow the instructions at https://hackmd.io/@maelvls/debug-cert-manager-webhook#Error-2-io-timeout.

Use tcpdump on port 443 to see if the traffic from the apiserver is directed to the endpoint (the controller pod’s ip in this case).
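For example (a sketch; 443 is the service port and 9443 the endpoint port that kube-proxy DNATs it to, per the outputs above):

# Run on the apiserver node and/or the node hosting the controller pod,
# then re-trigger the failure, e.g. by applying an IPAddressPool:
sudo tcpdump -ni any 'tcp port 443 or tcp port 9443'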

How to disable the webhook

A very quick workaround is to disable the webhook by changing its failurePolicy to Ignore.
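For example, with the manifest-based install (a sketch; the configuration name may differ with Helm, and you may want to list the webhook names first to pick the right index):

# List the webhooks in the configuration:
kubectl get validatingwebhookconfiguration metallb-webhook-configuration \
  -o jsonpath='{range .webhooks[*]}{.name}{"\n"}{end}'

# Set failurePolicy=Ignore on the failing webhook (index 0 here):
kubectl patch validatingwebhookconfiguration metallb-webhook-configuration \
  --type=json \
  -p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'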

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 54 (6 by maintainers)

Most upvoted comments

To resolve this issue, you can try the following steps. First, verify that the webhook service is deployed and running in the metallb-system namespace; you can check its status with:

kubectl get pods -n metallb-system

Look for a pod with a name similar to metallb-controller-xxxxx to confirm it's running. If the webhook service is not running, you may need to redeploy it. You can do this by deleting the existing webhook resources and letting them be recreated:

kubectl delete validatingwebhookconfigurations.admissionregistration.k8s.io metallb-webhook-config

If you get an error, try the same command with the full configuration name instead:

kubectl delete validatingwebhookconfigurations.admissionregistration.k8s.io metallb-webhook-configuration

After all of this, you can apply metallb-adrpool.yaml and metallb-12.yaml.

This is the content of my YAML files. metallb-adrpool.yaml:

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: first-pool
  namespace: metallb-system
spec:
  addresses:
  - 10.1.81.151-10.1.81.155

kubectl apply -f metallb-adrpool.yaml

metallb-12.yaml:

apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: example
  namespace: metallb-system

kubectl apply -f metallb-12.yaml

My secret and webhook don't have any values for caBundle and I get this error. It's a clean deployment to a new cluster.

UPDATE: I deleted the controller pod and both the secret and the webhook certs were recreated.

UPDATE 2: A few hours later, both the secret and the webhook had no value for the caBundle again…

I suspect this is due to using ArgoCD via Helm and Kustomize. What is the best way to exclude these resources from ArgoCD syncing when rendering via those tools? Any help would be appreciated.

I fixed this issue by scheduling the controller pod to the master node. Follow these steps to force pod deployment to the master node: https://stackoverflow.com/questions/41999756/how-to-force-pods-deployments-to-master-nodes

This is what fixed it for me!

kubectl edit deploy metallb-controller [-n <namespace-if-you-have-one>]

Add

nodeName: <kubernetes master>

to the pod spec https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodename
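The same change as a one-liner (a sketch; replace <master-node-name> with a node from kubectl get nodes, and note the deployment is named plain controller in the manifest-based install):

kubectl -n metallb-system patch deploy metallb-controller --type=merge \
  -p '{"spec":{"template":{"spec":{"nodeName":"<master-node-name>"}}}}'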

Hi, my observation is that it's a bit random whether this occurs or not. Another observation is that, after hitting the issue, if I retry a few minutes later the webhook often succeeds. This suggests that there is some race involved.

Pure speculation, but could the API server initially cache an invalid cert from before the caBundle injection, with the wait causing that cache to expire and the newly injected certificate to be loaded?

I fixed this issue by scheduling the controller pod to the master node. Follow these steps to force pod deployment to the master node: https://stackoverflow.com/questions/41999756/how-to-force-pods-deployments-to-master-nodes

I had the same issue in minikube with a 1.25 cluster during LB externalIP allocation. After the MetalLB deployment via manifest completed, I passed all the checks described in the post and got the valid result: {"response":{"uid":"","allowed":false,"status":{"metadata":{},"message":"contentType=, expected application/json","code":400}}}
As Paul said above in https://github.com/metallb/metallb/issues/1597#issuecomment-1271516693, I deleted the controller pod and the certificate problem was gone.
Looks like some sort of race condition.

Then I hit a problem with my old IP pools manifest. Even though the logs said it was only a deprecation (W1206 22:18:45.072864 1 warnings.go:70] metallb.io v1beta1 AddressPool is deprecated, consider using IPAddressPool),
I decided to replace the old one:

apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
      - name: default
        protocol: layer2
        addresses:
          - "172.17.255.1-172.17.255.255"

with the new one:

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default
  namespace: metallb-system
spec:
  addresses:
    - "172.17.255.1-172.17.255.255"
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default
  namespace: metallb-system

which solved my problem.

We started hitting this issue in our CI after migrating to v0.13.5. We're using a kind cluster, v1.21. This code worked before:

kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.12.1/manifests/namespace.yaml
kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.12.1/manifests/metallb.yaml
kubectl -n metallb-system wait deploy/controller --timeout=90s --for=condition=Available
kubectl apply -f ./metal_lb_cm.yaml

I noticed that in kind, speaker has CreateContainerConfigError initially:

metallb-system       controller-6846c94466-bn2qx                0/1     Running                      0          15s
metallb-system       speaker-6t6x8                              0/1     CreateContainerConfigError   0          15s
metallb-system       speaker-6t6x8                              0/1     Running                      0          17s
metallb-system       controller-6846c94466-bn2qx                1/1     Running                      0          20s
metallb-system       speaker-6t6x8                              1/1     Running                      0          30s

The workaround that seems to be working so far for us is waiting until all the pods (speaker & controller) are ready:

kubectl -n metallb-system wait pod --all --timeout=90s --for=condition=Ready
kubectl -n metallb-system wait deploy controller --timeout=90s --for=condition=Available
kubectl -n metallb-system wait apiservice v1beta1.metallb.io --timeout=90s --for=condition=Available
kubectl apply -f ./metal_lb_addrpool.yaml

our metal_lb_addrpool.yaml:

# MetalLB config
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: example
  namespace: metallb-system
spec:
  addresses:
  - 172.18.255.200-172.18.255.255
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: empty
  namespace: metallb-system

Posting as an FYI, in case anyone is looking for a workaround in kind.

I fixed this issue by scheduling the controller pod to the master node. Follow these steps to force pod deployment to the master node: https://stackoverflow.com/questions/41999756/how-to-force-pods-deployments-to-master-nodes

This is what fixed it for me!

kubectl edit deploy metallb-controller [-n <namespace-if-you-have-one>]

Add

nodeName: <kubernetes master>

to the pod spec https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodename

For me it also worked. But why?

Hello! I don't know if this will be of help to anyone, but I had this same problem, and it was because the firewall was enabled and blocked some ports. By disabling the firewall on all nodes I managed to solve it. I am now looking at which ports to add as exceptions in the rules.
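For reference, a sketch assuming firewalld; the webhook endpoint listens on 9443 by default (see the endpoints output in the first post):

# Allow the webhook endpoint port on every node that can host the controller:
sudo firewall-cmd --permanent --add-port=9443/tcp
sudo firewall-cmd --reload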

I've been automating the deployment of a k3s cluster with Ansible:

- name: Add metallb repo
  kubernetes.core.helm_repository:
    name: metallb
    repo_url: "https://metallb.github.io/metallb"

- name: install Metallb
  kubernetes.core.helm:
    name: metallb
    chart_ref: metallb/metallb
    namespace: metallb-system
    create_namespace: true

- name: Create IP pool
  kubernetes.core.k8s:
    definition: "{{ lookup('template', 'pool.yaml.j2') | from_yaml_all | list }}"

- name: Create L2 announcement
  kubernetes.core.k8s:
    definition: "{{ lookup('template', 'announcement.yaml') | from_yaml_all | list }}"

And these are the templates:

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: dev-pool
  namespace: metallb-system
spec:
  addresses:
  - {{ metallb_addresses }}
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: l2-announcement
  namespace: metallb-system

The initial deployment of a fresh cluster always fails at the "Create IP pool" task with this error:

fatal: [localhost]: FAILED! => {"changed": false, "error": 500, "msg": "IPAddressPool dev-pool: Failed to create object: b'{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"Internal error occurred: failed calling webhook \\\\\"ipaddresspoolvalidationwebhook.metallb.io\\\\\": failed to call webhook: Post \\\\\"https://metallb-webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s\\\\\": no endpoints available for service \\\\\"metallb-webhook-service\\\\\"\",\"reason\":\"InternalError\",\"details\":{\"causes\":[{\"message\":\"failed calling webhook \\\\\"ipaddresspoolvalidationwebhook.metallb.io\\\\\": failed to call webhook: Post \\\\\"https://metallb-webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s\\\\\": no endpoints available for service \\\\\"metallb-webhook-service\\\\\"\"}]},\"code\":500}\\n'", "reason": "Internal Server Error", "status": 500}

But it doesn't fail on the second and subsequent reruns of the playbook. I didn't specify a chart version, so it should be the latest. Hope this provides more info for figuring out what the issue might be. Generally it has been random; I implemented a 30-second pause in case I actually needed to wait a bit, but sometimes it works and other times it doesn't.
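A sketch of a possible fix, assuming the kubernetes.core collection used above and the metallb-controller deployment name the chart produces for a release named metallb: make the install task block until the chart's resources are ready, then wait for the controller (and thus the webhook endpoints) before creating the pool.

- name: install Metallb
  kubernetes.core.helm:
    name: metallb
    chart_ref: metallb/metallb
    namespace: metallb-system
    create_namespace: true
    wait: true  # equivalent of helm --wait

- name: Wait for the MetalLB controller to be available
  kubernetes.core.k8s_info:
    kind: Deployment
    name: metallb-controller
    namespace: metallb-system
    wait: true
    wait_condition:
      type: Available
      status: "true"
    wait_timeout: 120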

I managed to fix my error "Service Unavailable",

Internal error occurred: failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": failed to call webhook: Post "https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s": Service Unavailable

by modifying /etc/kubernetes/manifests/kube-apiserver.yaml:

  • add .svc to the no_proxy env var
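A sketch of that edit, assuming a kubeadm-style setup where the proxy exclusions are env vars in the static pod manifest (as in the next comment); the kubelet restarts the apiserver automatically when the file changes:

# Append .svc to every proxy-exclusion value (matches the no_proxy and
# NO_PROXY lines shown below):
sudo sed -i '/value: localhost,127.0.0.1/ s/$/,.svc/' \
  /etc/kubernetes/manifests/kube-apiserver.yaml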

Same problem here:

# kubectl apply -f IPAddressPool.yaml
Error from server (InternalError): error when creating "IPAddressPool.yaml": Internal error occurred: failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": failed to call webhook: Post "https://metallb-webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s": EOF
Error from server (InternalError): error when creating "IPAddressPool.yaml": Internal error occurred: failed calling webhook "l2advertisementvalidationwebhook.metallb.io": failed to call webhook: Post "https://metallb-webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-l2advertisement?timeout=10s": EOF

I found that I have a proxy config:

# cat /etc/kubernetes/manifests/kube-apiserver.yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
........
    env:
    - name: no_proxy
      value: localhost,127.0.0.1,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,172.17.0.1,.svc.cluster.local,apiserver.cluster.local,100.64.0.0/10
    - name: ftp_proxy
      value: http://192.168.72.1:7890
    - name: https_proxy
      value: http://192.168.72.1:7890
    - name: NO_PROXY
      value: localhost,127.0.0.1,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,172.17.0.1,.svc.cluster.local,apiserver.cluster.local,100.64.0.0/10
    - name: FTP_PROXY
      value: http://192.168.72.1:7890
    - name: HTTPS_PROXY
      value: http://192.168.72.1:7890
    - name: HTTP_PROXY
      value: http://192.168.72.1:7890
    - name: http_proxy
      value: http://192.168.72.1:7890

When I deleted them, everything was OK:

# kubectl apply -f IPAddressPool.yaml
ipaddresspool.metallb.io/first-pool created
l2advertisement.metallb.io/l2 created


The warning is given because the controllers are watching the resources, and they are marked as deprecated. Nothing to worry about.

Experiencing the same issue with webhook failure. MicroK8s v1.25.2, revision 4055. Is there a quick workaround to turn off webhooks? Thank you.

It's in the first post. Also, if you are using Helm to deploy, you can use this value from the last release: https://github.com/metallb/metallb/blob/main/charts/metallb/values.yaml#L332