cert-manager: Documenting "context deadline exceeded" errors relating to the webhook

📢 Update from the cert-manager maintainers: For those of you encountering problems with the cert-manager webhook, please read @maelvls 's Definitive Debugging Guide for the cert-manager Webhook Pod.

Describe the bug:

When I try to create a ClusterIssuer I get the following error

kubectl apply -f cert-issuer-letsencrypt-dev.yml
Error from server (InternalError): error when creating "cert-issuer-letsencrypt-dev.yml":
Internal error occurred: failed calling webhook "webhook.certmanager.k8s.io": 
Post https://kubernetes.default.svc:443/apis/webhook.certmanager.k8s.io/v1beta1/mutations?timeout=30s: 
context deadline exceeded

Expected behaviour:

Creation of the ClusterIssuer works without errors.

Steps to reproduce the bug:

Install cert-manager as follows

kubectl apply -f https://raw.githubusercontent.com/jetstack/cert-manager/release-0.10/deploy/manifests/00-crds.yaml
kubectl create namespace cert-manager
kubectl label namespace cert-manager certmanager.k8s.io/disable-validation=true
helm repo add jetstack https://charts.jetstack.io
helm repo update

helm install \
  --name cert-manager \
  --namespace cert-manager \
  --version v0.10.1 \
  jetstack/cert-manager

Then I apply the following ClusterIssuer manifest:

apiVersion: certmanager.k8s.io/v1alpha1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dev
  namespace: cert-manager
spec:
  acme:
    # The ACME server URL
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    # Email address used for ACME registration
    email: xxx@xxxx.com
    # Name of a secret used to store the ACME account private key
    privateKeySecretRef: 
      name: letsencrypt-dev
    # Enable the HTTP-01 challenge provider
    # http01: {}
    solvers:
    - dns01:
        cloudflare:
          email: xxxx@xxxx.com
          apiKeySecretRef:
            name: cloudflare-api-key-secret
            key: api-key

Anything else we need to know?:

Environment details:

  • Kubernetes version (e.g. v1.10.2): 1.15
  • Cloud-provider/provisioner (e.g. GKE, kops AWS, etc): baremetal
  • cert-manager version (e.g. v0.4.0): 0.10.1
  • Install method (e.g. helm or static manifests): helm

/kind bug

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 25
  • Comments: 83 (4 by maintainers)

Most upvoted comments

Nope, still stuck and this sucks.

I can confirm that I have exactly the same issue. My environment:

  • Kubernetes: EKS v1.16.8
  • CNI: Calico
  • cert-manager: v0.15.1, installed using Helm

I’m getting errors like:

Error from server (InternalError): error when creating "ClusterIssuerDns.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: context deadline exceeded

Error from server (InternalError): error when creating "ClusterIssuerDns.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: Address is not allowed

I’m not 100% sure, but I suspect the issue is with the connection from the API server to the webhook (Calico creates a new subnet, and I'm not sure the API server is able to reach it)…

Why is this issue closed? I am facing the same issue with the Helm chart of cert-manager v1.1.0.

main.go:38] cert-manager "msg"="error executing command" "error"="listen tcp :10250: bind: address already in use"

@mostafa8026 seems to have corrected the issue by changing port 10250. Why is this even a port issue?
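
For context, 10250 is also the kubelet's default port, so once the webhook runs with hostNetwork: true its default secure port of 10250 collides with the kubelet on that node, which is exactly the "bind: address already in use" error above. A quick way to confirm what already holds the port on the node (a sketch; if ss is not available, netstat -ltnp works too):

# On the node where the webhook pod is scheduled: see which process is already bound to 10250
sudo ss -ltnp | grep ':10250'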

Any updates on this issue?

Error from server (InternalError): error when creating "test-resources.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: context deadline exceeded

And my solution was making these changes in the YAML file:

Add hostNetwork: true to the webhook pod spec, and change securePort and the related ports to something other than 10250 (like 10666, which I chose 😄; also don't forget to change the related Service). Here are the changes:

...
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: webhook
    app.kubernetes.io/component: webhook
    app.kubernetes.io/instance: cert-manager
    app.kubernetes.io/name: webhook
  name: cert-manager-webhook
  namespace: cert-manager
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: webhook
      app.kubernetes.io/instance: cert-manager
      app.kubernetes.io/name: webhook
  template:
    metadata:
      labels:
        app: webhook
        app.kubernetes.io/component: webhook
        app.kubernetes.io/instance: cert-manager
        app.kubernetes.io/name: webhook
    spec:
      hostNetwork: true
      containers:
      - args:
        - --v=2
        - --secure-port=10666
        - --dynamic-serving-ca-secret-namespace=$(POD_NAMESPACE)
        - --dynamic-serving-ca-secret-name=cert-manager-webhook-ca
        - --dynamic-serving-dns-names=cert-manager-webhook,cert-manager-webhook.cert-manager,cert-manager-webhook.cert-manager.svc,$(NODE_NAME)
        env:
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        image: quay.io/jetstack/cert-manager-webhook:v1.1.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /livez
            port: 6080
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: cert-manager
        ports:
        - containerPort: 10666
          name: https
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 6080
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        resources: {}
      serviceAccountName: cert-manager-webhook
---
...
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: webhook
    app.kubernetes.io/component: webhook
    app.kubernetes.io/instance: cert-manager
    app.kubernetes.io/name: webhook
  name: cert-manager-webhook
  namespace: cert-manager
spec:
  ports:
  - name: https
    port: 443
    targetPort: 10666
  selector:
    app.kubernetes.io/component: webhook
    app.kubernetes.io/instance: cert-manager
    app.kubernetes.io/name: webhook
  type: ClusterIP
...
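
If cert-manager was installed with Helm rather than the static manifest, the same change can usually be made through chart values instead of hand-editing the Deployment. A sketch, assuming your chart version exposes the webhook.hostNetwork and webhook.securePort values (check the values.yaml of your chart version):

helm upgrade cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --reuse-values \
  --set webhook.hostNetwork=true \
  --set webhook.securePort=10666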

Hello,

Just wanted to let everyone know that I have it working now. Some information on our cluster:

Cluster Version: 1.17.7
CNI: Flannel, VXLAN
Provisioner: kubeadm
Cert-Manager version: 0.11.1

What worked for me is the following guide here: https://docs.cert-manager.io/en/release-0.11/getting-started/install/kubernetes.html

It is absolutely important that nothing is lingering around from your old deployment. Run kubectl get crd and delete all (new and old) cert-manager CRDs.

Run kubectl get apiservice and make sure there is nothing related to certificates.

Running kubectl get cert or kubectl get clusterissuer should say something along the lines of "This resource type does not exist" (I don't have the exact error, but you get the point).
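
A rough sketch of that cleanup, assuming the old 0.10 resources live under the certmanager.k8s.io API group and the new ones under cert-manager.io (review what the grep matches before deleting anything):

# List leftover cert-manager CRDs from the old and new API groups
kubectl get crd -o name | grep -E 'certmanager\.k8s\.io|cert-manager\.io'

# Delete whatever matched
kubectl get crd -o name | grep -E 'certmanager\.k8s\.io|cert-manager\.io' | xargs kubectl delete

# Make sure no certificate-related APIService objects remain
kubectl get apiservice | grep -i cert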

Great. Now install the 0.11.1 CRDs:

kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v0.11.1/cert-manager.yaml

Now install cert-manager 0.11.1. Make sure you install 0.11.1, not 0.11.0… That version doesn’t seem to work either.

Great. Now make sure your ClusterIssuers and Certificates are using apiVersion: cert-manager.io/v1alpha2.

My Suspicions:

When installing 1.15.11, looking at the kube-apiserver logs, it appears that it's trying to communicate with the webhook service using its DNS name (https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s). As I said above, this short version of the DNS name does not work for some reason. Maybe it's a kubeadm thing.

When using 0.11.1, it tries to communicate via IP address instead, and I suppose this is what is making it work.

Something important that I found during my research: the kube-apiserver can't actually resolve cluster DNS. Its /etc/resolv.conf is inherited from the master node. This is designed intentionally, because apparently the kube-apiserver is the source of truth for DNS.

Something I don't understand: from a node, why can't you ping a Service ClusterIP? You can do it for any pod on any node, but not for Services. So I don't get how the kube-apiserver is making calls to the webhook.
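
One way to see how the API server is being told to reach the webhook (a Service reference it resolves itself, or a fixed URL) is to look at the clientConfig section of the webhook configurations. A sketch, assuming the default object name used by the cert-manager manifests:

kubectl get mutatingwebhookconfiguration cert-manager-webhook -o yaml
kubectl get validatingwebhookconfiguration cert-manager-webhook -o yaml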

Sorry for rambling. Please let me know if you’re still struggling. I can try to help.

I am having the same issues as @andrewkaczynski

The haiku about DNS is true:

It's not DNS
There's no way it's DNS
It was DNS

So, for those who are in this boat and are confused as heck by it: check that you can run DNS queries from inside your pods. "Context deadline exceeded" in my case indicated the pod couldn't look up the ACME API endpoint.
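
A minimal way to run that check from inside the cluster, using a throwaway busybox pod (the pod name and image tag here are just examples):

# External lookup: the ACME staging endpoint used in the issuer above
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.28 -- \
  nslookup acme-staging-v02.api.letsencrypt.org

# Internal lookup: the webhook Service itself
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.28 -- \
  nslookup cert-manager-webhook.cert-manager.svc.cluster.local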

Also, in my case the underlying problem was that all outbound traffic from pods to the internet was being blocked.

Furthermore, for those who land here because cert-manager is the first thing they set up on a cluster and are bitten by this: k3s on Debian 10 requires you to use the legacy iptables command (this may apply to other k8s distros, but it definitely applies to k3s): https://github.com/coredns/coredns/issues/2693
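
For reference, switching Debian 10 back to the legacy iptables backend is done with update-alternatives; a sketch (run on every node, then restart k3s or the kubelet so the rules are rebuilt):

sudo update-alternatives --set iptables /usr/sbin/iptables-legacy
sudo update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy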

I still submit that "context deadline exceeded" is a poor error message and that something more helpful here would be good.

I resolved my issue. May not apply to everyone, but still.

During cert creation, the API server accesses the webhook. But in my case, the API server cannot access pods in the overlay network. So I have the webhook running in hostNetwork mode. Now the error is gone.

Any updates on this issue? I'm also seeing this and don't know what causes it. Still looking for a solution.

Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)

My solution was to download the cert-manager manifest (e.g. https://github.com/jetstack/cert-manager/releases/download/v1.1.0/cert-manager.yaml), insert the following block after each "containers:" declaration (at the same indentation, i.e. at the pod-spec level), and apply it:

      dnsConfig:
        nameservers:
          - 8.8.8.8
          - 8.8.4.4
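
If you go this route, you can check that the extra nameservers actually landed in the running pod spec (a quick check using the chart's app=webhook label):

kubectl -n cert-manager get pod -l app=webhook -o jsonpath='{.items[0].spec.dnsConfig}{"\n"}'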

I resolved my issue. May not apply to everyone, but still.

During cert creation, the API server accesses the webhook. But in my case, the API server cannot access pods in the overlay network. So I have the webhook running in hostNetwork mode. Now the error is gone.

How did you install? I am trying helm install, with Weave, on EKS, and I am getting the same errors: failed calling webhook "webhook.cert-manager.io". The chart has hostNetwork set to false, and it seems most of the instructions on how to get it to work use older versions. I tried forking the chart and making the change, but then there was some kind of image dependency. What was your method?

For me it was an issue with debian 10 and iptables, see here: https://discuss.kubernetes.io/t/kubernetes-compatible-with-debian-10-buster/7853

I have the same issue with:

  • Kubernetes version (e.g. v1.10.2): v1.17.0
  • Cloud-provider/provisioner (e.g. GKE, kops AWS, etc): baremetal (openstack queen)
  • cert-manager version (e.g. v0.4.0): 0.13.1
  • Install method (e.g. helm or static manifests): helm 2.16.3

I think it is a timing problem, because it was working when I installed it "manually", but not anymore from my Terraform configuration script. It's quite rough, but I resolved it by adding a delay between the install and applying the issuer:

sudo helm install \
  --name cert-manager \
  --namespace cert-manager \
  --version v0.13.1 \
  jetstack/cert-manager
sleep 1m
kubectl apply -f staging-issuer.yaml
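
Instead of a fixed sleep, waiting for the webhook Deployment to finish rolling out is a little more robust (a sketch using plain kubectl, assuming the default deployment name):

kubectl -n cert-manager rollout status deployment/cert-manager-webhook --timeout=120s
kubectl apply -f staging-issuer.yaml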

Unfortunately restarting the pod didn’t fix the issue for me.

Edit: Well, never mind. I restarted all the cert-manager related pods and now it works. Strange.


I am also seeing the same issue. I have a Charmed Kubernetes cluster running with Flannel and Calico. I changed the deployment config to use hostNetwork and changed the port, and nothing. Events:

Type     Reason                 Age                From          Message
----     ------                 ----               ----          -------
Warning  ErrVerifyACMEAccount   33s (x4 over 73s)  cert-manager  Failed to verify ACME account: context deadline exceeded
Warning  ErrInitIssuer          33s (x4 over 73s)  cert-manager  Error initializing issuer: context deadline exceeded

My DNS is working, because I can access Let's Encrypt from any pod, and the challenge seems to have worked.

I'm seeing this on fresh AKS clusters, right after installing cert-manager, when I create an Issuer. The strange thing is that two out of three clusters have this issue, but one of them doesn't, although they are provisioned in the same way.

Starting: kubectl apply LetsEncrypt Staging Issuer
==============================================================================
Task         : Kubectl
Description  : Deploy, configure, update a Kubernetes cluster in Azure Container Service by running kubectl commands
Version      : 1.181.0
Author       : Microsoft Corporation
Help         : https://aka.ms/azpipes-kubectl-tsg
==============================================================================
==============================================================================
			Kubectl Client Version: v1.21.0
			Kubectl Server Version: v1.18.14
==============================================================================
/opt/hostedtoolcache/kubectl/1.21.0/x64/kubectl apply -f /home/vsts/work/_temp/kubectlTask/1620812300598/inlineconfig.yaml -o json
Error from server (InternalError): error when creating "/home/vsts/work/_temp/kubectlTask/1620812300598/inlineconfig.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s: context deadline exceeded
##[error]Error from server (InternalError): error when creating "/home/vsts/work/_temp/kubectlTask/1620812300598/inlineconfig.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s: context deadline exceeded
commandOutput
##[error]The process '/opt/hostedtoolcache/kubectl/1.21.0/x64/kubectl' failed with exit code 1
Finishing: kubectl apply LetsEncrypt Staging Issuer

The Issuer I'm adding that triggers the error is:

apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: letsencrypt-issuer
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: email@host.com
    privateKeySecretRef:
      name: letsencrypt-private-key
    solvers:
      - http01:
          ingress:
            class: nginx

We are running a fresh install of 1.17.7 via kubeadm, using the flannel VXLAN CNI. We’re also seeing the following error:

Error from server (InternalError): error when creating "./selfsigned-issuer.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: context deadline exceeded

Doing some googling, I believe this is due to DNS. If I exec onto one of my NGINX pods (using this pod arbitrarily, nothing special about it) and try to resolve the above address, I get this:

nslookup cert-manager-webhook.cert-manager.svc
Server:         10.96.0.10
Address:        10.96.0.10:53

** server can't find cert-manager-webhook.cert-manager.svc: NXDOMAIN
** server can't find cert-manager-webhook.cert-manager.svc: NXDOMAIN

However, when you append .cluster.local (the full domain name), it’ll resolve just fine:

bash-5.0$ nslookup cert-manager-webhook.cert-manager.svc.cluster.local
Server:         10.96.0.10
Address:        10.96.0.10:53


Name:   cert-manager-webhook.cert-manager.svc.cluster.local
Address: 10.111.57.175

And as you can see, this is the ClusterIP of my webhook Service:

NAMESPACE              NAME                               TYPE        CLUSTER-IP
cert-manager           cert-manager-webhook               ClusterIP   10.111.57.175

So this is where I get confused… I read that .svc is the equivalent (or rather, the short version) of .svc.cluster.local. Why is it not working? Is this configurable? Reading a different issue, someone had to re-create their cluster and supply some DNS options to KubeSpray. However, I'm not using KubeSpray.
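
The short .svc form only resolves because of the search list in each pod's /etc/resolv.conf, so it is worth checking what that file actually contains (a sketch using a throwaway busybox pod; the "search" line should include svc.cluster.local and cluster.local):

kubectl run resolv-test --rm -it --restart=Never --image=busybox:1.28 -- cat /etc/resolv.conf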

Appreciate any help, thanks.

Edit: Here is the other issue I was referring to: https://github.com/jetstack/cert-manager/issues/2640

Folks, I tracked it down to using --service-dns-domain="k8.example.com" in my kubeadm init.

@papanito I checked coredns and then it worked.

Can you please explain what you did with coredns to correct the problem?

And the pod seems to be working:

/ # curl -vk https://10.32.0.4:10250/mutate?timeout=30s
*   Trying 10.32.0.4:10250...
* TCP_NODELAY set
* Connected to 10.32.0.4 (10.32.0.4) port 10250 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: O=cert-manager.system
*  start date: Feb 18 13:54:21 2020 GMT
*  expire date: Feb 17 13:54:21 2021 GMT
*  issuer: O=cert-manager.system; CN=cert-manager.webhook.ca
*  SSL certificate verify result: unable to get local issuer certificate (20), continuing anyway.
> GET /mutate?timeout=30s HTTP/1.1
> Host: 10.32.0.4:10250
> User-Agent: curl/7.67.0
> Accept: */*
> 
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* Mark bundle as not supporting multiuse
< HTTP/1.1 400 Bad Request
< Date: Tue, 18 Feb 2020 14:57:14 GMT
< Content-Length: 0
< 
* Connection #0 to host 10.32.0.4 left intact

I somehow managed to work around this issue by downgrading to v0.11. Everything seems to be working properly. https://docs.cert-manager.io/en/release-0.11/

@javachen so I guess my guess was wrong then 😦

@papanito CentOS 7, Helm 3, Kubernetes 1.17.2