cert-manager: FailedDiscoveryCheck (403) with cert-manager Webhook

Describe the bug: I’m trying to deploy an on-prem k8s cluster and I want to use cert-manager for the certificates. When I try to create a ClusterIssuer, it fails with:

Internal error occurred: failed calling webhook "webhook.certmanager.k8s.io": the server is currently unable to handle the request

When I run kubectl get apiservice, it returns the following error: failing or missing response from https://<internal-svc-ip>:443/apis/webhook.certmanager.k8s.io/v1beta1: bad status from https://<internal-svc-ip>:443/apis/webhook.certmanager.k8s.io/v1beta1: 403
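To dig further, something like the following shows the full failure condition and whether the webhook Service has any endpoints (the APIService name below matches the certmanager.k8s.io group shipped with v0.10; adjust it if your install differs):

# Show the APIService's full status, including the failing discovery condition
kubectl get apiservice v1beta1.webhook.certmanager.k8s.io -o yaml
# Check that the webhook pod is running and its Service has endpoints
kubectl -n cert-manager get pods,endpoints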

Expected behaviour: Issuer is created when I run kubectl apply

Steps to reproduce the bug:

  • Create namespace cert-manager
  • Deploy using the manifest YAML
  • Try to create an Issuer following the examples in the documentation; I also tried the test resources

Anything else we need to know?:

Environment details:

  • Kubernetes version (e.g. v1.10.2): 1.15.3

  • Cloud-provider/provisioner (e.g. GKE, kops AWS, etc): on-prem

  • cert-manager version (e.g. v0.4.0): 0.10

  • Install method (e.g. helm or static manifests): static manifest at https://github.com/jetstack/cert-manager/releases/download/v0.10.0/cert-manager.yaml

  • YAML file:

apiVersion: certmanager.k8s.io/v1alpha1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    email: <my-mail>
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      # Secret resource used to store the account's private key.
      name: example-clusterissuer-key
    # Add a single challenge solver, HTTP01 using nginx
    solvers:
    - http01:
        ingress:
          class: nginx

I have also installed nginxinc/kubernetes-ingress.

/kind bug

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 27 (5 by maintainers)

Most upvoted comments

@otakumike sure thing, here it is. Given the logs and error messages, I knew the port had to be 6443 and the source addresses had to be those of the k8s master, hence:

# 1) Retrieve the network tag automatically given to the worker nodes
# NOTE: this only works if you have only one cluster in your GCP project. You will have to manually inspect the result of this command to find the tag for the cluster you want to target
WORKER_NODES_TAG=$(gcloud compute instances list --format='text(tags.items[0])' --filter='metadata.kubelet-config:*' | grep tags | awk '{print $2}' | sort | uniq)

# 2) Take note of the VPC network in which you deployed your cluster
# NOTE this only works if you have only one network in which you deploy your clusters
NETWORK=$(gcloud compute instances list --format='text(networkInterfaces[0].network)' --filter='metadata.kubelet-config:*' | grep networks | awk -F'/' '{print $NF}' | sort | uniq)

# 3) Create the firewall rule targeting the tag above
# NOTE: the source range is the cluster's master (control plane) CIDR; 172.16.0.0/28 is the value for this particular private cluster
gcloud compute firewall-rules create k8s-cert-manager \
  --source-ranges 172.16.0.0/28 \
  --target-tags $WORKER_NODES_TAG \
  --allow TCP:6443 --network $NETWORK
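After the rule is in place, the APIService should flip to Available; a quick way to confirm, reusing the check from later in this thread (the APIService name depends on your cert-manager version: v1beta1.webhook.certmanager.k8s.io for v0.10, v1beta1.webhook.cert-manager.io for newer releases):

# Confirm the webhook APIService becomes Available once the rule is active
kubectl get apiservice v1beta1.webhook.certmanager.k8s.io \
  -o jsonpath='{.status.conditions[?(@.type=="Available")].status}'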

I’m seeing this as well in EKS when trying to use a custom CNI. For metrics-server, I put its deployment on the host network and that resolved the issue:

https://github.com/helm/charts/blob/c4d3dde988271fddf80c00bd9281453202234b9d/stable/metrics-server/templates/metrics-server-deployment.yaml#L38-L40

Can we get something like this for the cert-manager chart? Manually adding this to the webhook deployment after the install makes the APIService become Available:

kubectl get apiservice v1beta1.webhook.cert-manager.io
NAME                              SERVICE                             AVAILABLE   AGE
v1beta1.webhook.cert-manager.io   cert-manager/cert-manager-webhook   True        2d20h
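For reference, the “this” being added is the hostNetwork setting from the linked metrics-server template. A rough equivalent patch for the cert-manager webhook might look like the sketch below; the deployment name cert-manager-webhook and the cert-manager namespace match the static-manifest install, but check your own, and note that hostNetwork requires the webhook’s listen port to be free on the node:

# Sketch: move the webhook pod onto the host network, mirroring the linked
# metrics-server template (verify names against your install)
kubectl -n cert-manager patch deployment cert-manager-webhook \
  -p '{"spec":{"template":{"spec":{"hostNetwork":true,"dnsPolicy":"ClusterFirstWithHostNet"}}}}'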

Just stumbled upon this. It seems to be related to #2340. I also have a private cluster with GKE and adding an ingress firewall rule granting access from the master API CIDR range to port 6443 resolved the issue for me.

This is also documented here

Same problem with kubeadm on AWS. Kubernetes: 1.16.0

Just to clarify, you should not need to create any additional RBAC resources in order to make the webhook work.

Issues like this stem from communication problems between the Kubernetes apiserver and the webhook component, and you can follow the ‘chain’ of communication like so:

  • The webhook runs in the cluster in the cert-manager namespace
  • A Kubernetes APIService resource exposes the webhook as a part of the Kubernetes API
  • A Kubernetes ValidatingWebhookConfiguration resource tells the apiserver to talk to the webhook via the APIService resource (i.e. it loops back and talks to itself) in order to validate resources.

If any part of that communication flow doesn’t work, you’ll see errors as you’ve described.
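A rough way to check each link in that chain (resource names below assume a v0.10 static-manifest install; adjust to your version):

# 1) The webhook pod and its Service/endpoints in the cert-manager namespace
kubectl -n cert-manager get pods,svc,endpoints
# 2) The APIService that exposes the webhook as part of the Kubernetes API
kubectl get apiservice v1beta1.webhook.certmanager.k8s.io
# 3) The ValidatingWebhookConfiguration that points the apiserver back at it
kubectl get validatingwebhookconfigurations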

Typically, and as some people have noted above, this falls down at the second step (‘A Kubernetes APIService resource exposes the webhook as a part of the Kubernetes API’): the Kubernetes apiserver is unable to communicate with the webhook.

This can be caused by many things, but for example, on GKE this is caused by firewall rules blocking communication to the Kubernetes ‘worker’ nodes from the control plane. This is remediated by adding additional firewall rules to grant this permission.

On AWS, it really depends on how you’ve configured your VPCs/security groups and how you’ve configured networking. Notably though, you must configure your control plane so that it can communicate with pod/service IPs from the ‘apiserver’ container/network namespace.

You’ll also run into this issue if you try to deploy metrics-server, as it is deployed in a similar fashion.

@skuro are you using “private GKE nodes” by any chance?

In my case on a fresh GKE cluster (v1.13.7-gke.24) with kubectl (v1.11.1 or v1.14.3) it seems to just be a matter of waiting.

After I first apply the static manifest:

kubectl apply --validate=false -f https://github.com/jetstack/cert-manager/releases/download/v0.10.1/cert-manager.yaml

If I try to create any ClusterIssuer right away, I get:

Error from server (NotFound): error when deleting "cluster/platform/cert-manager/2_issuers.yaml": the server could not find the requested resource (delete clusterissuers.certmanager.k8s.io letsencrypt-staging)
Error from server (NotFound): error when deleting "cluster/platform/cert-manager/2_issuers.yaml": the server could not find the requested resource (delete clusterissuers.certmanager.k8s.io letsencrypt-prod)

This seems to correspond with:

$ kubectl get apiservice v1beta1.webhook.certmanager.k8s.io -o jsonpath='{.status.conditions[?(@.type=="Available")].message}'
endpoints for service/cert-manager-webhook in "cert-manager" have no addresses

But if I wait a few seconds, that eventually changes to:

$ kubectl get apiservice v1beta1.webhook.certmanager.k8s.io -o jsonpath='{.status.conditions[?(@.type=="Available")].message}'
all checks passed

And at that point if I try again to apply my ClusterIssuer manifest it works. This stops me from being able to kubectl apply -Rf my whole cert-manager + issuers manifests in one go.

Isn’t there some way to let me declare everything at once and have the issuers work when they’re ready? Isn’t that the k8s way?

Update: Workaround

This workaround gets it done for me for now:

kubectl apply -Rf cert-manager/manifest.yaml
# work around https://github.com/jetstack/cert-manager/issues/2109
until [ "$(kubectl get apiservice v1beta1.webhook.certmanager.k8s.io -o jsonpath='{.status.conditions[?(@.type=="Available")].status}')" == "True" ];
do echo "Waiting for v1beta1.webhook.certmanager.k8s.io..." && sleep 1
done
kubectl apply -Rf cert-manager/issuers.yaml
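If your kubectl has kubectl wait (v1.11+), the polling loop can likely be replaced with a single wait on the APIService’s Available condition:

# Block until the webhook APIService reports Available=True (or time out)
kubectl wait --for=condition=Available --timeout=120s \
  apiservice/v1beta1.webhook.certmanager.k8s.io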

Looks like it is the same as https://github.com/istio/istio/issues/10637. I build my clusters with Terraform and I was able to solve the linked issue by adding the following security group rule:

resource "aws_security_group_rule" "node_control_plane_https" {
  description              = "Allow HTTPS from control plane to nodes"
  from_port                = 443
  protocol                 = "tcp"
  security_group_id        = aws_security_group.node.id
  source_security_group_id = aws_security_group.control_plane.id
  to_port                  = 443
  type                     = "ingress"
}

I will test later whether this solves this issue here, too.
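For anyone not using Terraform, a rough one-off equivalent of that rule with the AWS CLI looks like this (the group IDs are placeholders for your node and control-plane security groups):

# Allow HTTPS from the control-plane security group to the node security group
aws ec2 authorize-security-group-ingress \
  --group-id <node-sg-id> \
  --protocol tcp --port 443 \
  --source-group <control-plane-sg-id>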