cert-manager: TLS handshake error: EOF

Describe the bug:

Our running cert-manager-webhook instance is behaving oddly, and a complete re-deployment did not help either. The cert-manager-webhook restarts constantly because its readiness and liveness probes fail:

Readiness probe failed: Get "http://10.244.3.212:6080/healthz": context deadline exceeded

When I look into the logs, I see the following:


W1110 11:21:07.177272       1 client_config.go:615] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
W1110 11:21:07.277061       1 client_config.go:615] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I1110 11:21:07.277592       1 webhook.go:70] cert-manager/webhook "msg"="using dynamic certificate generating using CA stored in Secret resource"  "secret_name"="cert-manager-webhook-ca" "secret_namespace"="cert-manager"
I1110 11:21:07.278803       1 server.go:140] cert-manager/webhook "msg"="listening for insecure healthz connections"  "address"=":6080"
I1110 11:21:07.378544       1 server.go:171] cert-manager/webhook "msg"="listening for secure connections"  "address"=":10250"
I1110 11:21:07.378670       1 server.go:203] cert-manager/webhook "msg"="registered pprof handlers"
I1110 11:21:11.380192       1 dynamic_source.go:273] cert-manager/webhook "msg"="Updated serving TLS certificate"
I1110 11:23:47.879886       1 logs.go:58] http: TLS handshake error from 10.244.1.0:57155: EOF
I1110 11:23:50.475330       1 logs.go:58] http: TLS handshake error from 10.244.2.0:49158: EOF
I1110 11:23:50.875720       1 logs.go:58] http: TLS handshake error from 10.244.2.0:33850: EOF
I1110 11:23:50.976959       1 logs.go:58] http: TLS handshake error from 10.244.2.0:53597: EOF
I1110 11:23:51.378020       1 logs.go:58] http: TLS handshake error from 10.244.0.0:19278: EOF
I1110 11:23:51.475646       1 logs.go:58] http: TLS handshake error from 10.244.2.0:2307: EOF
I1110 11:23:51.675991       1 logs.go:58] http: TLS handshake error from 10.244.1.0:15006: EOF
I1110 11:35:03.976516       1 logs.go:58] http: TLS handshake error from 10.244.0.0:9939: EOF
I1110 11:35:05.375539       1 logs.go:58] http: TLS handshake error from 10.244.1.0:60231: EOF
I1110 11:35:05.778281       1 logs.go:58] http: TLS handshake error from 10.244.2.0:23763: EOF
I1110 11:35:06.177153       1 logs.go:58] http: TLS handshake error from 10.244.0.0:12657: EOF
I1110 11:35:06.977257       1 logs.go:58] http: TLS handshake error from 10.244.1.0:58889: EOF

I searched for these IP addresses among our pods, but I couldn't find any pod running with these IPs.
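As a hedged sketch of that search (standard kubectl commands; the interpretation of the addresses is an assumption, not confirmed in this thread):

```shell
# Search all namespaces for a pod that owns one of the logged source IPs.
kubectl get pods -A -o wide | grep '10.244.1.0'

# List each node's pod CIDR. Source addresses like 10.244.0.0, 10.244.1.0,
# and 10.244.2.0 often correspond to the node-side bridge address of a
# node's pod CIDR (e.g. with flannel) rather than to any pod, which would
# explain why no pod matches. If so, the connections are SNAT'd node
# traffic, such as apiserver webhook calls arriving from other nodes.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'
```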

Expected behaviour:

The cert-manager-webhook should start up as expected.

Steps to reproduce the bug:

Helm Chart 1.6.1 with the following values:

  global:
    imagePullSecrets: []
    priorityClassName: ""
    rbac:
      create: true
    podSecurityPolicy:
      enabled: false
      useAppArmor: true
    logLevel: 2
    leaderElection:
      namespace: "kube-system"
  installCRDs: true
  replicaCount: 1
  strategy: {}
  featureGates: ""
  image:
    repository: quay.io/jetstack/cert-manager-controller
    pullPolicy: IfNotPresent
  clusterResourceNamespace: ""
  serviceAccount:
    create: true
    automountServiceAccountToken: true
  extraArgs: []
  extraEnv: []
  resources:
    requests:
      cpu: 20m
      memory: 100Mi
    limits:
      cpu: 20m
      memory: 100Mi
  securityContext:
    runAsNonRoot: true
  containerSecurityContext: {}
  volumes: []
  volumeMounts: []
  podLabels: {}
  nodeSelector: {}
  ingressShim: {}
  prometheus:
    enabled: false
    servicemonitor:
      enabled: false
      prometheusInstance: default
      targetPort: 9402
      path: /metrics
      interval: 60s
      scrapeTimeout: 30s
      labels: {}
  affinity: {}
  tolerations: []
  webhook:
    replicaCount: 1
    timeoutSeconds: 30
    strategy: {}
    securityContext:
      runAsNonRoot: true
    containerSecurityContext: {}
    extraArgs: []
    resources:
      requests:
        cpu: 50m
        memory: 100Mi
      limits:
        cpu: 50m
        memory: 100Mi
    livenessProbe:
      failureThreshold: 3
      initialDelaySeconds: 60
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    readinessProbe:
      failureThreshold: 3
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 1
    nodeSelector: {}
    affinity: {}
    tolerations: []
    podLabels: {}
    serviceLabels: {}
    image:
      repository: quay.io/jetstack/cert-manager-webhook
      pullPolicy: IfNotPresent
    serviceAccount:
      create: true
      automountServiceAccountToken: true
    securePort: 10250
    hostNetwork: false
    serviceType: ClusterIP
    url: {}
  cainjector:
    enabled: true
    replicaCount: 1
    strategy: {}
    securityContext:
      runAsNonRoot: true
    containerSecurityContext: {}
    extraArgs: []
    resources:
      requests:
        cpu: 50m
        memory: 100Mi
      limits:
        cpu: 50m
        memory: 100Mi
    nodeSelector: {}
    affinity: {}
    tolerations: []
    podLabels: {}
    image:
      repository: quay.io/jetstack/cert-manager-cainjector
      pullPolicy: IfNotPresent
    serviceAccount:
      create: true
      automountServiceAccountToken: true
  startupapicheck:
    enabled: true
    securityContext:
      runAsNonRoot: true
    timeout: 1m
    backoffLimit: 4
    jobAnnotations:
      helm.sh/hook: post-install
      helm.sh/hook-weight: "1"
      helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded
    extraArgs: []
    resources:
      requests:
        cpu: 50m
        memory: 100Mi
      limits:
        cpu: 50m
        memory: 100Mi
    nodeSelector: {}
    affinity: {}
    tolerations: []
    podLabels: {}
    image:
      repository: quay.io/jetstack/cert-manager-ctl
      pullPolicy: IfNotPresent
    rbac:
      annotations:
        helm.sh/hook: post-install
        helm.sh/hook-weight: "-5"
        helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded
    serviceAccount:
      create: true
      annotations:
        helm.sh/hook: post-install
        helm.sh/hook-weight: "-5"
        helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded
      automountServiceAccountToken: true
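The deployment itself can be reproduced with a standard Helm install; the release and namespace names below are assumptions, and values.yaml holds the values listed above:

```shell
# Add the Jetstack chart repository and install cert-manager 1.6.1.
helm repo add jetstack https://charts.jetstack.io
helm repo update

# values.yaml contains the values shown above (including installCRDs: true).
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace \
  --version v1.6.1 \
  -f values.yaml
```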

Anything else we need to know?:

Environment details:

  • Kubernetes version: 1.21
  • Cloud-provider/provisioner: KubeOne (vanilla Kubernetes installed with kubeadm via KubeOne)
  • cert-manager version: 1.6.1
  • Install method: helm

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 12
  • Comments: 26

Most upvoted comments

I had a similar problem. What fixed it for me was deleting the cert-manager-webhook pod (which obviously caused it to be re-created).

@jetstack-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten. Send feedback to jetstack. /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to jetstack. /lifecycle stale

Specifically, the EOF errors seem to be related to a Go bug and appear on Kubernetes 1.23 and 1.24; see https://github.com/kubernetes/kubernetes/issues/109022

To reproduce the issue:

  • deploy cert-manager on Kubernetes 1.23 or 1.24
  • apply a number of resources (Certificates) that need to be validated by the webhook
  • observe the EOF errors
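A minimal sketch of the second step, assuming a ClusterIssuer named `selfsigned` already exists (all names here are hypothetical); each apply passes through the validating webhook:

```shell
# Create a batch of Certificate resources to exercise the webhook.
for i in $(seq 1 10); do
  cat <<EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: test-cert-$i
  namespace: default
spec:
  secretName: test-cert-$i-tls
  dnsNames:
    - test-$i.example.com
  issuerRef:
    name: selfsigned
    kind: ClusterIssuer
EOF
done

# Then watch for the EOF lines in the webhook logs.
kubectl -n cert-manager logs deploy/cert-manager-webhook -f
```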

I have not observed any actual problems caused by these error messages (the resources do get applied as expected), but I would be interested to hear if there are any. I do imagine this might be causing webhook slowness in some cases.

At the moment I assume there is nothing we can do to fix this, since the error is coming from the Kubernetes apiserver (but I will keep this issue open so we have a reference).

We are seeing the same error with v1.1.0. Restarting the webhook pod temporarily solved the problem, but after about 10 minutes the same TLS error started showing up in the webhook. Any explanation for why this might happen?

The CA certs are valid, and the webhook ca.crt matches the webhook CA secret.
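One way to verify that claim, assuming the default resource names from the chart install (a sketch, not an official procedure):

```shell
# Fingerprint of the CA stored in the webhook CA secret...
kubectl -n cert-manager get secret cert-manager-webhook-ca \
  -o jsonpath='{.data.ca\.crt}' | base64 -d | openssl x509 -noout -fingerprint

# ...and of the caBundle that cainjector placed in the webhook
# configuration. The two fingerprints should match.
kubectl get validatingwebhookconfigurations cert-manager-webhook \
  -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d | openssl x509 -noout -fingerprint
```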

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to jetstack. /lifecycle rotten /remove-lifecycle stale

Not sure if this is still happening for folks, but I was seeing the TLS handshake error: ... EOF errors in my deploy/cert-manager-webhook logs on a fresh cluster this week. The cert-manager workflow would get all the way to acquiring a Certificate from Let's Encrypt, but the webhook would then always fail to provide the Certificate to the Pods.

In my particular case I was able to resolve my problems by applying the Network Policies that are available as part of the cert-manager Helm chart with:

networkPolicy:
  enabled: true
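On an existing release this can be applied via a Helm upgrade; the release and namespace names are assumptions, and depending on the chart version the key may instead live under `webhook.networkPolicy.enabled`:

```shell
# Enable the chart's network policies, keeping all other values as deployed.
helm upgrade cert-manager jetstack/cert-manager \
  --namespace cert-manager --reuse-values \
  --set networkPolicy.enabled=true
```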
