cert-manager: TLS handshake error: EOF
Describe the bug:
Our running cert-manager-webhook instance is behaving oddly, and a complete re-deployment did not help either. The cert-manager-webhook is restarting all the time because its readiness and liveness probes fail:
Readiness probe failed: Get "http://10.244.3.212:6080/healthz": context deadline exceeded
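For reference, a hedged way to hit the same healthz endpoint the kubelet probes (port 6080 as in the event above; namespace and deployment name assume a default chart install):

```bash
# Port-forward to the webhook and request /healthz within the probe's 1s budget.
kubectl -n cert-manager port-forward deploy/cert-manager-webhook 6080:6080 &
curl -sv --max-time 1 http://127.0.0.1:6080/healthz
```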
When I look into the logs, I see the following:
W1110 11:21:07.177272 1 client_config.go:615] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
W1110 11:21:07.277061 1 client_config.go:615] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I1110 11:21:07.277592 1 webhook.go:70] cert-manager/webhook "msg"="using dynamic certificate generating using CA stored in Secret resource" "secret_name"="cert-manager-webhook-ca" "secret_namespace"="cert-manager"
I1110 11:21:07.278803 1 server.go:140] cert-manager/webhook "msg"="listening for insecure healthz connections" "address"=":6080"
I1110 11:21:07.378544 1 server.go:171] cert-manager/webhook "msg"="listening for secure connections" "address"=":10250"
I1110 11:21:07.378670 1 server.go:203] cert-manager/webhook "msg"="registered pprof handlers"
I1110 11:21:11.380192 1 dynamic_source.go:273] cert-manager/webhook "msg"="Updated serving TLS certificate"
I1110 11:23:47.879886 1 logs.go:58] http: TLS handshake error from 10.244.1.0:57155: EOF
I1110 11:23:50.475330 1 logs.go:58] http: TLS handshake error from 10.244.2.0:49158: EOF
I1110 11:23:50.875720 1 logs.go:58] http: TLS handshake error from 10.244.2.0:33850: EOF
I1110 11:23:50.976959 1 logs.go:58] http: TLS handshake error from 10.244.2.0:53597: EOF
I1110 11:23:51.378020 1 logs.go:58] http: TLS handshake error from 10.244.0.0:19278: EOF
I1110 11:23:51.475646 1 logs.go:58] http: TLS handshake error from 10.244.2.0:2307: EOF
I1110 11:23:51.675991 1 logs.go:58] http: TLS handshake error from 10.244.1.0:15006: EOF
I1110 11:35:03.976516 1 logs.go:58] http: TLS handshake error from 10.244.0.0:9939: EOF
I1110 11:35:05.375539 1 logs.go:58] http: TLS handshake error from 10.244.1.0:60231: EOF
I1110 11:35:05.778281 1 logs.go:58] http: TLS handshake error from 10.244.2.0:23763: EOF
I1110 11:35:06.177153 1 logs.go:58] http: TLS handshake error from 10.244.0.0:12657: EOF
I1110 11:35:06.977257 1 logs.go:58] http: TLS handshake error from 10.244.1.0:58889: EOF
I searched for these IP addresses among our pods, but I couldn't find any pod running with them. (They look like the per-node .0 gateway addresses of the CNI's pod subnets rather than pod IPs, i.e. traffic SNAT'd from the nodes themselves, which would explain why no pod matches.)
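One way to run that search with plain kubectl:

```bash
# List every pod with its IP and filter for the source addresses from the logs.
kubectl get pods -A -o wide | grep -E '10\.244\.[0-2]\.0'
```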
Expected behaviour:
The cert-manager-webhook should start up as expected.
Steps to reproduce the bug:
Helm Chart 1.6.1 with the following values:
global:
  imagePullSecrets: []
  priorityClassName: ""
  rbac:
    create: true
  podSecurityPolicy:
    enabled: false
    useAppArmor: true
  logLevel: 2
  leaderElection:
    namespace: "kube-system"
installCRDs: true
replicaCount: 1
strategy: {}
featureGates: ""
image:
  repository: quay.io/jetstack/cert-manager-controller
  pullPolicy: IfNotPresent
clusterResourceNamespace: ""
serviceAccount:
  create: true
  automountServiceAccountToken: true
extraArgs: []
extraEnv: []
resources:
  requests:
    cpu: 20m
    memory: 100Mi
  limits:
    cpu: 20m
    memory: 100Mi
securityContext:
  runAsNonRoot: true
containerSecurityContext: {}
volumes: []
volumeMounts: []
podLabels: {}
nodeSelector: {}
ingressShim: {}
prometheus:
  enabled: false
  servicemonitor:
    enabled: false
    prometheusInstance: default
    targetPort: 9402
    path: /metrics
    interval: 60s
    scrapeTimeout: 30s
    labels: {}
affinity: {}
tolerations: []
webhook:
  replicaCount: 1
  timeoutSeconds: 30
  strategy: {}
  securityContext:
    runAsNonRoot: true
  containerSecurityContext: {}
  extraArgs: []
  resources:
    requests:
      cpu: 50m
      memory: 100Mi
    limits:
      cpu: 50m
      memory: 100Mi
  livenessProbe:
    failureThreshold: 3
    initialDelaySeconds: 60
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 1
  readinessProbe:
    failureThreshold: 3
    initialDelaySeconds: 5
    periodSeconds: 5
    successThreshold: 1
    timeoutSeconds: 1
  nodeSelector: {}
  affinity: {}
  tolerations: []
  podLabels: {}
  serviceLabels: {}
  image:
    repository: quay.io/jetstack/cert-manager-webhook
    pullPolicy: IfNotPresent
  serviceAccount:
    create: true
    automountServiceAccountToken: true
  securePort: 10250
  hostNetwork: false
  serviceType: ClusterIP
  url: {}
cainjector:
  enabled: true
  replicaCount: 1
  strategy: {}
  securityContext:
    runAsNonRoot: true
  containerSecurityContext: {}
  extraArgs: []
  resources:
    requests:
      cpu: 50m
      memory: 100Mi
    limits:
      cpu: 50m
      memory: 100Mi
  nodeSelector: {}
  affinity: {}
  tolerations: []
  podLabels: {}
  image:
    repository: quay.io/jetstack/cert-manager-cainjector
    pullPolicy: IfNotPresent
  serviceAccount:
    create: true
    automountServiceAccountToken: true
startupapicheck:
  enabled: true
  securityContext:
    runAsNonRoot: true
  timeout: 1m
  backoffLimit: 4
  jobAnnotations:
    helm.sh/hook: post-install
    helm.sh/hook-weight: "1"
    helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded
  extraArgs: []
  resources:
    requests:
      cpu: 50m
      memory: 100Mi
    limits:
      cpu: 50m
      memory: 100Mi
  nodeSelector: {}
  affinity: {}
  tolerations: []
  podLabels: {}
  image:
    repository: quay.io/jetstack/cert-manager-ctl
    pullPolicy: IfNotPresent
  rbac:
    annotations:
      helm.sh/hook: post-install
      helm.sh/hook-weight: "-5"
      helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded
  serviceAccount:
    create: true
    annotations:
      helm.sh/hook: post-install
      helm.sh/hook-weight: "-5"
      helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded
    automountServiceAccountToken: true
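Note that both probes above run with timeoutSeconds: 1, while the reported readiness failure is a context deadline exceeded. A hedged mitigation sketch, assuming the probe is genuinely just slow rather than broken, is to give the probes more headroom via a values override:

```yaml
# values-override.yaml -- sketch: raise the webhook probe timeouts.
webhook:
  livenessProbe:
    timeoutSeconds: 5
  readinessProbe:
    timeoutSeconds: 5
```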
Anything else we need to know?:
Environment details:
- Kubernetes version: 1.21
- Cloud-provider/provisioner: KubeOne (vanilla Kubernetes installed with kubeadm via KubeOne)
- cert-manager version: 1.6.1
- Install method: helm

/kind bug
About this issue
- State: closed
- Created 3 years ago
- Reactions: 12
- Comments: 26
I had a similar problem. What fixed it for me was deleting the cert-manager-webhook pod (which obviously caused it to be re-created).

@jetstack-bot: Closing this issue.
Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to jetstack.

/lifecycle stale

Specifically, the EOF errors seem to be related to a Go bug and appear on Kubernetes 1.23 and 1.24; see https://github.com/kubernetes/kubernetes/issues/109022
To reproduce the issue: apply resources (e.g. Certificates) that need to get validated by the webhook.

I have not observed any actual issues related to these error messages (the resources do get applied as expected), but I would be interested to hear if there are any. I do imagine that this might be causing webhook slowness in some cases.
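As a concrete example, a minimal pair of resources that forces the apiserver to call the webhook (a sketch; names and namespace are illustrative):

```yaml
# Sketch: a self-signed Issuer plus a Certificate. Applying these makes the
# apiserver call the cert-manager webhook to validate the resources.
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: selfsigned-issuer   # illustrative name
  namespace: default
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: test-cert           # illustrative name
  namespace: default
spec:
  secretName: test-cert-tls
  dnsNames:
    - example.com
  issuerRef:
    name: selfsigned-issuer
```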
At the moment I assume that there is nothing we can do to fix this, as the error is coming from the Kubernetes apiserver (but we will keep this issue open so we have a reference).
We are seeing the same error with v1.1.0. Restarting the webhook pod temporarily solved the problem, but after about 10 minutes the same TLS error started showing up in the webhook. Any explanation for why this might happen?
The CA certs are valid, and the webhook ca.crt matches the webhook CA secret.
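For anyone wanting to verify the same thing, a hedged sketch (assuming the default resource names created by the Helm chart):

```bash
# Compare the CA bundle injected into the webhook configuration with the CA
# stored in the webhook Secret; both are base64, so the digests should match.
kubectl get validatingwebhookconfiguration cert-manager-webhook \
  -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | sha256sum
kubectl -n cert-manager get secret cert-manager-webhook-ca \
  -o jsonpath='{.data.ca\.crt}' | sha256sum
```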
Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to jetstack.

/lifecycle rotten
/remove-lifecycle stale

Not sure if this is still happening for folks, but I was seeing the TLS handshake error: ... EOF errors in my deploy/cert-manager-webhook logs on a fresh cluster this week, and would see the cert-manager workflow get all the way up to acquiring a Certificate from LetsEncrypt, but then the webhook would always fail to provide the Certificate to the Pods. In my particular case I was able to resolve my problems by applying the Network Policies that are available as part of the cert-manager Helm chart with:
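A sketch of how that can be done (the values key webhook.networkPolicy.enabled is an assumption based on newer chart versions; verify it against the values.yaml of the chart release in use):

```bash
# Hedged sketch: enable the NetworkPolicies bundled with the cert-manager chart.
# webhook.networkPolicy.enabled is assumed from newer chart versions -- check
# your chart's values.yaml before relying on it.
helm upgrade cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --reuse-values \
  --set webhook.networkPolicy.enabled=true
```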