k3s: Random webhook failures with EOF
Environmental Info:
K3s Version:
k3s version v1.23.8+k3s1 (53f2d4e7)
go version go1.17.5
Node(s) CPU architecture, OS, and Version:
Linux k8s-controller-1 5.15.0-39-generic #42-Ubuntu SMP Thu Jun 9 23:42:32 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration:
1 server, 3 agents
Describe the bug:
When FluxCD performs a dry-run on resources for which an admission webhook is configured (e.g. cert-manager), the dry-run intermittently fails with an EOF error.
On the webhook side, the incoming connection does not look like a proper HTTPS/TLS connection: "first record does not look like a TLS handshake".
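Flux's dry-run is, roughly, a server-side dry-run, so the same call path can be exercised directly with kubectl; the manifest name below is just an example of a resource handled by the failing webhook:

# Server-side dry-run of a webhook-backed resource (manifest name is illustrative)
kubectl apply --dry-run=server -f prometheusrule.yaml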
Steps To Reproduce:
- Installed K3s:
k3sup install --ip {{ ansible_default_ipv4.address }} \
  --ssh-key {{ lookup('env','HOME') }}/.ssh/id_ed25519 \
  --user {{ ansible_ssh_user }} \
  --k3s-extra-args '
    --disable traefik
    --disable servicelb
    --disable-cloud-controller
    --node-taint CriticalAddonsOnly=true:NoExecute
  ' \
  --k3s-channel stable
k3sup join --ip {{ ansible_default_ipv4.address }} \
  --server-ip {{ hostvars[groups['master'][0]].ansible_default_ipv4.address }} \
  --ssh-key {{ lookup('env','HOME') }}/.ssh/id_ed25519 \
  --user {{ ansible_ssh_user }} \
  --k3s-channel stable
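With the cluster up, the webhook path can then be exercised by installing a chart that ships an admission webhook (cert-manager in this report) and repeatedly applying one of its resources with a server-side dry-run. A rough loop like the one below (the manifest name is illustrative) is one way to poke at it, though as the comments below show, the failure does not reproduce reliably everywhere:

# Repeatedly exercise the admission webhook via server-side dry-runs
# (certificate.yaml stands in for any cert-manager resource)
while true; do
  kubectl apply --dry-run=server -f certificate.yaml || echo "dry-run failed"
  sleep 1
done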
Expected behavior:
Dry run succeeds
Actual behavior:
Dry run fails sometimes with the same configuration
Additional context / logs:
Dry run error:
dry-run failed, reason: InternalError, error: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": failed to call webhook: Post "https://kube-prometheus-stack-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": EOF
k3s logs from journalctl
k3s[682]: W0710 08:03:32.023261 682 dispatcher.go:195] Failed calling webhook, failing closed webhook.cert-manager.io: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.network.svc:443/mutate?timeout=10s": EOF
cert-manager webhook pod logs
I0710 08:03:32.022971 1 logs.go:59] http: TLS handshake error from 10.42.0.0:50900: tls: first record does not look like a TLS handshake
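Probing the webhook Service directly from an in-cluster pod can help rule out a problem with the webhook's own TLS setup (namespace and service are taken from the log above; the debug pod name and image are illustrative):

# One-off pod that opens a TLS connection to the webhook Service and prints the handshake
# (pod name and image are illustrative)
kubectl -n network run webhook-probe --rm -it --restart=Never \
  --image=curlimages/curl --command -- \
  curl -vk https://cert-manager-webhook.network.svc:443/mutate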
Backporting
- Needs backporting to older releases
A fix for this should make it into this month’s release cycle. Thanks for your patience, everyone, and for providing me with enough info to finally reproduce it reliably.
We’re about to release RCs for today’s upstream patch releases, which will include some updates to the egress-selector code and remotedialer library. I’m curious if this will address the issues you’re seeing.
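For anyone experimenting with this, the egress-selector behavior is configurable on the server; assuming a k3s release that exposes the --egress-selector-mode flag, it can be set via the config file. The value shown is just an example, not a fix confirmed in this thread:

# Example only: change the egress-selector mode and restart the server
# (valid modes include agent, cluster, pod, disabled; not a confirmed fix)
echo 'egress-selector-mode: disabled' | sudo tee -a /etc/rancher/k3s/config.yaml
sudo systemctl restart k3s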
That is doubtful; I am seeing the issue on Fedora Server as well.
What kind of CPU/memory/disk resources do these nodes have? What pod are you exec'ing into; is it one of the packaged ones or something that you’ve deployed yourself? Are you running the kubectl command locally, or on another host?
I’ve tried load-testing both logs and exec requests, and haven’t been able to trigger it on any of my dev hosts, or on an EC2 instance - so anything that might help me hit it on demand would be great.
I’m testing by doing this, which I believe should be pretty similar to what you’re doing:
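Something along these lines (the pod name and command are placeholders for whatever workload is being targeted):

# Hammer exec and logs requests in a loop against a running pod
# (test-pod and the date command are placeholders)
while true; do
  kubectl exec test-pod -- date >/dev/null || echo "exec failed"
  kubectl logs test-pod --tail=1 >/dev/null || echo "logs failed"
done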
I can leave that running for hours and not see any issue 😦