k3s: Random webhook failures with EOF

Environmental Info: K3s Version:

k3s version v1.23.8+k3s1 (53f2d4e7) go version go1.17.5

Node(s) CPU architecture, OS, and Version:

Linux k8s-controller-1 5.15.0-39-generic #42-Ubuntu SMP Thu Jun 9 23:42:32 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

1 server, 3 agents

Describe the bug:

When FluxCD performs a dry-run on resources that have an admission webhook configured (e.g. cert-manager), the dry-run fails with an EOF error. On the webhook side the connection does not look like proper HTTPS/TLS: "first record does not look like a TLS handshake"
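
The failing call is the apiserver contacting the admission webhooks registered for the resources being dry-run. Which webhooks are involved in a given cluster can be listed with plain kubectl:

    kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations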

Steps To Reproduce:

  • Installed K3s (the webhook workload needed to trigger the bug is sketched after these commands):
        k3sup install --ip {{ ansible_default_ipv4.address }} \
        --ssh-key {{ lookup('env','HOME') }}/.ssh/id_ed25519 \
        --user {{ ansible_ssh_user }} \
        --k3s-extra-args '
          --disable traefik
          --disable servicelb
          --disable-cloud-controller
          --node-taint CriticalAddonsOnly=true:NoExecute
        ' \
        --k3s-channel stable

        k3sup join --ip {{ ansible_default_ipv4.address }} \
        --server-ip {{ hostvars[groups['master'][0]].ansible_default_ipv4.address }} \
        --ssh-key {{ lookup('env','HOME') }}/.ssh/id_ed25519 \
        --user {{ ansible_ssh_user }} \
        --k3s-channel stable
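
The steps above cover only the K3s install; reproducing the webhook failures also needs a workload that registers an admission webhook. A hedged sketch, assuming cert-manager installed from its Helm chart into the network namespace that appears in the logs below (the issue does not spell out how it was actually installed):

    helm repo add jetstack https://charts.jetstack.io
    helm repo update
    # installCRDs=true lets the chart manage its own CRDs
    helm install cert-manager jetstack/cert-manager \
      --namespace network --create-namespace \
      --set installCRDs=true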

Expected behavior:

Dry run succeeds

Actual behavior:

Dry run fails intermittently with the same configuration

Additional context / logs:

Dry run error:

dry-run failed, reason: InternalError, error: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": failed to call webhook: Post "https://kube-prometheus-stack-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": EOF

k3s logs from journalctl

k3s[682]: W0710 08:03:32.023261     682 dispatcher.go:195] Failed calling webhook, failing closed webhook.cert-manager.io: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.network.svc:443/mutate?timeout=10s": EOF

cert-manager webhook pod logs

I0710 08:03:32.022971       1 logs.go:59] http: TLS handshake error from 10.42.0.0:50900: tls: first record does not look like a TLS handshake
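
The plaintext-looking connection in that log can be cross-checked by probing the webhook service directly from inside the cluster; if a handshake completes here, the problem is more likely on the apiserver/tunnel side than in the webhook itself. A sketch (the alpine/openssl image is an assumption; the service name comes from the log above):

    kubectl run tls-probe --rm -it --restart=Never --image=alpine/openssl -- \
      s_client -connect cert-manager-webhook.network.svc:443 -brief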

Backporting

  • Needs backporting to older releases

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 2
  • Comments: 37 (19 by maintainers)

Most upvoted comments

A fix for this should make it into this month’s release cycle. Thanks for your patience, everyone, and for providing me with enough info to finally reproduce it reliably.

We’re about to release RCs for today’s upstream patch releases, which will include some updates to the egress-selector code and remotedialer library. I’m curious if this will address the issues you’re seeing.
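
For anyone testing those RCs who wants to rule the tunnel in or out, k3s exposes an egress-selector-mode setting; setting it to disabled makes the apiserver dial webhook endpoints directly instead of going through the remotedialer tunnel. A sketch for the server node, assuming the default config file location and that the setting is available in this release line:

    # Switch off the egress-selector tunnel and restart k3s (server node only).
    echo 'egress-selector-mode: disabled' | sudo tee -a /etc/rancher/k3s/config.yaml
    sudo systemctl restart k3s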

We’re also seeing this on Ubuntu 20.04 Kernel v5.13.0-37 from the HWE line. Are we potentially looking into this being a K3s<>Ubuntu combo issue?

That is doubtful; I am seeing the issue on Fedora Server as well.

What kind of CPU/memory/disk resources do these nodes have? What pod are you execing into; is it one of the packaged ones or something that you’ve deployed yourself? Are you running the kubectl command locally, or on another host?

I’ve tried load-testing both logs and exec requests, and haven’t been able to trigger it on any of my dev hosts, or on an EC2 instance - so anything that might help me hit it on demand would be great.

I’m testing by doing this, which I believe should be pretty similar to what you’re doing:

while true; do kubectl exec -n kube-system local-path-provisioner-5b5579c644-sphmc -- ls -la >/dev/null; done

I can leave that running for hours and not see any issue 😦
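
For the webhook-flavored failures reported in this issue, a loop of server-side dry-runs may be a closer reproducer than exec, since it forces the apiserver to call out to the webhook on each iteration. A sketch (cert.yaml is a placeholder for any manifest covered by a webhook):

    while true; do kubectl apply --dry-run=server -f cert.yaml >/dev/null || break; sleep 1; done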