k3s: Crash loop due to apiserver healthcheck error with poststarthook/rbac/bootstrap-roles
Environmental Info: K3s Version: k3s version v1.19.2+k3s1 (d38505b1)
Node(s) CPU architecture, OS, and Version: Linux k3os-16698 5.4.0-48-generic #52 SMP Sat Sep 26 08:27:15 UTC 2020 x86_64 GNU/Linux
Cluster Configuration: single-node
Describe the bug:
Shortly after k3s starts, an apiserver failure triggers a crash. The log below shows `poststarthook/rbac/bootstrap-roles failed: reason withheld`; this is followed by a fairly large traceback, after which the service restarts.
Failed to wait for apiserver being healthy: timed out waiting for the condition: failed to get apiserver /healthz status: an error on the server has prevented the request from succeeding. The body of the server's response:

    [+]ping ok
    [+]log ok
    [+]etcd ok
    [+]poststarthook/start-kube-apiserver-admission-initializer ok
    [+]poststarthook/generic-apiserver-start-informers ok
    [+]poststarthook/start-apiextensions-informers ok
    [+]poststarthook/start-apiextensions-controllers ok
    [+]poststarthook/crd-informer-synced ok
    [+]poststarthook/bootstrap-controller ok
    [-]poststarthook/rbac/bootstrap-roles failed: reason withheld
    [+]poststarthook/scheduling/bootstrap-system-priority-classes ok
    [+]poststarthook/start-cluster-authentication-info-controller ok
    [+]poststarthook/aggregator-reload-proxy-client-cert ok
    [+]poststarthook/start-kube-aggregator-informers ok
    [+]poststarthook/apiservice-registration-controller ok
    [+]poststarthook/apiservice-status-available-controller ok
    [+]poststarthook/kube-apiserver-autoregistration ok
    [+]autoregister-completion ok
    [+]poststarthook/apiservice-openapi-controller ok
    healthz check failed
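The same per-check breakdown can be requested from the apiserver directly while the service is up. A minimal sketch, assuming the default kubeconfig that the k3s install sets up:

```sh
# Ask the apiserver for verbose healthz output, listing each check
# (including the failing poststarthook/rbac/bootstrap-roles) individually.
k3s kubectl get --raw '/healthz?verbose'
```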
Steps To Reproduce:
The last change to the cluster was disabling the bundled coredns deployment (`--no-deploy coredns`) in favor of an existing Helm installation of coredns; a sketch of the invocation follows below. I don't see a connection between this new failure mode and coredns.
However, I'm unable to identify exactly what triggered this failure; I can provide the coredns configuration if required.
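For reference, a sketch of how the `--no-deploy coredns` change was applied, assuming the standard get.k3s.io install script (`INSTALL_K3S_EXEC` is its documented mechanism for passing server flags; `--no-deploy` was the flag spelling in v1.19):

```sh
# Reinstall/upgrade k3s with the bundled coredns deployment disabled,
# leaving DNS to the separately managed Helm installation of coredns.
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC='server --no-deploy coredns' sh -
```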
Expected behavior:
K3s should start without apiserver health check errors and subsequent crashing.
Actual behavior:
K3s fails to fully instantiate due to the apiserver health check failure and subsequent crashloop.
Additional context / logs:
Commits related to this issue
- Downgrade server to work around cronjob bug, disable CC to work around startup bug https://github.com/k3s-io/k3s/issues/2704 https://github.com/k3s-io/k3s/issues/2425 — committed to parente/homelab by parente 3 years ago
- Fix k3s restart crash loop: https://github.com/k3s-io/k3s/issues/2425#issuecomment-735298338 — committed to jjbubudi/rpi-provision by jjbubudi 3 years ago
I ended up having the same issue after an upgrade, but after testing I realized that, across multiple restarts of k3s, it sometimes starts without crashing. Every crash log ends with the same line as the healthz failure quoted above.
After disabling the cloud-controller-manager by adding `--disable-cloud-controller`, k3s now starts every time without crashing. So perhaps there is a race condition of some kind: the cloud-controller-manager may be starting too early, and the 10-second timeout may be too short? cc @brandond, whom I saw mention this in #2477.
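One way to check for such a race during startup (a sketch, assuming the default `k3s` systemd unit name) is to follow the service journal and filter for the two components:

```sh
# Watch k3s startup and surface the failing poststarthook alongside
# cloud-controller-manager activity, to see their relative ordering.
journalctl -u k3s -f | grep -E 'bootstrap-roles|cloud-controller'
```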
I was able to solve the restart loop by adding the flag `--disable-cloud-controller` to `ExecStart` in `/etc/systemd/system/k3s.service`. This seems to be preserved between restarts and upgrades, and all my pods are running fine. I didn't create an external cloud controller, so for now it's running fine without one at all. I'm sure there's some negative implication here that I'm missing, but for now it seems to meet my needs. Hopefully this helps anyone else running into this problem.

@AlessioCasco I have not seen this issue present on anything except for very low-end arm hardware. Can you please open a new issue with information on your environment, and attach K3s logs from your nodes?
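For anyone preferring not to edit the unit file directly, the `--disable-cloud-controller` workaround above can also be persisted via a systemd drop-in. A sketch, assuming the default binary path `/usr/local/bin/k3s` from the standard install:

```sh
# Create a drop-in that overrides ExecStart with the extra flag.
sudo mkdir -p /etc/systemd/system/k3s.service.d
sudo tee /etc/systemd/system/k3s.service.d/override.conf >/dev/null <<'EOF'
[Service]
# The empty ExecStart= clears the packaged command before redefining it.
ExecStart=
ExecStart=/usr/local/bin/k3s server --disable-cloud-controller
EOF
sudo systemctl daemon-reload
sudo systemctl restart k3s
```

A drop-in should also survive the install script rewriting the main unit file on upgrade, since files under `k3s.service.d/` are left alone.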