kubeadm: Upgrade/apply fatal timeout 1.19.4 to 1.20.6
What keywords did you search in kubeadm issues before filing this one?
fatal
timeout
invalid bearer token
waiting to restart
Is this a BUG REPORT or FEATURE REQUEST?
BUG REPORT
Versions
kubeadm version (use kubeadm version): 1.20.6
{Major:"1", Minor:"20", GitVersion:"v1.20.6", GitCommit:"8a62859e515889f07e3e3be6a1080413f17cf2c3", GitTreeState:"clean", BuildDate:"2021-04-15T03:26:21Z", GoVersion:"go1.15.10", Compiler:"gc", Platform:"linux/amd64"}
Environment:
- Kubernetes version (use kubectl version): 1.19.4 / 1.20.6 (upgrading from 1.19.4; on the master this is what shows upon failure) Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"6b1d87acf3c8253c123756b9e61dac642678305f", GitTreeState:"clean", BuildDate:"2021-03-18T01:10:43Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: Bare-metal plus VMs
- OS (e.g. from /etc/os-release): Ubuntu 20.04.2 LTS
What happened?
The kubeadm upgrade apply v1.20.6 command will not get past this on my master:
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
Static pod: kube-apiserver-borg0.ci.net hash: 9160e7ddf2ec811c44ee54195ce49d0d
Static pod: kube-apiserver-borg0.ci.net hash: 9160e7ddf2ec811c44ee54195ce49d0d
Static pod: kube-apiserver-borg0.ci.net hash: 9160e7ddf2ec811c44ee54195ce49d0d
timed out waiting for the condition
couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state.
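For context on what that wait is doing: kubeadm rewrites the static pod manifest for the apiserver and then polls for the kubelet to bring the component back up with a new hash. Below is a minimal diagnostic sketch of what can be watched on the master during that 5-minute window; the paths and the container name filter are assumptions for a default kubeadm + Docker layout, not output from my runs.

# Has the upgraded manifest actually been written out?
ls -l /etc/kubernetes/manifests/kube-apiserver.yaml
# kubeadm stashes the previous manifests under /etc/kubernetes/tmp before the swap.
ls /etc/kubernetes/tmp/
# Is the kubelet tearing down and restarting the apiserver container at all?
watch -n 5 'docker ps --filter name=k8s_kube-apiserver --format "{{.ID}} {{.Status}}"'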
What you expected to happen?
Success, or at least a diagnostic error message telling me what I might want to look at more closely.
How to reproduce it (as minimally and precisely as possible)?
In the hopes of being able to use kubeadm upgrade as a routine low-risk update process, just 5 months ago I built a new cluster and laboriously spent a week getting dozens of services running on this 1.19.4 instance. As far as I know, the steps to reproduce are: install 1.19.4, run normal workloads, then invoke kubeadm upgrade apply. It’s a vanilla cluster with a single master and three workers.
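Concretely, the invocation follows the standard documented flow on the master. This is a hedged sketch rather than a transcript of my session; in particular, the 1.20.6-00 apt package pin is an assumption based on the usual Ubuntu package naming.

# Upgrade the kubeadm binary first, then plan and apply.
apt-mark unhold kubeadm
apt-get update && apt-get install -y kubeadm=1.20.6-00
apt-mark hold kubeadm
kubeadm upgrade plan            # lists v1.20.6 as an available upgrade target
kubeadm upgrade apply v1.20.6   # this is the step that times out as shown above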
Anything else we need to know?
On the fourth attempt, I ran docker logs -f on the three containers it spun up. The apiserver seemed to give the best hint as to what the problem is; it was generating 10 to 20 errors like these per second during the 5-minute wait for the timeout:
E0503 14:56:02.622620 1 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, Token has been invalidated]
E0503 14:56:03.038727 1 status.go:71] apiserver received an error that is not an metav1.Status: 3
E0503 14:56:03.039625 1 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, Token has been invalidated]
E0503 14:56:03.122677 1 status.go:71] apiserver received an error that is not an metav1.Status: 3
E0503 14:56:03.124287 1 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, Token has been invalidated]
E0503 14:56:03.354751 1 cacher.go:419] cacher (*core.Secret): unexpected ListAndWatch error: failed to list *core.Secret: illegal base64 data at input byte 3; reinitializing...
E0503 14:56:03.371693 1 status.go:71] apiserver received an error that is not an metav1.Status: 3
E0503 14:56:03.372849 1 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, Token has been invalidated]
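For reference, this is roughly how those lines were captured; the k8s_kube-apiserver name filter is an assumption based on the default dockershim container naming, and <container-id> is a placeholder.

# Find the freshly started apiserver container, then follow its logs and keep
# only the error lines quoted above.
docker ps --filter name=k8s_kube-apiserver --format '{{.ID}}  {{.Names}}'
docker logs -f <container-id> 2>&1 | grep -E 'authentication\.go|status\.go|cacher\.go'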
The stderr output from kubeadm itself, at verbosity 5, is attached here:
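A hedged sketch of how that capture can be reproduced: -v=5 raises kubeadm's log verbosity, and the redirect saves the stderr stream to a file.

# Re-run the upgrade with verbose logging and keep stderr for attachment.
kubeadm upgrade apply v1.20.6 -v=5 2> kubeadm-upgrade-stderr.log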
It looks like PodSecurityPolicy is a setting that I experimented with a couple years ago and forgot all about. I have this commented-out note in the ClusterConfiguration of my kubeadm startup script:
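As a separate, hedged check of what the running control plane actually ended up with, the admission-plugin flag can be read straight from the apiserver static pod manifest; the path assumes a default kubeadm layout.

# See whether PodSecurityPolicy is among the enabled admission plugins on the
# live apiserver (this inspects the static pod manifest, not the startup script).
grep enable-admission-plugins /etc/kubernetes/manifests/kube-apiserver.yaml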
Anything else here deserving of the priority/awaiting-more-evidence label, or have I given you everything you need? Rollback seems to be OK (it takes 3-5 minutes to bring things back to steady state); I’ve gone through this 4 times and haven’t gotten into a non-working state. Note that this is not a zero-downtime process: a number of production-facing impacts do occur during the attempt to roll forward and back.