kubeadm: Upgrade/apply fatal timeout 1.19.4 to 1.20.6

What keywords did you search in kubeadm issues before filing this one?

fatal timeout invalid bearer token waiting to restart

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use kubeadm version): 1.20.6 {Major:"1", Minor:"20", GitVersion:"v1.20.6", GitCommit:"8a62859e515889f07e3e3be6a1080413f17cf2c3", GitTreeState:"clean", BuildDate:"2021-04-15T03:26:21Z", GoVersion:"go1.15.10", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Kubernetes version (use kubectl version): 1.19.4, upgrading to 1.20.6. On the master, this is what kubectl version reports at the time of the failure: Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"6b1d87acf3c8253c123756b9e61dac642678305f", GitTreeState:"clean", BuildDate:"2021-03-18T01:10:43Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}

  • Cloud provider or hardware configuration: Bare-metal plus VMs

  • OS (e.g. from /etc/os-release): Ubuntu 20.04.2 LTS

What happened?

The kubeadm upgrade apply v1.20.6 command will not get past this on my master:

[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
Static pod: kube-apiserver-borg0.ci.net hash: 9160e7ddf2ec811c44ee54195ce49d0d
Static pod: kube-apiserver-borg0.ci.net hash: 9160e7ddf2ec811c44ee54195ce49d0d
Static pod: kube-apiserver-borg0.ci.net hash: 9160e7ddf2ec811c44ee54195ce49d0d
timed out waiting for the condition
couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state.
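
For anyone trying to observe the same hang, something along these lines from a second terminal shows whether the kubelet ever picks up the new manifest (a rough sketch; Docker runtime assumed, and the container ID is a placeholder):

# Watch the kubelet's view of the restart attempt
journalctl -u kubelet -f

# Check whether the new kube-apiserver manifest was actually written
ls -l /etc/kubernetes/manifests/
grep image: /etc/kubernetes/manifests/kube-apiserver.yaml

# Follow the freshly started control-plane containers
docker ps --filter name=kube-apiserver
docker logs -f <container-id>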

What you expected to happen?

Success, or at least a diagnostic error message telling me what I might want to look at more closely.

How to reproduce it (as minimally and precisely as possible)?

In the hope of being able to use kubeadm upgrade as a routine, low-risk update process, I built a new cluster just 5 months ago and spent a laborious week getting dozens of services running on this 1.19.4 instance. As far as I know, the steps to reproduce are: install 1.19.4, run normal workloads, then invoke kubeadm upgrade apply (the sequence I follow is sketched below). It's a vanilla cluster with a single master and three workers.
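
For completeness, the upgrade itself follows the standard documented sequence on the master; roughly this (package pinning from memory, so adjust the -00 suffix to whatever apt actually offers):

# Upgrade the kubeadm package first, then plan and apply
apt-get update
apt-get install -y --allow-change-held-packages kubeadm=1.20.6-00
kubeadm version
kubeadm upgrade plan
kubeadm upgrade apply v1.20.6 --v=5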

Anything else we need to know?

On the fourth attempt, I ran docker logs -f on the three containers it spun up. The one that gave the best hint about the problem was the apiserver, which was generating 10 to 20 errors like these per second during the 5-minute wait before the timeout:

E0503 14:56:02.622620       1 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, Token has been invalidated]
E0503 14:56:03.038727       1 status.go:71] apiserver received an error that is not an metav1.Status: 3
E0503 14:56:03.039625       1 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, Token has been invalidated]
E0503 14:56:03.122677       1 status.go:71] apiserver received an error that is not an metav1.Status: 3
E0503 14:56:03.124287       1 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, Token has been invalidated]
E0503 14:56:03.354751       1 cacher.go:419] cacher (*core.Secret): unexpected ListAndWatch error: failed to list *core.Secret: illegal base64 data at input byte 3; reinitializing...
E0503 14:56:03.371693       1 status.go:71] apiserver received an error that is not an metav1.Status: 3
E0503 14:56:03.372849       1 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, Token has been invalidated]
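
For anyone else hitting this, two things seem worth checking while those errors are streaming (a rough sketch, not a fix: the container ID is a placeholder, and the encryption-config check only rules out one common cause of Secret list/decode failures):

# 1) Which clients are being rejected with 401s? Watch the other control-plane
#    components and the kubelet alongside the apiserver log.
docker ps --filter name=kube-controller-manager --format '{{.ID}} {{.Names}}'
docker logs -f <controller-manager-container-id> 2>&1 | grep -i unauthorized
journalctl -u kubelet -f | grep -i unauthorized

# 2) Is encryption-at-rest configured? A missing or mismatched key file is one
#    common reason a new apiserver fails to read Secrets back from etcd.
grep -n encryption-provider-config /etc/kubernetes/manifests/kube-apiserver.yaml \
  || echo "no encryption-provider-config flag set"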

The stderr output from kubeadm itself, at verbosity 5, is attached here:

kubeadm.log

Most upvoted comments

It looks like PodSecurityPolicy is a setting I experimented with a couple of years ago and forgot all about. I have this commented-out note in the ClusterConfiguration section of my kubeadm startup script:

      kind: ClusterConfiguration
      # TODO cluster will not bootstrap, has to be added after
      # apiServerExtraArgs:
      #  enable-admission-plugins: PodSecurityPolicy
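
To double-check whether that old experiment left anything behind, something like the following should show whether the plugin is actually enabled on the running apiserver and whether any PSP objects still exist (a quick sketch; the final grep is only a name-based heuristic):

# Is the PodSecurityPolicy admission plugin enabled on the running apiserver?
grep -n enable-admission-plugins /etc/kubernetes/manifests/kube-apiserver.yaml

# Do any PodSecurityPolicy objects remain in the cluster?
kubectl get podsecuritypolicies

# Rough name-based check for RBAC bindings that reference a PSP role
kubectl get clusterrolebindings | grep -i psp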

Anything else here deserving of the priority/awaiting-more-evidence label, or have I given you everything you need?

  • the rollback procedure failing

Rollback seems to be OK (it takes 3-5 minutes to bring things back to a steady state); I've gone through this 4 times and haven't ended up in a non-working state. Note that this is not a zero-downtime process: a number of production-facing impacts do occur during the attempt to roll forward and back.
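
If it helps the report, the rollback can also be sanity-checked against the copies kubeadm leaves behind; a sketch like this should do it (the backup path and directory naming are from memory and may differ by version, and <timestamp> is a placeholder):

# List the backups kubeadm left behind from the upgrade attempts
ls -ld /etc/kubernetes/tmp/kubeadm-backup-*

# Compare the restored manifests against the newest backup
# (pick the most recent directory from the listing above)
diff -r /etc/kubernetes/tmp/kubeadm-backup-manifests-<timestamp>/ /etc/kubernetes/manifests/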