kubeadm: Upgrading from 1.9.6 to 1.10.0 fails with timeout
BUG REPORT
Versions
kubeadm version (use `kubeadm version`):

```
kubeadm version: &version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-26T16:44:10Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
```
Environment:
- Kubernetes version (use `kubectl version`):

```
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:21:50Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:13:31Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
```
- Cloud provider or hardware configuration: Scaleway baremetal C2S
- OS (e.g. from /etc/os-release): Ubuntu Xenial (16.04 LTS) (GNU/Linux 4.4.122-mainline-rev1 x86_64)
- Kernel (e.g. `uname -a`): Linux amd64-master-1 4.4.122-mainline-rev1 #1 SMP Sun Mar 18 10:44:19 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
What happened?
Trying to upgrade from 1.9.6 to 1.10.0, I get this error:
```
kubeadm upgrade apply v1.10.0
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[upgrade/version] You have chosen to change the cluster version to "v1.10.0"
[upgrade/versions] Cluster version: v1.9.6
[upgrade/versions] kubeadm version: v1.10.0
[upgrade/confirm] Are you sure you want to proceed with the upgrade? [y/N]: y
[upgrade/prepull] Will prepull images for components [kube-apiserver kube-controller-manager kube-scheduler]
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.0"...
[etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests411909119/etcd.yaml"
[certificates] Generated etcd/ca certificate and key.
[certificates] Generated etcd/server certificate and key.
[certificates] etcd/server serving cert is signed for DNS names [localhost] and IPs [127.0.0.1]
[certificates] Generated etcd/peer certificate and key.
[certificates] etcd/peer serving cert is signed for DNS names [arm-master-1] and IPs [10.1.244.57]
[certificates] Generated etcd/healthcheck-client certificate and key.
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/etcd.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests180476754/etcd.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/apply] FATAL: fatal error when trying to upgrade the etcd cluster: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition], rolled the state back to pre-upgrade state
```
What did you expect to happen?
A successful upgrade.
How to reproduce it (as minimally and precisely as possible)?
Install the 1.9.6 packages and init a 1.9.6 cluster:

```shell
curl -fsSL https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
echo "deb http://apt.kubernetes.io/ kubernetes-xenial main" | tee /etc/apt/sources.list.d/kubernetes.list
apt-get update -qq
apt-get install -qy kubectl=1.9.6-00
apt-get install -qy kubelet=1.9.6-00
apt-get install -qy kubeadm=1.9.6-00
```
Edit the kubeadm-config ConfigMap and change `featureGates` from a string to a map, as reported in https://github.com/kubernetes/kubernetes/issues/61764:

```shell
kubectl -n kube-system edit cm kubeadm-config
```

```
....
featureGates: {}
....
```
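For context, the broken and fixed values look like this (a sketch based on the linked issue; the surrounding fields are elided):

```yaml
# before (a string, rejected by the 1.10 config parser):
#   featureGates: ""
# after (an empty map, parses correctly):
featureGates: {}
```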
Download kubeadm 1.10.0 and run `kubeadm upgrade plan` and `kubeadm upgrade apply v1.10.0`.
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 5
- Comments: 42 (20 by maintainers)
A temporary workaround is to ensure the etcd certs exist and to upgrade the etcd and apiserver pods manually, bypassing the checks.
Be sure to check your config and add any flags for your use case:
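A sketch of that workaround, assuming kubeadm v1.10's `alpha phase certs` subcommands and the default `/etc/kubernetes` paths; `kubeadm.yaml` is a placeholder for your own config file, and the exact phase names should be verified with `kubeadm alpha phase certs --help` before running:

```shell
# Sketch of the workaround: regenerate the etcd-related certs that kubeadm
# v1.10 expects, then force the upgrade past the failing checks.
etcd_tls_workaround() {
  kubeadm alpha phase certs etcd-ca --config kubeadm.yaml
  kubeadm alpha phase certs etcd-server --config kubeadm.yaml
  kubeadm alpha phase certs etcd-peer --config kubeadm.yaml
  kubeadm alpha phase certs etcd-healthcheck-client --config kubeadm.yaml
  kubeadm alpha phase certs apiserver-etcd-client --config kubeadm.yaml
  # --force skips some requirement checks; use with care
  kubeadm upgrade apply v1.10.0 --force
}
```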
@codepainters I think it is the same issue.
There are a few underlying problems causing this issue:
As a result, the upgrade currently succeeds only when a pod status update for the etcd pod happens to change the hash before the kubelet picks up the new static manifest for etcd. Additionally, the API server needs to remain available during the first part of the upgrade, while the upgrade tooling queries the API prior to updating the apiserver manifest.
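One way to observe by hand whether the kubelet has picked up a new static manifest (a sketch, not part of kubeadm; assumes `kubectl` access, and the node name comes from the log above):

```shell
# Sketch: print the kubelet's config hash annotation for the etcd mirror pod.
# The kubelet updates this annotation when it re-creates the pod from a
# changed static manifest, so watching it shows whether the new manifest
# was picked up.
show_etcd_pod_hash() {
  kubectl -n kube-system get pod "etcd-$1" \
    -o jsonpath='{.metadata.annotations.kubernetes\.io/config\.hash}'
}
# usage: show_etcd_pod_hash arm-master-1
```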
@renich just give it the filepath of your config
If you don’t use any custom settings, you can pass it an empty file. Here’s a simple way to do that in bash:
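For instance (a sketch; `--config` is kubeadm's flag for passing a configuration file):

```shell
# Create an empty config file and hand it to kubeadm.
CONFIG="$(mktemp)"
: > "$CONFIG"   # make sure it is empty
echo "empty config at: $CONFIG"
# then run: kubeadm upgrade apply v1.10.0 --config "$CONFIG"
```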
@kvaps @stealthybox this is most likely an `etcd` issue (kubeadm speaks plain HTTP/2 to a TLS-enabled `etcd`), I hit it too. See this other issue: https://github.com/kubernetes/kubeadm/issues/755

Honestly, I can't understand why the same TCP port is used for both the TLS and non-TLS `etcd` listeners; it only causes troubles like this one. Getting a plain, old connection refused would give an immediate hint; here I had to resort to `tcpdump` to understand what's going on.

This should now be resolved with the merging of https://github.com/kubernetes/kubernetes/pull/62655 and will be part of the v1.10.2 release.
Thanks @stealthybox. For me the `upgrade apply` process stalled on `[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.1"...`, however the cluster was successfully upgraded.

PR to address the static pod update race condition: https://github.com/kubernetes/kubernetes/pull/61942
Cherry-pick PR for the release-1.10 branch: https://github.com/kubernetes/kubernetes/pull/61954
@detiber and I got on a call to discuss changes we need to make to the upgrade process. We plan to implement 3 fixes for this bug in the 1.10.x patch releases:
1. Remove etcd TLS from the upgrade.
   - The current upgrade loop makes batch modifications per component in a serial manner.
   - Upgrading a component has no knowledge of dependent component configurations.
   - Verifying an upgrade requires the APIServer to be available to check the pod status.
   - Etcd TLS requires a coupled etcd+apiserver configuration change, which breaks this contract.
   - This is the minimum viable change to fix this issue, and it leaves upgraded clusters with insecure etcd.
2. Fix the mirror-pod hash race condition on pod status change.
   - https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/phases/upgrade/staticpods.go#L189
   - Upgrades will then be correct, assuming compatibility between the etcd and apiserver flags.
3. Upgrade TLS specifically in a separate phase.
   - Etcd and the APIServer need to be upgraded together.
   - `kubeadm alpha phase ensure-etcd-tls`?
   - This phase should be runnable independently of a cluster upgrade.
   - During a cluster upgrade, this phase should run before updating all of the components.
For 1.11 we want to:
- It's undesirable to rely on the apiserver and etcd for monitoring local processes like we are currently doing.
  - A local source of data about pods is superior to relying on higher-order distributed Kubernetes components.
  - This will replace the current pod runtime checks in the upgrade loop.
  - This will allow us to add checks to the ensure-etcd-tls phase.
- Alternative: use the CRI to get pod info (demo'd viable using `crictl`).
  - Caveat: CRI on dockershim and possibly other container runtimes does not currently support backward compatibility for CRI breaking changes.
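To illustrate the crictl alternative, a node-local check could be sketched like this (assumes `crictl` is installed and configured for the node's runtime socket; this is not an actual kubeadm command):

```shell
# Sketch: ask the CRI directly for the etcd pod sandbox, bypassing the apiserver.
check_local_etcd() {
  crictl pods --name etcd --state Ready
}
# usage (on a control-plane node): check_local_etcd
```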
I just hit another weird edge case related to this bug. The kubeadm upgrade marked the etcd upgrade as complete before the new etcd image had been pulled and the new static pod deployed. This causes the upgrade to time out at a later step and the rollback to fail, leaving the cluster in a broken state. Restoring the original etcd static pod manifest is needed to recover the cluster.
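For anyone hitting that broken state, a minimal recovery sketch: copy the backed-up manifest from the `kubeadm-backup-manifests*` directory (named in the upgrade log; the random suffix differs per run) back into the static manifest directory so the kubelet re-creates the pre-upgrade etcd pod.

```shell
# Sketch: restore a backed-up etcd static pod manifest.
#   $1 = kubeadm backup dir, e.g. /etc/kubernetes/tmp/kubeadm-backup-manifests180476754
#   $2 = static manifest dir, normally /etc/kubernetes/manifests
restore_etcd_manifest() {
  cp "$1/etcd.yaml" "$2/etcd.yaml"
}
```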
After retrying this 10 times, it finally worked.