kubeadm: Upgrading from 1.9.6 to 1.10.0 fails with timeout
BUG REPORT
Versions
kubeadm version (use `kubeadm version`):

```
kubeadm version: &version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-26T16:44:10Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
```
Environment:
- Kubernetes version (use `kubectl version`):

```
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:21:50Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:13:31Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
```
- Cloud provider or hardware configuration: Scaleway baremetal C2S
- OS (e.g. from /etc/os-release): Ubuntu Xenial (16.04 LTS) (GNU/Linux 4.4.122-mainline-rev1 x86_64)
- Kernel (e.g. `uname -a`): Linux amd64-master-1 4.4.122-mainline-rev1 #1 SMP Sun Mar 18 10:44:19 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
What happened?
Trying to upgrade from 1.9.6 to 1.10.0, I get this error:
```
kubeadm upgrade apply v1.10.0
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[upgrade/version] You have chosen to change the cluster version to "v1.10.0"
[upgrade/versions] Cluster version: v1.9.6
[upgrade/versions] kubeadm version: v1.10.0
[upgrade/confirm] Are you sure you want to proceed with the upgrade? [y/N]: y
[upgrade/prepull] Will prepull images for components [kube-apiserver kube-controller-manager kube-scheduler]
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.0"...
[etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests411909119/etcd.yaml"
[certificates] Generated etcd/ca certificate and key.
[certificates] Generated etcd/server certificate and key.
[certificates] etcd/server serving cert is signed for DNS names [localhost] and IPs [127.0.0.1]
[certificates] Generated etcd/peer certificate and key.
[certificates] etcd/peer serving cert is signed for DNS names [arm-master-1] and IPs [10.1.244.57]
[certificates] Generated etcd/healthcheck-client certificate and key.
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/etcd.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests180476754/etcd.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/apply] FATAL: fatal error when trying to upgrade the etcd cluster: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition], rolled the state back to pre-upgrade state
```
What did you expect to happen?
A successful upgrade.
How to reproduce it (as minimally and precisely as possible)?
Install the 1.9.6 packages and init a 1.9.6 cluster:

```shell
curl -fsSL https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
echo "deb http://apt.kubernetes.io/ kubernetes-xenial main" | tee /etc/apt/sources.list.d/kubernetes.list
apt-get update -qq
apt-get install -qy kubectl=1.9.6-00
apt-get install -qy kubelet=1.9.6-00
apt-get install -qy kubeadm=1.9.6-00
```
Edit the kubeadm-config ConfigMap and change `featureGates` from a string to a map, as reported in https://github.com/kubernetes/kubernetes/issues/61764:

```shell
kubectl -n kube-system edit cm kubeadm-config
```

```
....
featureGates: {}
....
```
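For context, the broken and fixed values look like this (a sketch based on the linked issue; the surrounding fields are elided):

```yaml
# before (a string, rejected by the 1.10 config parser):
#   featureGates: ""
# after (an empty map, parses correctly):
featureGates: {}
```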
Download kubeadm 1.10.0 and run `kubeadm upgrade plan` and `kubeadm upgrade apply v1.10.0`.
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 5
- Comments: 42 (20 by maintainers)
A temporary workaround is to ensure the etcd certs exist and to upgrade the etcd and apiserver pods manually, bypassing the checks.
Be sure to check your config and add any flags for your use case:
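A sketch of that workaround, assuming kubeadm v1.10's `alpha phase certs` subcommands and the default `/etc/kubernetes` paths; `kubeadm.yaml` is a placeholder for your own config file, and the exact phase names should be verified with `kubeadm alpha phase certs --help` before running:

```shell
# Sketch of the workaround: regenerate the etcd-related certs that kubeadm
# v1.10 expects, then force the upgrade past the failing checks.
etcd_tls_workaround() {
  kubeadm alpha phase certs etcd-ca --config kubeadm.yaml
  kubeadm alpha phase certs etcd-server --config kubeadm.yaml
  kubeadm alpha phase certs etcd-peer --config kubeadm.yaml
  kubeadm alpha phase certs etcd-healthcheck-client --config kubeadm.yaml
  kubeadm alpha phase certs apiserver-etcd-client --config kubeadm.yaml
  # --force skips some requirement checks; use with care
  kubeadm upgrade apply v1.10.0 --force
}
```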
@codepainters I think it is the same issue.
There are a few underlying problems causing this issue:
As a result, the upgrade currently succeeds only when a pod status update for the etcd pod happens to change the hash before the kubelet picks up the new static manifest for etcd. Additionally, the API server needs to remain available during the first part of the upgrade, while the upgrade tooling queries the API prior to updating the apiserver manifest.
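One way to observe by hand whether the kubelet has picked up a new static manifest (a sketch, not part of kubeadm; assumes `kubectl` access, and the node name comes from the log above):

```shell
# Sketch: print the kubelet's config hash annotation for the etcd mirror pod.
# The kubelet updates this annotation when it re-creates the pod from a
# changed static manifest, so watching it shows whether the new manifest
# was picked up.
show_etcd_pod_hash() {
  kubectl -n kube-system get pod "etcd-$1" \
    -o jsonpath='{.metadata.annotations.kubernetes\.io/config\.hash}'
}
# usage: show_etcd_pod_hash arm-master-1
```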
@renich just give it the filepath of your config
If you don’t use any custom settings, you can pass it an empty file. Here’s a simple way to do that in bash:
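For instance (a sketch; `--config` is kubeadm's flag for passing a configuration file):

```shell
# Create an empty config file and hand it to kubeadm.
CONFIG="$(mktemp)"
: > "$CONFIG"   # make sure it is empty
echo "empty config at: $CONFIG"
# then run: kubeadm upgrade apply v1.10.0 --config "$CONFIG"
```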
@kvaps @stealthybox this is most likely an `etcd` issue (kubeadm speaks plain HTTP/2 to a TLS-enabled `etcd`), I hit it too. See this other issue: https://github.com/kubernetes/kubeadm/issues/755

Honestly, I can't understand why the same TCP port is used for both the TLS and non-TLS `etcd` listeners; it only causes troubles like this one. Getting a plain, old connection refused would give an immediate hint; here I had to resort to `tcpdump` to understand what's going on.

This should now be resolved with the merging of https://github.com/kubernetes/kubernetes/pull/62655 and will be part of the v1.10.2 release.
Thanks @stealthybox. For me the `upgrade apply` process stalled on `[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.1"...`, however the cluster was successfully upgraded.

PR to address the static pod update race condition: https://github.com/kubernetes/kubernetes/pull/61942
Cherry-pick PR for the release-1.10 branch: https://github.com/kubernetes/kubernetes/pull/61954
@detiber and I got on a call to discuss changes we need to make to the upgrade process. We plan to implement 3 fixes for this bug in the 1.10.x patch releases:
1. Remove etcd TLS from the upgrade.
   - The current upgrade loop makes batch modifications per component in a serial manner.
   - Upgrading a component has no knowledge of dependent component configurations.
   - Verifying an upgrade requires the APIServer to be available to check the pod status.
   - Etcd TLS requires a coupled etcd+apiserver configuration change, which breaks this contract.
   - This is the minimum viable change to fix this issue, and it leaves upgraded clusters with insecure etcd.
2. Fix the mirror-pod hash race condition on pod status change.
   - https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/phases/upgrade/staticpods.go#L189
   - Upgrades will then be correct, assuming compatibility between the etcd and apiserver flags.
3. Upgrade TLS specifically in a separate phase.
   - Etcd and the APIServer need to be upgraded together.
   - `kubeadm alpha phase ensure-etcd-tls`?
   - This phase should be runnable independently of a cluster upgrade.
   - During a cluster upgrade, this phase should run before updating all of the components.
For 1.11 we want to:
- It's undesirable to rely on the apiserver and etcd for monitoring local processes like we are currently doing.
  - A local source of data about pods is superior to relying on higher-order distributed Kubernetes components.
  - This will replace the current pod runtime checks in the upgrade loop.
  - This will allow us to add checks to the ensure-etcd-tls phase.
- Alternative: use the CRI to get pod info (demo'd viable using `crictl`).
  - Caveat: CRI on dockershim and possibly other container runtimes does not currently support backward compatibility for CRI breaking changes.
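To illustrate the crictl alternative, a node-local check could be sketched like this (assumes `crictl` is installed and configured for the node's runtime socket; this is not an actual kubeadm command):

```shell
# Sketch: ask the CRI directly for the etcd pod sandbox, bypassing the apiserver.
check_local_etcd() {
  crictl pods --name etcd --state Ready
}
# usage (on a control-plane node): check_local_etcd
```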
I just hit another weird edge case related to this bug. The kubeadm upgrade marked the etcd upgrade as complete before the new etcd image had been pulled and the new static pod deployed. This causes the upgrade to time out at a later step and the rollback to fail, leaving the cluster in a broken state. Restoring the original etcd static pod manifest is needed to recover the cluster.
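For anyone hitting that broken state, a minimal recovery sketch: copy the backed-up manifest from the `kubeadm-backup-manifests*` directory (named in the upgrade log; the random suffix differs per run) back into the static manifest directory so the kubelet re-creates the pre-upgrade etcd pod.

```shell
# Sketch: restore a backed-up etcd static pod manifest.
#   $1 = kubeadm backup dir, e.g. /etc/kubernetes/tmp/kubeadm-backup-manifests180476754
#   $2 = static manifest dir, normally /etc/kubernetes/manifests
restore_etcd_manifest() {
  cp "$1/etcd.yaml" "$2/etcd.yaml"
}
```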
After retrying this 10 times, it finally worked.