kubernetes: kubeadm 1.8.0 init fails with "/var/lib/kubelet is not empty"

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened: On a fresh Ubuntu 16.04.3 system booted from the official cloud image, kubeadm init fails because /var/lib/kubelet exists.

root@kubemaster:~# kubeadm init
[kubeadm] WARNING: kubeadm is in beta, please do not use it for production clusters.
[init] Using Kubernetes version: v1.8.0
[init] Using Authorization modes: [Node RBAC]
[preflight] Running pre-flight checks
[preflight] Some fatal errors occurred:
        /var/lib/kubelet is not empty
[preflight] If you know what you are doing, you can skip pre-flight checks with `--skip-preflight-checks`

What you expected to happen: kubeadm successfully initializes the cluster

How to reproduce it (as minimally and precisely as possible):

  1. Boot a new VM from the latest Ubuntu Cloud image
  2. apt-get install -y apt-transport-https docker.io
  3. Follow the kubeadm installation instructions (a sketch of these steps follows the list)
  4. kubeadm init
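
For reference, a sketch of the commands behind steps 2-4, following the kubeadm installation docs of that era (the apt repository URL, key URL, and package names are taken from those docs, not from this report; versions are whatever apt resolves at the time):

# install the container runtime and prerequisites
apt-get update && apt-get install -y apt-transport-https docker.io
# add the Kubernetes apt repository and install the kubeadm packages
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
echo "deb http://apt.kubernetes.io/ kubernetes-xenial main" > /etc/apt/sources.list.d/kubernetes.list
apt-get update && apt-get install -y kubelet kubeadm kubectl
# bootstrap the control plane
kubeadm init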

Anything else we need to know?: Contents of /var/lib/kubelet:

/var/lib/kubelet
/var/lib/kubelet/pki
/var/lib/kubelet/pki/kubelet.crt
/var/lib/kubelet/pki/kubelet.key

Environment:

  • Kubernetes version (use kubectl version):
    root@kubemaster:~# kubectl version
    Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.0", GitCommit:"6e937839ac04a38cac63e6a7a306c5d035fe7b0a", GitTreeState:"clean", BuildDate:"2017-09-28T22:57:57Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
    The connection to the server localhost:8080 was refused - did you specify the right host or port?
    
    root@kubemaster:~# apt search kube
    Sorting... Done
    Full Text Search... Done
    kubeadm/kubernetes-xenial,now 1.8.0-01 amd64 [installed]
      Kubernetes Cluster Bootstrapping Tool
    
    kubectl/kubernetes-xenial,now 1.8.0-00 amd64 [installed,automatic]
      Kubernetes Command Line Tool
    
    kubelet/kubernetes-xenial,now 1.8.0-00 amd64 [installed,automatic]
      Kubernetes Node Agent
    
    kubernetes-cni/kubernetes-xenial,now 0.5.1-00 amd64 [installed,automatic]
      Kubernetes CNI
    
  • Cloud provider or hardware configuration: Hyper-V generation 1 virtual machine
  • OS (e.g. from /etc/os-release):
    NAME="Ubuntu"
    VERSION="16.04.3 LTS (Xenial Xerus)"
    ID=ubuntu
    ID_LIKE=debian
    PRETTY_NAME="Ubuntu 16.04.3 LTS"
    VERSION_ID="16.04"
    HOME_URL="http://www.ubuntu.com/"
    SUPPORT_URL="http://help.ubuntu.com/"
    BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
    VERSION_CODENAME=xenial
    UBUNTU_CODENAME=xenial
    
  • Kernel (e.g. uname -a): Linux kubemaster 4.4.0-96-generic #119-Ubuntu SMP Tue Sep 12 14:59:54 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: none
  • Others: none

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 7
  • Comments: 43 (21 by maintainers)

Most upvoted comments

Cause

This is related to the location where the kubelet persists its certificates while running in the background, waiting for config:

Since kubeadm expects a running kubelet prior to kubeadm init being called, it shouldn't expect the kubelet's --root-dir folder to be empty.

Workaround

If you are scripting the bootstrap of a known-clean machine, there are a few possible workarounds until https://github.com/kubernetes/kubernetes/pull/53317 is released in 1.8.1 (any of the following works; a shell sketch of the second option follows the list):

  • verify this is the only preflight check failure, then run the init or join command with --skip-preflight-checks=true
  • stop the kubelet service and remove /var/lib/kubelet/pki prior to running the init or join command
  • run kubeadm reset prior to running init or join
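
A minimal shell sketch of the second workaround on a systemd host. The explicit restart of the kubelet before init is an addition based on the comment below, which shows kubeadm init hanging when the kubelet is not running:

# stop the kubelet so it cannot recreate the pki directory in a race
systemctl stop kubelet
# remove only the self-signed serving certs the idle kubelet generated
rm -rf /var/lib/kubelet/pki
# kubeadm init waits on the kubelet health endpoint, so bring it back up first
systemctl start kubelet
kubeadm init   # or: kubeadm join ...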

Resolution

addressed as part of https://github.com/kubernetes/kubernetes/pull/53317

@liggitt @jpetazzo

If I wipe out the /var/lib/kubelet/pki directory and do not restart the kubelet, the kubeadm init process hangs with:

[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10255/healthz/syncloop' failed with error: Get http://localhost:10255/healthz/syncloop: dial tcp [::1]:10255: getsockopt: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10255/healthz' failed with error: Get http://localhost:10255/healthz: dial tcp [::1]:10255: getsockopt: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10255/healthz/syncloop' failed with error: Get http://localhost:10255/healthz/syncloop: dial tcp [::1]:10255: getsockopt: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10255/healthz' failed with error: Get http://localhost:10255/healthz: dial tcp [::1]:10255: getsockopt: connection refused.

Unfortunately, an error has occurred: timed out waiting for the condition

This error is likely caused by that:
- The kubelet is not running
- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
- There is no internet connection; so the kubelet can't pull the following control plane images:
  - gcr.io/google_containers/kube-apiserver-amd64:v1.8.0
  - gcr.io/google_containers/kube-controller-manager-amd64:v1.8.0
  - gcr.io/google_containers/kube-scheduler-amd64:v1.8.0

You can troubleshoot this for example with the following commands if you're on a systemd-powered system:
- 'systemctl status kubelet'
- 'journalctl -xeu kubelet'

When I check the status of the kubelet, it keeps restarting and never comes up because:

Oct 03 13:40:21 vagrant kubelet[15358]: I1003 13:40:21.489907   15358 controller.go:114] kubelet config controller: starting controller
Oct 03 13:40:21 vagrant kubelet[15358]: I1003 13:40:21.490020   15358 controller.go:118] kubelet config controller: validating combination o
Oct 03 13:40:21 vagrant kubelet[15358]: error: unable to load client CA file /etc/kubernetes/pki/ca.crt: open /etc/kubernetes/pki/ca.crt: no

So I am guessing my only option is to skip preflight checks; let me try that out…
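
(For reference, the flag printed by the preflight output above does exactly that. In 1.8.0 it skips all preflight checks, so it is worth confirming that the non-empty /var/lib/kubelet is the only failure first:)

kubeadm init --skip-preflight-checks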

OK, just to confirm: moving forward, ideally kubeadm init should not be checking for contents in /var/lib/kubelet (i.e. having contents is totally fine)?

Correct. The kubeadm instructions start the kubelet and let it run in a crash loop in the background, waiting for config. In that state, the kubelet is free to write to its directory containing state (/var/lib/kubelet), so kubeadm should not require that directory to be empty in order to run kubeadm init
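
A quick way to observe that crash loop on a systemd host, using the same commands kubeadm's error output suggests:

# before kubeadm init has written its config, the kubelet unit keeps restarting
systemctl status kubelet
# in this thread the loop surfaced as 'unable to load client CA file /etc/kubernetes/pki/ca.crt'
journalctl -xeu kubelet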

@wjrogers I did kubeadm reset && kubeadm init and it worked for me!

this is fixed in v1.8.1

Hi All,

Need help here

Currently I'm installing Kubernetes v1.8.0 using the command "kubeadm reset && kubeadm init", and it hangs. The systemctl status kubelet log shows:

Oct 04 20:15:02 cluster-1 kubelet[18203]: W1004 20:15:02.263263 18203 cni.go:196] Unable to update cni config: No networks found in /etc/cni/net.d
Oct 04 20:15:02 cluster-1 kubelet[18203]: E1004 20:15:02.263758 18203 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:Net…nitialized
Oct 04 20:15:02 cluster-1 kubelet[18203]: E1004 20:15:02.326988 18203 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.serv…t.service"
Oct 04 20:15:02 cluster-1 kubelet[18203]: E1004 20:15:02.327015 18203 summary.go:92] Failed to get system container stats for "/system.slice/docker.servi…r.service"
Oct 04 20:15:07 cluster-1 kubelet[18203]: W1004 20:15:07.265211 18203 cni.go:196] Unable to update cni config: No networks found in /etc/cni/net.d
Oct 04 20:15:07 cluster-1 kubelet[18203]: E1004 20:15:07.265869 18203 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:Net…nitialized
Oct 04 20:15:12 cluster-1 kubelet[18203]: W1004 20:15:12.267033 18203 cni.go:196] Unable to update cni config: No networks found in /etc/cni/net.d
Oct 04 20:15:12 cluster-1 kubelet[18203]: E1004 20:15:12.267222 18203 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:Net…nitialized
Oct 04 20:15:12 cluster-1 kubelet[18203]: E1004 20:15:12.332802 18203 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.serv…t.service"
Oct 04 20:15:12 cluster-1 kubelet[18203]: E1004 20:15:12.332830 18203 summary.go:92] Failed to get system container stats for "/system.slice/docker.servi…r.service"
Warning: kubelet.service changed on disk. Run 'systemctl daemon-reload' to reload units.

After that I tried to apply Weave Net, but it failed with this error:

kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
The connection to the server localhost:8080 was refused - did you specify the right host or port?
W1004 18:53:33.930337 16144 factory_object_mapping.go:423] Failed to download OpenAPI (Get http://localhost:8080/swagger-2.0.0.pb-v1: dial tcp [::1]:8080: getsockopt: connection refused), falling back to swagger
The connection to the server localhost:8080 was refused - did you specify the right host or port?

Can anyone please help me with this?

(Reposting my earlier tip since it worked like a charm for my automated deployment scripts.)

For the time being, before running kubeadm init or kubeadm join, I'll do the following (sketched in shell at the end of this comment):

  • check if /etc/kubernetes/kubelet.conf or /etc/kubernetes/admin.conf exists
  • if they don’t exist, stop kubelet (systemctl stop kubelet), and wipe out /var/lib/kubelet/pki (rm -rf /var/lib/kubelet/pki)

This might or might not be appropriate for your use-cases, but for mine it works great (I'm automatically deploying hundreds of k8s clusters for training purposes). Stopping kubelet is necessary to avoid race conditions where it would recreate the pki directory before you run kubeadm.
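
A rough shell rendering of that tip, using the paths from the bullets above (the guard and its structure are illustrative, not from the original comment):

# skip the cleanup if kubeadm has already configured this node
if [ ! -e /etc/kubernetes/kubelet.conf ] && [ ! -e /etc/kubernetes/admin.conf ]; then
    # stop the kubelet first so it cannot recreate pki/ before kubeadm runs
    systemctl stop kubelet
    rm -rf /var/lib/kubelet/pki
fi
# ...then run kubeadm init or kubeadm join as usual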

There are plenty of other pre-flight checks that are valuable and that you don't want to skip; it's just the check for an empty /var/lib/kubelet that is incorrect.

If you are scripting the bootstrap of a known-clean machine, there are a couple of possible workarounds until https://github.com/kubernetes/kubernetes/pull/53317 is released in 1.8.1:

  • run kubeadm init/kubeadm join skipping preflight checks
  • stop the kubelet service, remove /var/lib/kubelet/pki, then run kubeadm init/kubeadm join

Everything was done by following https://kubernetes.io/docs/setup/independent/create-cluster-kubeadm/, but now it is even getting better:

The HTTP call equal to 'curl -sSL http://localhost:10255/healthz' failed with error: Get http://localhost:10255/healthz: dial tcp [::1]:10255: getsockopt: connection refused

You should take 1.8 back to the drawing board. It is useless even to test this 1.8.