kubernetes: kubeadm 1.8.0 init fails with "/var/lib/kubelet is not empty"

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened: On a fresh Ubuntu 16.04.3 system booted from the official cloud image, kubeadm init fails because /var/lib/kubelet exists.

root@kubemaster:~# kubeadm init
[kubeadm] WARNING: kubeadm is in beta, please do not use it for production clusters.
[init] Using Kubernetes version: v1.8.0
[init] Using Authorization modes: [Node RBAC]
[preflight] Running pre-flight checks
[preflight] Some fatal errors occurred:
        /var/lib/kubelet is not empty
[preflight] If you know what you are doing, you can skip pre-flight checks with `--skip-preflight-checks`

What you expected to happen: kubeadm successfully initializes the cluster

How to reproduce it (as minimally and precisely as possible):

  1. Boot a new VM from the latest Ubuntu Cloud image
  2. apt-get install -y apt-transport-https docker.io
  3. Follow the kubeadm installation instructions (a sketch of these steps follows the list)
  4. kubeadm init
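
For reference, a sketch of the commands behind steps 2-4, following the kubeadm installation docs of that era (the apt repository URL, key URL, and package names are taken from those docs, not from this report; versions are whatever apt resolves at the time):

# install the container runtime and prerequisites
apt-get update && apt-get install -y apt-transport-https docker.io
# add the Kubernetes apt repository and install the kubeadm packages
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
echo "deb http://apt.kubernetes.io/ kubernetes-xenial main" > /etc/apt/sources.list.d/kubernetes.list
apt-get update && apt-get install -y kubelet kubeadm kubectl
# bootstrap the control plane
kubeadm init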

Anything else we need to know?: Contents of /var/lib/kubelet:

/var/lib/kubelet
/var/lib/kubelet/pki
/var/lib/kubelet/pki/kubelet.crt
/var/lib/kubelet/pki/kubelet.key

Environment:

  • Kubernetes version (use kubectl version):
    root@kubemaster:~# kubectl version
    Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.0", GitCommit:"6e937839ac04a38cac63e6a7a306c5d035fe7b0a", GitTreeState:"clean", BuildDate:"2017-09-28T22:57:57Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
    The connection to the server localhost:8080 was refused - did you specify the right host or port?
    
    root@kubemaster:~# apt search kube
    Sorting... Done
    Full Text Search... Done
    kubeadm/kubernetes-xenial,now 1.8.0-01 amd64 [installed]
      Kubernetes Cluster Bootstrapping Tool
    
    kubectl/kubernetes-xenial,now 1.8.0-00 amd64 [installed,automatic]
      Kubernetes Command Line Tool
    
    kubelet/kubernetes-xenial,now 1.8.0-00 amd64 [installed,automatic]
      Kubernetes Node Agent
    
    kubernetes-cni/kubernetes-xenial,now 0.5.1-00 amd64 [installed,automatic]
      Kubernetes CNI
    
  • Cloud provider or hardware configuration: Hyper-V generation 1 virtual machine
  • OS (e.g. from /etc/os-release):
    NAME="Ubuntu"
    VERSION="16.04.3 LTS (Xenial Xerus)"
    ID=ubuntu
    ID_LIKE=debian
    PRETTY_NAME="Ubuntu 16.04.3 LTS"
    VERSION_ID="16.04"
    HOME_URL="http://www.ubuntu.com/"
    SUPPORT_URL="http://help.ubuntu.com/"
    BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
    VERSION_CODENAME=xenial
    UBUNTU_CODENAME=xenial
    
  • Kernel (e.g. uname -a): Linux kubemaster 4.4.0-96-generic #119-Ubuntu SMP Tue Sep 12 14:59:54 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: none
  • Others: none

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 7
  • Comments: 43 (21 by maintainers)

Most upvoted comments

Cause

This is related to the location where the kubelet persists its certificates while running in the background, waiting for config:

Since kubeadm expects a running kubelet prior to kubeadm init being called, it shouldn't expect the kubelet's --root-dir folder to be empty.

Workaround

If you are scripting the bootstrap of a known-clean machine, there are a few possible workarounds until https://github.com/kubernetes/kubernetes/pull/53317 is released in 1.8.1 (any of the following works; a shell sketch of the second option follows the list):

  • verify this is the only preflight check failure, then run the init or join command with --skip-preflight-checks=true
  • stop the kubelet service and remove /var/lib/kubelet/pki prior to running the init or join command
  • run kubeadm reset prior to running init or join
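
A minimal shell sketch of the second workaround on a systemd host. The explicit restart of the kubelet before init is an addition based on the comment below, which shows kubeadm init hanging when the kubelet is not running:

# stop the kubelet so it cannot recreate the pki directory in a race
systemctl stop kubelet
# remove only the self-signed serving certs the idle kubelet generated
rm -rf /var/lib/kubelet/pki
# kubeadm init waits on the kubelet health endpoint, so bring it back up first
systemctl start kubelet
kubeadm init   # or: kubeadm join ...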

Resolution

addressed as part of https://github.com/kubernetes/kubernetes/pull/53317

@liggitt @jpetazzo

If I wipe out the /var/lib/kubelet/pki directory and do not restart the kubelet, the kubeadm init process hangs with:

[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10255/healthz/syncloop' failed with error: Get http://localhost:10255/healthz/syncloop: dial tcp [::1]:10255: getsockopt: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10255/healthz' failed with error: Get http://localhost:10255/healthz: dial tcp [::1]:10255: getsockopt: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10255/healthz/syncloop' failed with error: Get http://localhost:10255/healthz/syncloop: dial tcp [::1]:10255: getsockopt: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10255/healthz' failed with error: Get http://localhost:10255/healthz: dial tcp [::1]:10255: getsockopt: connection refused.

Unfortunately, an error has occurred: timed out waiting for the condition

This error is likely caused by that:
- The kubelet is not running
- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
- There is no internet connection; so the kubelet can't pull the following control plane images:
  - gcr.io/google_containers/kube-apiserver-amd64:v1.8.0
  - gcr.io/google_containers/kube-controller-manager-amd64:v1.8.0
  - gcr.io/google_containers/kube-scheduler-amd64:v1.8.0

You can troubleshoot this for example with the following commands if you're on a systemd-powered system:
- 'systemctl status kubelet'
- 'journalctl -xeu kubelet'

When I check the status of the kubelet, it keeps restarting and never comes up because:

Oct 03 13:40:21 vagrant kubelet[15358]: I1003 13:40:21.489907   15358 controller.go:114] kubelet config controller: starting controller
Oct 03 13:40:21 vagrant kubelet[15358]: I1003 13:40:21.490020   15358 controller.go:118] kubelet config controller: validating combination o
Oct 03 13:40:21 vagrant kubelet[15358]: error: unable to load client CA file /etc/kubernetes/pki/ca.crt: open /etc/kubernetes/pki/ca.crt: no

So I am guessing my only option is to skip preflight checks; let me try that out…
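
(For reference, the flag printed by the preflight output above does exactly that. In 1.8.0 it skips all preflight checks, so it is worth confirming that the non-empty /var/lib/kubelet is the only failure first:)

kubeadm init --skip-preflight-checks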

OK, just to confirm: moving forward, ideally kubeadm init should not be checking for contents in /var/lib/kubelet (i.e. having contents is totally fine)?

Correct. The kubeadm instructions start the kubelet and let it run in a crash loop in the background, waiting for config. In that state, the kubelet is free to write to its directory containing state (/var/lib/kubelet), so kubeadm should not require that directory to be empty in order to run kubeadm init
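
A quick way to observe that crash loop on a systemd host, using the same commands kubeadm's error output suggests:

# before kubeadm init has written its config, the kubelet unit keeps restarting
systemctl status kubelet
# in this thread the loop surfaced as 'unable to load client CA file /etc/kubernetes/pki/ca.crt'
journalctl -xeu kubelet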

@wjrogers I did kubeadm reset && kubeadm init and it worked for me!

this is fixed in v1.8.1

Hi All,

Need help here

Currently I'm installing Kubernetes v1.8.0 using the command "kubeadm reset && kubeadm init", and it hangs. The systemctl status kubelet log shows:

Oct 04 20:15:02 cluster-1 kubelet[18203]: W1004 20:15:02.263263 18203 cni.go:196] Unable to update cni config: No networks found in /etc/cni/net.d
Oct 04 20:15:02 cluster-1 kubelet[18203]: E1004 20:15:02.263758 18203 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:Net…nitialized
Oct 04 20:15:02 cluster-1 kubelet[18203]: E1004 20:15:02.326988 18203 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.serv…t.service"
Oct 04 20:15:02 cluster-1 kubelet[18203]: E1004 20:15:02.327015 18203 summary.go:92] Failed to get system container stats for "/system.slice/docker.servi…r.service"
Oct 04 20:15:07 cluster-1 kubelet[18203]: W1004 20:15:07.265211 18203 cni.go:196] Unable to update cni config: No networks found in /etc/cni/net.d
Oct 04 20:15:07 cluster-1 kubelet[18203]: E1004 20:15:07.265869 18203 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:Net…nitialized
Oct 04 20:15:12 cluster-1 kubelet[18203]: W1004 20:15:12.267033 18203 cni.go:196] Unable to update cni config: No networks found in /etc/cni/net.d
Oct 04 20:15:12 cluster-1 kubelet[18203]: E1004 20:15:12.267222 18203 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:Net…nitialized
Oct 04 20:15:12 cluster-1 kubelet[18203]: E1004 20:15:12.332802 18203 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.serv…t.service"
Oct 04 20:15:12 cluster-1 kubelet[18203]: E1004 20:15:12.332830 18203 summary.go:92] Failed to get system container stats for "/system.slice/docker.servi…r.service"
Warning: kubelet.service changed on disk. Run 'systemctl daemon-reload' to reload units.

After that I tried to apply Weave Net, but it failed with this error:

kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
The connection to the server localhost:8080 was refused - did you specify the right host or port?
W1004 18:53:33.930337 16144 factory_object_mapping.go:423] Failed to download OpenAPI (Get http://localhost:8080/swagger-2.0.0.pb-v1: dial tcp [::1]:8080: getsockopt: connection refused), falling back to swagger
The connection to the server localhost:8080 was refused - did you specify the right host or port?

Can anyone please help me with this?

(Reposting my earlier tip since it worked like a charm for my automated deployment scripts.)

For the time being, before running kubeadm init or kubeadm join, I'll do the following (sketched in shell at the end of this comment):

  • check if /etc/kubernetes/kubelet.conf or /etc/kubernetes/admin.conf exists
  • if they don’t exist, stop kubelet (systemctl stop kubelet), and wipe out /var/lib/kubelet/pki (rm -rf /var/lib/kubelet/pki)

This might or might not be appropriate for your use-cases, but for mine it works great (I'm automatically deploying hundreds of k8s clusters for training purposes). Stopping kubelet is necessary to avoid race conditions where it would recreate the pki directory before you run kubeadm.
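
A rough shell rendering of that tip, using the paths from the bullets above (the guard and its structure are illustrative, not from the original comment):

# skip the cleanup if kubeadm has already configured this node
if [ ! -e /etc/kubernetes/kubelet.conf ] && [ ! -e /etc/kubernetes/admin.conf ]; then
    # stop the kubelet first so it cannot recreate pki/ before kubeadm runs
    systemctl stop kubelet
    rm -rf /var/lib/kubelet/pki
fi
# ...then run kubeadm init or kubeadm join as usual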

There are plenty of other pre-flight checks that are valuable and that you don't want to skip; it's just the check for an empty /var/lib/kubelet that is incorrect.

If you are scripting the bootstrap of a known-clean machine, there are a couple of possible workarounds until https://github.com/kubernetes/kubernetes/pull/53317 is released in 1.8.1:

  • run kubeadm init/kubeadm join skipping preflight checks
  • stop the kubelet service, remove /var/lib/kubelet/pki, then run kubeadm init/kubeadm join

Everything was done by following https://kubernetes.io/docs/setup/independent/create-cluster-kubeadm/, but now it is even getting better:

The HTTP call equal to 'curl -sSL http://localhost:10255/healthz' failed with error: Get http://localhost:10255/healthz: dial tcp [::1]:10255: getsockopt: connection refused

You should take 1.8 back to the drawing board. It is useless even to test this 1.8.