rancher: Error: failed to start containers: kubelet

Rancher versions:

  • rancher/server or rancher/rancher: rancher/rancher:latest@sha256:38839bb19bdcac084a413a4edce7efb97ab99b6d896bda2f433dfacfd27f8770
  • rancher/agent or rancher/rancher-agent: rancher/rancher-agent:v2.0.0

Infrastructure Stack versions: whatever the defaults are

Docker version: (docker version, docker info preferred)

Client:
  Version: 17.03.2-ce
  API version: 1.27
  Go version: go1.8.3
  Git commit: f5ec1e2-snap-345b814
  Built: Thu Jun 29 23:40:29 2017
  OS/Arch: linux/amd64

Server:
  Version: 17.03.2-ce
  API version: 1.27 (minimum version 1.12)
  Go version: go1.8.3
  Git commit: f5ec1e2-snap-345b814
  Built: Thu Jun 29 23:40:29 2017
  OS/Arch: linux/amd64
  Experimental: false

Operating system and kernel: (cat /etc/os-release, uname -r preferred) 4.15.0-20-generic

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO) ssdnodes/custom cluster

Setup details: (single node rancher vs. HA rancher, internal DB vs. external DB) single node

Environment Template: (Cattle/Kubernetes/Swarm/Mesos) can't set up node

Steps to Reproduce:

snap install docker --channel=17.03/stable
mkdir /etc/kubernetes
sudo docker run -d --privileged --restart=unless-stopped --net=host \
  -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run \
  rancher/rancher-agent:v2.0.0 --server https://rancher.sixcorners.info \
  --token abc --ca-checksum xyz --worker
docker logs -f share-mnt

Results:

Found state.json: 931882e24ff0ef67b0e8744dbf1f7e04fd68afe714a29a2522293312824f3c51
time="2018-05-06T06:09:15Z" level=info msg="Execing [/usr/bin/nsenter --mount=/proc/21787/ns/mnt -F -- /var/snap/docker/common/var-lib-docker/aufs/mnt/5d00bd40adec6662aaec8ea2a5f5ce6a332e9dbfad087a008c5c89b7cac4c22f/usr/bin/share-mnt --stage2 /var/lib/kubelet /var/lib/rancher -- norun]"
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Reactions: 24
  • Comments: 91 (11 by maintainers)

Most upvoted comments

I have seen this bug many times in many different places/circumstances. The only way I’ve found to fix it is by destroying the cluster and recreating it.

Hi, I also had this error because my VMs had the same name. After changing the name and cleaning the entire VM (volume prune, system prune, etc.) everything went back to normal.
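For reference, a minimal sketch of that cleanup (the exact flags are my assumption; -f skips confirmation, -a also removes unused images, and the volume prune destroys any data left in unused volumes):

docker system prune -a -f    # remove stopped containers, unused networks and images
docker volume prune -f       # remove unused volumes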

I am seeing the same issue when attempting to re-register a node. The steps I use are:

  1. Create a custom cluster
  2. Register a node
  3. Delete the node from the rancher console
  4. Attempt to clean up the node with the following: https://pastebin.com/AuQUJiM4
  5. Reregister the node

In this case the rancher server reports:

2018/06/27 14:45:35 [ERROR] ClusterController c-v2998 [cluster-agent-controller] failed with : could not contact server: Get https://x.x.x.x:6443/version: dial tcp 127.0.0.1:6443: getsockopt: connection refused

The node has only 2 containers:

share-mnt:

Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet

quirky_hopper:

… time="2018-06-27T14:57:46Z" level=info msg="waiting for node to register"
time="2018-06-27T14:57:48Z" level=info msg="Starting plan monitor"

The only way to proceed at this point is to stop all activity on the node, delete the node from the Rancher UI, recreate the cluster, and re-register.

Running server 2.0.4 with only the one node, Docker version 17.03.2-ce on Ubuntu 16.04.

I am getting the same error. I just tried to deploy a cluster. All the masters that I started have the ‘share-mnt’ containers saying:

Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet

All I ran was the docker run script generated by rancher to start the nodes.

I hit this issue as well. In my case it started when the last node of my experimental cluster stopped working after I restarted into a new version of Docker and it couldn’t get up and running again. I managed to “fix” it without deleting and creating a new cluster (as I had to several times before when I encountered the same issue).

Here’s what I did:

  1. Cleaned the node by running
docker system prune
docker volume prune

Note that this will delete all the Docker volumes; take care if you have important data in your volumes.

  2. Cleaned Rancher/Kubernetes runtime data on the node.
rm -rf /etc/cni/ /etc/kubernetes/ /opt/cni/ /var/lib/calico/ /var/lib/cni/ /var/lib/rancher/ /var/run/calico/

Note that the official docs on node cleanup also recommend removing /opt/rke and /var/lib/etcd. I did NOT remove them because they contain the cluster's etcd snapshots and data. This is especially important when there is only one node in the cluster.

  3. I exec-ed into the rancher container and hacked the cluster status (thx @ibrokethecloud for the hint):
docker exec -it rancher bash

Inside the container:

apt-get update && apt-get -y install vim
kubectl edit cluster c-XXXX  # replace the cluster-id with an actual cluster ID

Now in the editor I found the key apiEndpoint (it should be directly under the status key) and removed it. Then I exited the editor and the container, making sure kubectl said that it updated the cluster. (A non-interactive alternative is sketched after this list.)

  4. From the Rancher UI I got the command for registering a new node. I set a different name for the node than it had before by adding --node-name to the docker run command (actually there's an edit box for this under advanced settings). It looked like this:
docker run -d --privileged --restart=unless-stopped --net=host \
  -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run rancher/rancher-agent:v2.2.6 \
  --server https://rancher.example.com --token XXXXXXXXXXXXXXX --node-name mynode2 \
  --etcd --controlplane --worker
  5. I ran the above command on the cleaned node and finally it registered successfully and RKE started up all the kube-* and kubelet containers.
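As a non-interactive alternative to the kubectl edit in step 3, something like the following might also work (a sketch only: c-XXXX is your cluster ID, and it assumes the embedded kubectl can patch the status field directly, which may not hold if the cluster CRD exposes a status subresource):

docker exec -it rancher kubectl patch cluster c-XXXX --type=json \
  -p='[{"op":"remove","path":"/status/apiEndpoint"}]'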

I also tried to just clean up the node and register it again, but it always ended up with the "Error: failed to start containers: kubelet".

Given the above, I think Rancher doesn't handle well the case where all the cluster nodes become unresponsive. In this case nodes can't even be removed from the cluster. When I tried to remove the faulty node it got stuck in the "Removing" state indefinitely, probably because the cluster's etcd couldn't be reached.

Turns out the cluster API definition shows that apiEndpoint and rkeConfig still contain the node that I had already deleted from the UI, hence the issue when re-adding the instance.

Rancher seems to think this node should already have the API server running, hence the error in share-mnt where it's trying to start kubelet.

I also got the same error. I have three machines, one for the Rancher server and the other two for node agents. I have now fixed the problem; the key point was that the nodes had the same name.

Rancher labels the node from its hostname, so I logged into the two node agent machines and used hostnamectl --static set-hostname <your_defined_hostname> to give the two machines different hostnames.

Now it runs OK.
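For example (the hostnames below are placeholders; run on each agent machine before re-running the registration command):

# on the first agent machine
hostnamectl --static set-hostname rancher-node-1
# on the second agent machine
hostnamectl --static set-hostname rancher-node-2
# verify the change
hostnamectl status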


the actual error comes from /usr/bin/share-root.sh

#!/bin/bash

# Derive this container's ID from its cgroup, then look up the image it runs
ID=$(grep :devices: /proc/self/cgroup | head -n1 | awk -F/ '{print $NF}' | sed -e 's/docker-\(.*\)\.scope/\1/')
IMAGE=$(docker inspect -f '{{.Config.Image}}' $ID)

# Share the mounts into the host, then loop until a container named "kubelet" can be started
docker run --privileged --net host --pid host -v /:/host --rm --entrypoint /usr/bin/share-mnt $IMAGE "$@" -- norun
while ! docker start kubelet; do
    sleep 2
done

I am not really sure if this is an issue with the logic, but no container named kubelet was ever created by the agent, and yet the agent is stuck trying to start it.
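As a quick check of what that loop is waiting on, you can look on the node for a container named kubelet and at the share-mnt logs (purely a diagnostic sketch):

docker ps -a --filter name=kubelet --format '{{.Names}}\t{{.Status}}'   # typically prints nothing in this failure mode
docker logs share-mnt 2>&1 | tail -n 20                                 # shows the retry loop output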

I got the same error while creating a new cluster and adding a node, and then resolved it!

Here is some info:

  • CentOS: 7.3.1611
  • kernel: 3.10.0-862
  • docker: 18.09.9
  • rancher: v2.2.2 (set up by rke)
  • rke: v0.2.4

Check steps (a combined check sketch follows this list):

  • Make sure /etc/hosts on all nodes (the local cluster's nodes and the new node) has a record for every node: vi /etc/hosts
  • Make sure each node (the local cluster's nodes and the new node) can ssh to the others without a password: ssh-keygen -t rsa && ssh-copy-id root@x.x.x.x
  • Make sure the system-default-registry (under Global -> Settings) does not end with '/'.
  • Make sure the private registry (under advanced options) does not end with '/'.
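A combined sketch of the first two checks (the node hostnames are placeholders):

for h in node1 node2 node3; do
    grep -q "$h" /etc/hosts || echo "missing /etc/hosts entry for $h"
    ssh -o BatchMode=yes root@"$h" true || echo "passwordless ssh to $h failed"
done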

You may need to clean up the new cluster and node after a failed attempt. Clean steps:

  • Delete the node in the Rancher UI and wait for it to disappear.
  • Delete the cluster in the Rancher UI and wait for it to disappear.
  • Run the script shown below (also linked here) to clean the node, then reboot.
# stop and remove all containers and volumes
docker ps -aq | xargs docker stop
docker ps -aq | xargs docker rm -v
docker volume rm $(sudo docker volume ls -q)
# unmount kubelet mounts, then remove Kubernetes/Rancher state directories
mount | grep '/var/lib/kubelet' | awk '{print $3}' | xargs umount
rm -rf /var/lib/etcd \
    /var/lib/cni \
    /var/run/calico \
    /etc/kubernetes/ssl \
    /etc/kubernetes/.tmp/ \
    /opt/cni \
    /var/lib/kubelet \
    /var/lib/rancher \
    /var/lib/calico

Finally, bring up the new cluster and GLHF! Please let me know if anything can be optimized, thanks!

FYI for anyone that has the issue, the solution above worked for me. I simply re-registered the same node again but with a different hostname. As soon as the new one came up, the deletion of the old node instance completed and it dropped out. Everything came back up at that point.

In my case, the problem was solved by changing the name. Thanks!

Happened to me few moments ago, cleaning helped. See my steps in https://github.com/rancher/rancher/issues/19882#issuecomment-501056386

I encountered the same error creating my first production rancher cluster on ubuntu 16.04. The solution was to stop and restart the rancher server. (Restarting the rancher agents had no effect).
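For a single-node Docker install of the Rancher server, that usually amounts to something like the following (finding the container by image is my assumption; check docker ps for the actual name):

docker ps --filter ancestor=rancher/rancher --format '{{.ID}}  {{.Names}}'   # find the server container
docker restart <rancher-server-container>                                    # restart it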

Ubuntu 16.04: after cleaning up the Rancher install info, the problem is solved.

#!/bin/bash
# remove all containers and volumes
docker rm -f $(docker ps -qa)
docker volume rm $(docker volume ls -q)
# unmount kubelet/rancher tmpfs mounts, then remove state directories
for mount in $(mount | grep tmpfs | grep '/var/lib/kubelet' | awk '{ print $3 }') /var/lib/kubelet /var/lib/rancher; do umount $mount; done
rm -rf /etc/ceph \
       /etc/cni \
       /etc/kubernetes \
       /opt/cni \
       /opt/rke \
       /run/secrets/kubernetes.io \
       /run/calico \
       /run/flannel \
       /var/lib/calico \
       /var/lib/etcd \
       /var/lib/cni \
       /var/lib/kubelet \
       /var/lib/rancher/rke/log \
       /var/log/containers \
       /var/log/pods \
       /var/run/calico

# inspect leftover network interfaces and iptables rules
ip address show
ifconfig -a
iptables -L -t nat
iptables -L -t mangle

@Just-Insane Never got it working. From what I could tell, the tmpfs is exhausted of inodes or something, to the extent the OS stopped working. Someone with more time on their hands can maybe sort it out for the rest of us, but I just threw a small HDD in each box and installed ROS there instead. Worked fine. Ultimately gave up on iPXE because of it, unfortunately, and ROS as well, because 99% of the value of it, to me at least, was to treat nodes as extremely dumb cattle.

I just created a new local VM with Ubuntu 18.04 to test if rancher 2.2.6 is ‘stable’ now. But I can’t add the first worker with all 3 roles on my single machine setup because of this issue.

I solved it by removing the hostname from the 127.0.0.1 line in /etc/hosts and adding the FQDN as well, so the machine now resolves its hostname and FQDN to the eth0 address rather than loopback.

In summary:

192.168.1.X hostname FQDN
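For illustration, the resulting /etc/hosts would look roughly like this (names and address are placeholders):

127.0.0.1     localhost
192.168.1.10  mynode1 mynode1.example.com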

This is quite a blocking issue and it is also happening to me. Unfortunately, with this it is not realistic to keep a production cluster in operation, since at some point nodes can't join anymore. Really sad about that, since I reckon Rancher is a good product.

Whenever you are trying to reuse nodes for Rancher server install or to add back into Rancher as a cluster, make sure you clean up per our recommendations:

https://rancher.com/docs/rancher/v2.x/en/removing-rancher/

I keep running into this issue continuously. I have tried deleting Rancher, deleting Docker, pruning the nodes, even reinstalling the operating system, but sometimes it works and sometimes it just keeps getting stuck at:


+ docker start kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet

Are there any permanent solutions known, as this is really frustrating…

I was trying to re-use a node to test some logic and ran into the same issue again.

I have recreated the vm and the rancher ui shows cluster is waiting for etcd and controlplane nodes to be registered.

But the node never registers and keeps waiting on kubelet to start.

The Rancher server, on the other hand, is still trying to connect to the API server on the node, even though the node has been cleanly removed and I cannot find it in the /v3/nodes API endpoint either.

E0117 01:13:06.082796 5 reflector.go:205] github.com/rancher/rancher/vendor/github.com/rancher/norman/controller/generic_controller.go:144: Failed to list *v1.Event: Get https://node1:6443/api/v1/events?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0117 01:13:06.083813 5 reflector.go:205] github.com/rancher/rancher/vendor/github.com/rancher/norman/controller/generic_controller.go:144: Failed to list *v1.ServiceAccount: Get https://node1:6443/api/v1/serviceaccounts?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect

I assume there is something internal to the Rancher server that maintains the cluster state in etcd and that is not cleaned up when the node is removed from the cluster.

I know the easiest workaround is to delete the cluster and recreate it, as that probably cleans up Rancher's etcd state.

Also interested in this discussion. Here is what led me to this thread.

My setup: I am running v1.12.3-rancher1-1 (experimental), custom cluster, self-hosted nodes. Nodes are proxmox LXC containers with apparmor disabled.

I start with a default Rancher install, a single cluster, and a single node deployed as etcd + controlplane + worker. The cluster provisions fine. Then I mangle the node and delete everything according to the cleanup rules (also removing more iptables rules with iptables -X; iptables -F -t nat; iptables -F -t mangle). At this point the cluster is broken. I cannot delete the node via the Rancher CLI or web UI, since it looks like it is trying to tell the node to delete itself.

This cluster is currently Unavailable; areas that interact directly with it will not be available until the API is ready.

Failed to communicate with API server: Get https://10.0.0.53:6443/api/v1/componentstatuses?timeout=30s: dial tcp 127.0.0.1:6443: connect: connection refused

As others mentioned above, modifying the hostname of the node works. After modifying the hostname and redeploying the agent, the "new" node comes up fine and the old node is removed, as long as you have marked the "old" node to be deleted. If not, then Rancher throws an error about two nodes with the same IP.

A couple of my questions:

  1. It looks like the rancher CLI is using the same API endpoints as the web ui, so there doesn’t seem to be a way to manually fix this hung node.
  2. Any chance of additional CLI options to the rancher-agent for recovering the node?
  3. I’ve seen loss of workloads in this case; I would have thought the etcd would be recovered from the rancher node.

Same for me: Ubuntu 16.04, Docker 17.03.2-ce. I have 2 hosts - Rancher plus a cluster being created with 2 machines - and get the same error. The Docker logs show: dockerd[2710]: time="2018-07-05T13:59:37.185566233Z" level=error msg="Handler for POST /v1.24/containers/kubelet/start returned error: No such container: kubelet"

+1 for seeing the same thing on a brand new Rancher 2.0 deployment with a new set of CentOS7 nodes

I’m not sure if bare metal or VM matters. I think I got similar errors when trying to deploy it on bare metal (my laptop)