rancher: Error: failed to start containers: kubelet
Rancher versions:
rancher/server or rancher/rancher: rancher/rancher:latest@sha256:38839bb19bdcac084a413a4edce7efb97ab99b6d896bda2f433dfacfd27f8770
rancher/agent or rancher/rancher-agent: rancher/rancher-agent:v2.0.0
Infrastructure Stack versions: whatever the defaults are
Docker version: (docker version, docker info preferred)
Client:
Version: 17.03.2-ce
API version: 1.27
Go version: go1.8.3
Git commit: f5ec1e2-snap-345b814
Built: Thu Jun 29 23:40:29 2017
OS/Arch: linux/amd64
Server:
Version: 17.03.2-ce
API version: 1.27 (minimum version 1.12)
Go version: go1.8.3
Git commit: f5ec1e2-snap-345b814
Built: Thu Jun 29 23:40:29 2017
OS/Arch: linux/amd64
Experimental: false
Operating system and kernel: (cat /etc/os-release, uname -r preferred)
4.15.0-20-generic
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
ssdnodes/custom cluster
Setup details: (single node rancher vs. HA rancher, internal DB vs. external DB)
single node
Environment Template: (Cattle/Kubernetes/Swarm/Mesos)
can’t set up node
Steps to Reproduce:
snap install docker --channel=17.03/stable
mkdir /etc/kubernetes
sudo docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run rancher/rancher-agent:v2.0.0 --server https://rancher.sixcorners.info --token abc --ca-checksum xyz --worker
docker logs -f share-mnt
Results:
Found state.json: 931882e24ff0ef67b0e8744dbf1f7e04fd68afe714a29a2522293312824f3c51
time="2018-05-06T06:09:15Z" level=info msg="Execing [/usr/bin/nsenter --mount=/proc/21787/ns/mnt -F -- /var/snap/docker/common/var-lib-docker/aufs/mnt/5d00bd40adec6662aaec8ea2a5f5ce6a332e9dbfad087a008c5c89b7cac4c22f/usr/bin/share-mnt --stage2 /var/lib/kubelet /var/lib/rancher -- norun]"
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
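For context, a quick way to confirm what the agent is complaining about is to check whether a kubelet container was ever created on the node. A minimal sketch, assuming the snap-installed Docker CLI is on the PATH:
# List every container (running or not) whose name contains "kubelet";
# on an affected node this comes back empty, which matches the daemon error above.
docker ps -a --filter "name=kubelet" --format "{{.Names}}\t{{.Status}}"
# The share-mnt container that produced the log above should still be listed.
docker ps -a --filter "name=share-mnt" --format "{{.Names}}\t{{.Status}}"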
I have seen this bug many times in many different places/circumstances. The only way I’ve found to fix it is by destroying the cluster and recreating it.
Hi, I also had this error because my VMs had the same name. After changing the name and cleaning the entire VM (volume prune, system prune etc.) everything went back to normal.
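For anyone following the same route, a minimal sketch of the cleanup part described above (destructive; only the generic prune commands the comment mentions, adjust to your VM):
# Remove unused volumes, then stopped containers, unused images and networks.
docker volume prune -f
docker system prune -a -f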
I am seeing the same issue when attempting to re-register a node. The steps I use are:
In this case the rancher server reports:
2018/06/27 14:45:35 [ERROR] ClusterController c-v2998 [cluster-agent-controller] failed with : could not contact server: Get https://x.x.x.x:6443/version: dial tcp 127.0.0.1:6443: getsockopt: connection refused
The node has only 2 containers:
share-mnt: Error response from daemon: {"message":"No such container: kubelet"} Error: failed to start containers: kubelet
quirky_hopper: … time="2018-06-27T14:57:46Z" level=info msg="waiting for node to register" time="2018-06-27T14:57:48Z" level=info msg="Starting plan monitor"
The only way to proceed at this point is to stop all activity on the node, delete the node from the rancher UI, and recreate the cluster and re-register.
Running server 2.0.4 with only the one node, Docker version 17.03.2-ce on Ubuntu 16.04.
I am getting the same error. I just tried to deploy a cluster. All the masters that I started have the ‘share-mnt’ containers saying:
All I ran was the docker run script generated by rancher to start the nodes.
I hit this issue as well. In my case it started when the last node of my experimental cluster stopped working after I restarted into a new version of Docker and it couldn’t get up and running again. I managed to “fix” it without deleting and creating a new cluster (as I had to several times before when I encountered the same issue).
Here's what I did:
1. Cleaned up the node. Note that this will delete all the Docker volumes, so take care if you have important data in your volumes. Note that the official docs on node cleanup recommend also removing /opt/rke and /var/lib/etcd. I did NOT remove them because they contain cluster etcd snapshots and data. This is especially important in case there's only one node in the cluster.
2. exec-ed into the rancher container and hacked the cluster status (thx @ibrokethecloud for the hint), inside the container (a sketch of this follows below).
3. In the editor I found the key apiEndpoint (it should be directly under the status key) and removed it. Exit the editor and container. Make sure kubectl says that it updated the cluster.
4. Added --node-name to the docker run command (actually there's an edit box for this under advanced settings).
5. After that the node came up and started the kube-* and kubelet containers.
I also tried to just clean up the node and register it again, but it always ended up with the "Error: failed to start containers: kubelet".
Given the above, I think Rancher doesn't handle well the case when all the cluster nodes become unresponsive. In this case nodes can't even be removed from the cluster. When I tried to remove the faulty node it got stuck in the "Removing" state indefinitely, probably because the cluster's etcd couldn't be reached.
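For reference, a hedged sketch of what steps 2 and 3 above can look like on a single-node Rancher install; the server container name and the cluster ID are placeholders, and the exact shape of the cluster object can differ between Rancher versions:
# Exec into the Rancher server container (container name is an example).
docker exec -it rancher-server bash
# Inside the container: open the management cluster object for editing
# (replace c-xxxxx with the cluster ID shown in the Rancher UI/URL).
kubectl edit clusters.management.cattle.io c-xxxxx
# In the editor, remove the apiEndpoint key directly under "status:", then save;
# kubectl should confirm that the cluster object was edited.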
Turns out the cluster API definition shows the apiEndpoint and rkeConfig still containing the node that I have already deleted from the UI, hence the issue while re-adding the instance again.
Rancher seems to think this node should already have the API server running, hence the error in share-mnt where it's trying to start the kubelet.
I also got the same error. I have three machines: one for the rancher server, the others for node agents. Now I've fixed the bug; the point is the same node name.
Rancher labels the node from the hostname, so I logged into the two node agent machines and used 'hostnamectl --static set-hostname <your_defined_hostname>' to give the two machines different hostnames.
Now it runs ok.
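A small sketch of that fix, assuming two agent machines and example hostnames:
# On the first agent machine:
sudo hostnamectl --static set-hostname rancher-agent-01
# On the second agent machine:
sudo hostnamectl --static set-hostname rancher-agent-02
# Confirm each machine now reports its own name before re-running the agent command:
hostnamectl status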
the actual error comes from /usr/bin/share-root.sh
I am not really sure if this is an issue with the logic, but no container named kubelet was created by the agent; however, the agent is stuck trying to start it up.
I got the same error while creating a new cluster and adding a node, and then resolved it!
Here is some info: CentOS 7.3.1611, kernel 3.10.0-862, docker 18.09.9, rancher v2.2.2 (set up by rke), rke v0.2.4.
Check steps:
1. Make sure /etc/hosts on all nodes (the local cluster's nodes and the new node) has a record for each node: vi /etc/hosts
2. Set up SSH key access to each node: ssh-keygen -t rsa and ssh-copy-id root@x.x.x.x (a sketch of these first two checks follows below).
3. Make sure system-default-registry (below Global -> Settings) does not end with '/'.
4. Make sure the private registry (under advanced options) does not end with '/'.
Maybe you need to clean up the new cluster and node after failed attempts. Clean steps:
Finally, bring up the new cluster and GLHF! Please let me know if anything can be optimized, thanks!
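A minimal sketch of the first two checks, assuming three nodes at example addresses 10.0.0.1-10.0.0.3 with example names (run as root):
# 1. Every node should have a record for every other node; append any missing entries.
cat >> /etc/hosts <<'EOF'
10.0.0.1 node1
10.0.0.2 node2
10.0.0.3 node3
EOF
# 2. Generate a key once, then copy it to each node so SSH access works.
ssh-keygen -t rsa
for ip in 10.0.0.1 10.0.0.2 10.0.0.3; do ssh-copy-id root@$ip; done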
FYI for anyone that has the issue, the solution above worked for me. I simply re-registered the same node again but with a different hostname. As soon as the new one came up, the deletion of the old node instance completed and it dropped out. Everything came back up at that point.
in my case, problem solved by changing name. thanks!
Happened to me few moments ago, cleaning helped. See my steps in https://github.com/rancher/rancher/issues/19882#issuecomment-501056386
I encountered the same error creating my first production rancher cluster on ubuntu 16.04. The solution was to stop and restart the rancher server. (Restarting the rancher agents had no effect).
Ubuntu 16.04: after cleaning the rancher install info, the problem is solved.
@Just-Insane Never got it working. From what I could tell, the tmpfs is exhausted of inodes or something, to the extent the OS stopped working. Someone with more time on their hands can maybe sort it out for the rest of us, but I just threw a small HDD in each box and installed ROS there instead. Worked fine. Ultimately gave up on iPXE because of it, unfortunately, and ROS as well, because 99% of the value of it, to me at least, was to treat nodes as extremely dumb cattle.
I just created a new local VM with Ubuntu 18.04 to test if rancher 2.2.6 is ‘stable’ now. But I can’t add the first worker with all 3 roles on my single machine setup because of this issue.
I solved it by removing the hostname from the 127.0.0.1 entry in /etc/hosts and also including the FQDN. So the machine now resolves the hostname and FQDN to the eth0 address, not the loopback.
In summary:
192.168.1.X hostname FQDN
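Roughly, the change is to stop mapping the hostname to the loopback address and map it to the eth0 address instead; the name, domain and address below are examples, the machine's real FQDN goes in their place:
# /etc/hosts before (hostname resolving to loopback):
#   127.0.0.1 localhost myhost
# /etc/hosts after (hostname and FQDN resolving to the eth0 address):
#   127.0.0.1    localhost
#   192.168.1.10 myhost myhost.example.com
# Verify the hostname no longer resolves to 127.0.0.1:
getent hosts "$(hostname)"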
This is quite a blocking issue and it is also happening to me. Unfortunately, with this it is not realistic to maintain a production cluster in operation, since at some point nodes can't join anymore. Really sad about that, though, since I reckon Rancher is a good product.
Whenever you are trying to reuse nodes for Rancher server install or to add back into Rancher as a cluster, make sure you clean up per our recommendations:
https://rancher.com/docs/rancher/v2.x/en/removing-rancher/
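As a rough illustration only (the linked docs are authoritative and more complete), the cleanup boils down to removing the leftover containers, volumes and state directories that earlier comments in this thread mention; whether /var/lib/etcd should go depends on whether you still need the cluster's etcd data:
# Destructive: removes every container and unused volume on the node.
docker rm -f $(docker ps -aq)
docker volume prune -f
# Remove node-local Kubernetes/Rancher state (paths as referenced in this thread).
sudo rm -rf /etc/kubernetes /opt/rke /var/lib/rancher /var/lib/kubelet /var/lib/etcd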
I keep running into this issue continuously. I have tried deleting rancher, deleting docker, pruning the nodes, even reinstalling the operating system, but sometimes it works and sometimes it just keeps getting stuck at
Are there any permanent solutions known, as this is really frustrating…
I was trying to re-use a node to test some logic and ran into the same issue again.
I have recreated the VM and the rancher UI shows the cluster is waiting for etcd and controlplane nodes to be registered.
But the node never registers and keeps waiting on kubelet to start.
The rancher server, on the other hand, is still trying to connect to the API server on the node, even though the node has been cleanly removed and I cannot find it in the /v3/nodes API endpoint either.
E0117 01:13:06.082796 5 reflector.go:205] github.com/rancher/rancher/vendor/github.com/rancher/norman/controller/generic_controller.go:144: Failed to list *v1.Event: Get https://node1:6443/api/v1/events?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0117 01:13:06.083813 5 reflector.go:205] github.com/rancher/rancher/vendor/github.com/rancher/norman/controller/generic_controller.go:144: Failed to list *v1.ServiceAccount: Get https://node1:6443/api/v1/serviceaccounts?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
I assume there is something internal to the rancher server where it's maintaining the cluster state in etcd, which has not been cleaned up when the node is removed from the cluster.
I know the easiest workaround is to delete the cluster and recreate it, as that probably cleans up the rancher etcd state.
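For what it's worth, one way to double-check what the Rancher API still knows about the node; the server URL and API token are placeholders, Rancher API keys are passed as basic auth, and field names may vary slightly by version:
# List the nodes Rancher currently tracks and their state.
curl -sk -u "token-xxxxx:secret" https://rancher.example.com/v3/nodes | jq '.data[] | {id, hostname, state}'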
Also interested in this discussion. Here is what led me to this thread.
My setup: I am running v1.12.3-rancher1-1 (experimental), custom cluster, self-hosted nodes. Nodes are Proxmox LXC containers with apparmor disabled.
I start with a default rancher, single cluster, and create a single node deployed as etcd + controlplane + worker. The cluster provisions fine. Then I mangle the node and delete everything according to the cleanup rules (also removing more iptables rules with iptables -X; iptables -F -t nat; iptables -F -t mangle). At this point the cluster is broken. I cannot delete the node in the rancher CLI or web UI, since it looks like it is trying to tell the node to delete itself.
As others mentioned above, modifying the hostname of the node works. After modifying the hostname and redeploying the agent, the "new" node comes up fine and the old node is removed, as long as you have marked the "old" node to be deleted. If not, then rancher throws an error about two nodes with the same IP.
A couple of my questions:
Same for me. Ubuntu 16.04, Docker 17.03.2-ce. I have 2 hosts - rancher + a cluster being created with 2 machines - and I get the same error. Logs from docker show: dockerd[2710]: time="2018-07-05T13:59:37.185566233Z" level=error msg="Handler for POST /v1.24/containers/kubelet/start returned error: No such container: kubelet"
+1 for seeing the same thing on a brand new Rancher 2.0 deployment with a new set of CentOS7 nodes
I’m not sure if bare metal or VM matters. I think I got similar errors when trying to deploy it on bare metal (my laptop)