rancher: New nodes cannot join cluster: Operation cannot be fulfilled on clusters.management.cattle.io "c-xxx": the object has been modified

What kind of request is this (question/bug/enhancement/feature request): bug

Steps to reproduce (least amount of steps as possible):

On an existing custom cluster add a worker node to the cluster with the supplied docker run command
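
For reference, the supplied registration command looks roughly like the following; the server URL, token, and checksum below are placeholders, not values taken from this report:

sudo docker run -d --privileged --restart=unless-stopped --net=host \
  -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run \
  rancher/rancher-agent:v2.3.0 \
  --server https://rancher.example.com --token <registration-token> \
  --ca-checksum <checksum> --worker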

Other reproduction steps are currently unknown

Result:

Errors are observed in the rancher-agent container

2019-12-05T03:17:13.749500476Z time="2019-12-05T03:17:13Z" level=error msg="Failed to connect to proxy. Response status: 200 - 200 OK. Response body: Operation cannot be fulfilled on nodes.management.cattle.io \"m-xxx\": the object has been modified; please apply your changes to the latest version and try again" error="websocket: bad handshake"

Environment information

  • Rancher version (rancher/rancher or rancher/server image tag, or shown bottom left in the UI): 2.3.0
  • Installation option (single install/HA): HA

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): Custom

  • Machine type (cloud/VM/metal) and specifications (CPU/memory): VM/metal

  • Kubernetes version (use kubectl version): 1.13.5 (Rancher), 1.16.1 (downstream)

  • Docker version (use docker version): 18.09.9

gzrancher/rancher#10231

About this issue

  • State: open
  • Created 5 years ago
  • Reactions: 16
  • Comments: 43 (6 by maintainers)

Most upvoted comments

My issue is identical to @Ethyling’s, except I’m using Terraform and destroying the nodes and creating fresh ones each time. I get the proxy/websocket error and then the “No such container: kubelet” error in a loop. It is completely random, with no rhyme or reason to it. I can run my Terraform deployment 10 times in a row and sometimes all nodes will register each time; I can run it another 10 times and have nodes fail to register on each run.

Facing the same issue on v2.4.1 with a custom cluster using the Rancher2 Terraform provider.

Initially, I can successfully create a new cluster with one master and multiple worker nodes by running the docker join command on each node. When I increase the number of worker nodes and reapply the Terraform manifest, the new worker nodes can’t join the cluster:

time="2020-04-09T09:53:00Z" level=error msg="Failed to connect to proxy. Response status: 200 - 200 OK. Response body: Operation cannot be fulfilled on nodes.management.cattle.io \"m-bace3de96714\": the object has been modified; please apply your changes to the latest version and try again" error="websocket: bad handshake"

We have a fix in 2.4, but since this is not easily reproducible, we want to keep the issue open to see if people are still hitting it post-2.4.

So I may have found a way to make it work. To give more context, I am using Ansible to manage my nodes and Terraform to manage Rancher. While creating a cluster, I run an Ansible role that runs Terraform, gets the docker command to register the nodes, runs it on each node, and then uses Terraform again once the cluster is ready to deploy apps like an ingress controller. The effect of that is that my nodes are all registered at the exact same time (or at least very close to it). Now I am registering my nodes one by one (with about 15-20s between each) and it seems to be working. I can’t confirm whether it is actually fixed or whether I have just been lucky, but it has worked 3 times in a row. Maybe that can help you find something, @StrongMonkey?
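
For illustration, the staggered registration could be sketched as below, assuming the docker join command has been copied into a JOIN_CMD variable and the node hostnames are reachable over SSH; the hostnames, agent version, token, and 20-second delay are placeholders, not the exact setup described above:

# Join command copied from the Rancher UI / Terraform output (server, token and checksum are placeholders).
JOIN_CMD='sudo docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run rancher/rancher-agent:v2.4.1 --server https://rancher.example.com --token <token> --ca-checksum <checksum> --worker'

# Register the nodes one at a time instead of all at once.
for node in node1 node2 node3; do
  ssh "$node" "$JOIN_CMD"
  sleep 20   # ~15-20s gap so the registrations do not hit Rancher simultaneously
done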

Hi,

I have an issue similar to this one. When I delete a cluster and try to create a new one using the same nodes, some nodes fail to join the cluster. I follow the official documentation on cleaning up the nodes before recreating the cluster. On the failing nodes, the rancher-agent container starts and fails to register the node with the following message:

time="2020-06-11T15:42:29Z" level=error msg="Failed to connect to proxy. Response status: 200 - 200 OK. Response body: Operation cannot be fulfilled on nodes.management.cattle.io \"m-2a05ebe1536e\": the object has been modified; please apply your changes to the latest version and try again" error="websocket: bad handshake"

About one minute later, another container is created and tries to run the share-root.sh script; this container then tries to start the kubelet without any success:

+ docker start kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet

Another rancher-agent container is created, and it seems to do nothing.

I don’t know how to solve this issue or where to find more logs or hints about this problem. Could we have some help with this? Thank you

@Vacant0mens when removing a node, is it being removed from the UI and rebooted before being re-added?

If it was removed because it was unreachable, could I suggest performing a manual clean-up as described in the link above; you can also use this script.
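
For reference, the manual clean-up generally boils down to something like the sketch below, run on the node itself; this is a condensed, illustrative version of the documented steps, and the exact paths can vary between Rancher versions:

# Stop and remove all containers and leftover volumes from the previous cluster.
docker rm -f $(docker ps -qa)
docker volume rm $(docker volume ls -q)

# Unmount anything the kubelet left mounted, then remove the state directories.
for m in $(mount | awk '/\/var\/lib\/kubelet/ {print $3}'); do sudo umount "$m"; done
sudo rm -rf /etc/kubernetes /etc/cni /opt/cni /opt/rke /var/lib/etcd \
  /var/lib/cni /var/lib/kubelet /var/lib/rancher

# A reboot clears leftover iptables rules, routes, and interfaces.
sudo reboot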

I discovered via a thread on the Rancher Slack that my issue was a result of not registering nodes of all the types (etcd, controlplane, and worker). I was initially trying just etcd and controlplane, but you need a worker as well for provisioning to commence at all. Maybe that helps someone else.

@mcmcghee I tried running docker system prune -a -f while cleaning up the node, and it still seems to be failing 😕

I am using the Terraform Rancher provider too, but I am registering my nodes by running the docker run command rather than using Terraform to register them. I can observe that when one node joins the cluster without issue and another one fails to join, if I destroy the cluster, clean the nodes, and set the cluster up again with the same nodes, the one that was failing will join without issue and the one that was succeeding will fail.

EDIT: I tried adding a random ID to my node names using the --node-name option, but it does not solve the issue. I don’t know if it could help.
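
For context, that amounts to appending --node-name with a unique suffix to the registration command, roughly like the sketch below; the agent version, server URL, and suffix generation are illustrative, not the exact values used:

sudo docker run -d --privileged --restart=unless-stopped --net=host \
  -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run \
  rancher/rancher-agent:v2.4.1 \
  --server https://rancher.example.com --token <token> --ca-checksum <checksum> \
  --worker --node-name "worker-$(openssl rand -hex 4)"   # random suffix per node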

I also recently encountered this on v2.4.2. Even after having the cluster up for just 1 hour, my audit-log.json on the master is >105 MB.

@Ethyling @StrongMonkey To give you some background on how I do it: I create my VM templates with Packer and pre-pull the container images for the Rancher version I’m deploying, so I have a VM template for Rancher v2.4.4, v2.4.5, and so on. What I’m seeing is that if the container images already exist (pre-pulled) on the node before I run the join command, then I get failures. If the images do not exist and the agent has to fetch them, I don’t get any failures.

These are the tests I have run. Each test I ran multiple times due to the randomness of this bug.

  • If I deploy a cluster using a template with matching pre-pulled images, I run into this issue immediately and consistently. (e.g. template w/ v2.4.5 images and deploying Rancher v2.4.5)
  • If I run the exact same deployment as above, same template and Rancher version, but run docker system prune -a -f before the join command, I don’t have the issue.
  • If I deploy a cluster using a template without matching pre-pulled images, I don’t have the issue. (e.g. template w/ v2.4.4 images and deploying Rancher v2.4.5)
  • If I deploy a cluster using a template without any pre-pulled images, I don’t have the issue.

@Ethyling I know you’re already cleaning your nodes between runs; are you running docker system prune -a -f as well?

I think there could be a few possible reasons for what I’m seeing:

  1. When the images are pre-pulled, maybe the Rancher server is getting overloaded by the number of join requests hitting it all at the same time? When the images are not pre-pulled, maybe the server does not get overloaded because the requests are staggered by each node having to fetch the images first.
  2. The Rancher agent doesn’t always handle the case where the images are already pre-pulled.
  3. None of the above, it’s still random and all of my tests are just coincidences.