k3s: Can't reboot/recreate master nodes due to the existing passwd file

Environmental Info: K3s Version: 1.22.5+k3s-1

Node(s) CPU architecture, OS, and Version: 2 x amd64 Ubuntu 20.04.3

Kernel: 5.4.0-92-generic

Cluster Configuration: 2 master nodes with MySQL backend

Describe the bug: After upgrading the cluster from 1.21.4 to 1.21.8 and finally to 1.22.5, the cluster is healthy and running with both nodes. But after rebooting a node, k3s won’t start anymore and I get this error:

Jan 11 11:18:33 raseed-test-server-1 k3s[13039]: time="2022-01-11T11:18:33Z" level=fatal msg="/var/lib/rancher/k3s/server/cred/passwd newer than datastore and could cause cluster outage. Remove the file from disk and restart to be recreated from datastore." 

The file exists on disk:

# ls -l /var/lib/rancher/k3s/server/cred/passwd
-rw------- 1 root root 111 Jan 10 13:59 /var/lib/rancher/k3s/server/cred/passwd

Sure, I can remove the file and k3s will start, but that is not very practical when it comes to automation.

How can I prevent the error, and which database entry is the file compared against in that function?

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 41 (20 by maintainers)

Most upvoted comments

@brandond it seems

delete from kine where name like '/bootstrap%';

solves the problem. The bootstrap data is re-created after k3s starts and stays in sync after later restarts as well.
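Before running a delete like the one above, it may help to list which rows would match. Here is a minimal sketch in Python, assuming a SQLite datastore. The in-memory table is a simplified stand-in (only `name` and `value` columns, not the real kine schema), the key names are made up, and the helper `list_and_delete_bootstrap` is hypothetical.

```python
import sqlite3


def list_and_delete_bootstrap(conn, prefix="/bootstrap", dry_run=True):
    """Return the names of rows matching the prefix; delete them unless dry_run."""
    pattern = prefix + "%"
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM kine WHERE name LIKE ? ORDER BY name", (pattern,)
        )
    ]
    if not dry_run:
        # The connection context manager commits on success, rolls back on error.
        with conn:
            conn.execute("DELETE FROM kine WHERE name LIKE ?", (pattern,))
    return names


# Demo against a simplified, in-memory stand-in for the kine table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE kine (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, value BLOB)"
)
conn.executemany(
    "INSERT INTO kine (name, value) VALUES (?, ?)",
    [
        ("/bootstrap/abc123", b"encrypted-blob"),      # made-up key
        ("/registry/pods/default/demo", b"pod-data"),  # made-up key
    ],
)
conn.commit()

matched = list_and_delete_bootstrap(conn, dry_run=True)   # only lists
rows_after_dry_run = conn.execute("SELECT COUNT(*) FROM kine").fetchone()[0]
list_and_delete_bootstrap(conn, dry_run=False)            # actually deletes
remaining = [row[0] for row in conn.execute("SELECT name FROM kine")]
```

Against a real datastore you would stop k3s and take a backup first. The dry run is there so the match list can be checked before anything is removed.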

I did the same (deleted the bootstrap data) on my SQLite database and everything seems to work now, even after the 1.21.9 upgrade.

Just to summarize:

  • the bootstrap data in the DB contains a token
  • this data should be encrypted with that same token; for some reason, in our environments it is encrypted with a different token
  • bootstrapping nodes and re-joining the cluster work with the newer token, because the bootstrap data can be decrypted with it (as in k3s-dump-bootstrap)
  • the new token is written to disk
  • bootstrapping with the original token fails with "starting kubernetes: preparing server: bootstrap data already found and encrypted with different token"
  • restarting k3s fails because the content of the passwd file on disk differs from the token in the bootstrap data in the DB

So it seems we need functionality to update the bootstrap data in the DB with the newer token, either as an emergency tool or as a regular operation.

I reviewed the deploy history of this cluster, and since v1.21.3+k3s1, when the token value was introduced, we have had almost the same value.

Almost the same unfortunately isn’t good enough. The token value cannot currently be changed once the cluster is created - it must be set initially and remain set to the same value across all servers in the cluster.

Is this content in the DB maybe auto-generated?

If a token is not specified in the configuration, a token is randomly generated and written to disk. I think that what probably happened is that the token was changed at some point, but until we added bootstrap data reconciliation, it didn’t matter since the data was already on disk. Now that we attempt to reconcile, the change is causing problems.

One option is to extend your dump app with an encrypt function and update the entry in the DB with the new token.

You may be right; resolving this may require creating a tool, or adding functionality to K3s itself, to allow authoritatively setting the token and triggering a reset of the bootstrap data using the current on-disk state of a server.

Hmm, are you sure they’re exactly the same? The file timestamps are only checked if the content is different, so if they are in fact the same it shouldn’t care what the timestamp is.

https://github.com/k3s-io/k3s/blob/f662a7f45b95a2bbf2d55af63761e62d4e2731c0/pkg/cluster/bootstrap.go#L470-L477
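The check described above can be sketched roughly as follows. This is an illustration in Python, not the code at the linked bootstrap.go: the names `FileState` and `reconcile` are hypothetical, and the "replace from datastore" branch is an assumption about what happens when the datastore copy is the newer one.

```python
from dataclasses import dataclass


@dataclass
class FileState:
    content: bytes
    mtime: float  # modification time, seconds since the epoch


def reconcile(path, disk, datastore):
    """Compare content first; timestamps only matter when the content differs."""
    if disk.content == datastore.content:
        return "in-sync"  # identical content: timestamps are never consulted
    if disk.mtime > datastore.mtime:
        # Corresponds to the fatal error quoted in the report.
        raise RuntimeError(
            f"{path} newer than datastore and could cause cluster outage"
        )
    return "replace-from-datastore"  # assumed behavior, see lead-in


path = "/var/lib/rancher/k3s/server/cred/passwd"

# Same content, newer file on disk: no error.
same = reconcile(path, FileState(b"a", 200.0), FileState(b"a", 100.0))

# Different content, older file on disk: datastore copy wins (assumed).
older = reconcile(path, FileState(b"a", 50.0), FileState(b"b", 100.0))

# Different content, newer file on disk: the fatal case.
try:
    reconcile(path, FileState(b"a", 200.0), FileState(b"b", 100.0))
    failed = False
except RuntimeError:
    failed = True
```

If a datastore entry carried no timestamp and that were treated as zero (again an assumption), any differing file on disk would look newer and land in the fatal branch.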

Yeah, that indicates that the data in the database hasn’t been migrated to the newer format that includes the timestamp yet. Can you confirm that the contents of /var/lib/rancher/k3s/server/cred/passwd on both nodes is the same, and matches the contents of the PasswdFile entry in your datastore?
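One way to compare the copies without pasting credentials into a ticket is to compare digests. Here is a minimal sketch, assuming you can read the file on each node and have already extracted the datastore entry by some other means; the helpers `fingerprint` and `file_fingerprint` are hypothetical, and the demo uses a temporary file with placeholder bytes rather than a real passwd file.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def fingerprint(data: bytes) -> str:
    """SHA-256 hex digest of the raw bytes; whitespace differences count."""
    return hashlib.sha256(data).hexdigest()


def file_fingerprint(path) -> str:
    """Digest of a file's full contents."""
    return fingerprint(Path(path).read_bytes())


# Demo with a temporary file standing in for the passwd file (placeholder data).
disk_bytes = b"placeholder-line-1\nplaceholder-line-2\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(disk_bytes)
    tmp_path = tmp.name

on_disk = file_fingerprint(tmp_path)
same_entry = fingerprint(disk_bytes)                   # identical datastore copy
changed_entry = fingerprint(disk_bytes + b"extra\n")   # one extra line
os.unlink(tmp_path)

matches = on_disk == same_entry
differs = on_disk != changed_entry
```

Run on each node against /var/lib/rancher/k3s/server/cred/passwd, this gives one short string per copy that is safe to post and compare.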

Is there anything unique to your environment that would cause this file to be modified? This is raising an error because it’s not expected that it would be out of sync after a restart.

cc @briandowns