k3s: Server doesn't start after upgrade to v1.21.7+k3s1 (ac705709): bootstrap data already found and encrypted with different token

Environmental Info: K3s Version: v1.21.7+k3s1 (ac705709)

Node(s) CPU architecture, OS, and Version:

Fedora Linux 35; kernel: 5.15.4-201.fc35.x86_64 #1 SMP Tue Nov 23 18:54:50 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

I have 2 servers using an external PostgreSQL database and a couple of agents. There’s a system-upgrade-controller configured to follow the channel https://update.k3s.io/v1-release/channels/stable.
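
For context, the channel URL resolves to the latest release for that channel via an HTTP redirect. One way to check what it currently points at (illustrative command, not taken from the original report; assumes curl is available):

# Print the release the stable channel currently points at; the channel
# server replies with a redirect to the matching GitHub release tag.
curl -s -o /dev/null -w '%{redirect_url}\n' \
    https://update.k3s.io/v1-release/channels/stable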

Describe the bug:

Today an auto-upgrade from v1.21.5+k3s2 to v1.21.7+k3s1 happened on one server, and it stopped responding. The service log shows the following failure:

k3s[1644879]: time="2021-12-04T11:17:49.379252883+01:00" level=info msg="Starting k3s v1.21.7+k3s1 (ac705709)"
k3s[1644879]: time="2021-12-04T11:17:49.420438206+01:00" level=info msg="Configuring postgres database connection pooling: maxIdleConns=2, maxOpenConns=0, connMaxLifetime=0s"
k3s[1644879]: time="2021-12-04T11:17:49.420639983+01:00" level=info msg="Configuring database table schema and indexes, this may take a moment..."
k3s[1644879]: time="2021-12-04T11:17:49.423536941+01:00" level=info msg="Database tables and indexes are up to date"
k3s[1644879]: time="2021-12-04T11:17:49.448156983+01:00" level=info msg="Kine listening on unix://kine.sock"
k3s[1644879]: time="2021-12-04T11:17:49.480807423+01:00" level=fatal msg="starting kubernetes: preparing server: bootstrap data already found and encrypted with different token"
systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE

I have verified that /var/lib/rancher/k3s/server/token is the same on both servers. After downgrading the k3s binary back to v1.21.5+k3s2, the server is able to start again.
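
A sketch of one way to double-check the token files and pin the binary back to the previous release, assuming the two servers are reachable as server-1 and server-2 (placeholder hostnames) and the standard install script is used:

# Compare token file hashes on both servers (server-1/server-2 are
# placeholder hostnames for the two k3s servers).
for host in server-1 server-2; do
    ssh "$host" sha256sum /var/lib/rancher/k3s/server/token
done

# Reinstall a pinned release with the official install script; the original
# server flags (--tls-san, --disable, --flannel-backend) must be repeated
# unless they are kept in a config file.
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.21.5+k3s2 sh -s - server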

Steps To Reproduce:

  • Installed K3s: Cluster installed 342 days ago using:

    export INSTALL_K3S_COMMIT=fadc5a8057c244df11757cd47cc50cc4a4cf5887
    export K3S_DATASTORE_ENDPOINT='postgres://[…]'
    ./k3s-install server \
        --tls-san api.chi.pipebreaker.pl \
        --disable traefik \
        --flannel-backend=wireguard

The cluster was then auto-upgraded by the system-upgrade-controller (stable channel) up to v1.21.5+k3s2.
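
If the controller runs in its default system-upgrade namespace, the version each plan resolved from its channel can be read from the Plan status (illustrative command; the controller records the resolved tag in status.latestVersion):

# List upgrade plans together with the version each one resolved from its channel.
kubectl -n system-upgrade get plans.upgrade.cattle.io \
    -o custom-columns=NAME:.metadata.name,VERSION:.status.latestVersion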

Expected behavior:

The server should start after the upgrade to v1.21.7+k3s1.

Actual behavior:

Startup failed with level=fatal msg="starting kubernetes: preparing server: bootstrap data already found and encrypted with different token"

Did the handling of bootstrap data encryption change between v1.21.5+k3s2 and v1.21.7+k3s1?

Additional context / logs:

There are corresponding logs from PostgreSQL:

2021-12-04 11:05:52.461 CET [2243563] ERROR:  relation "key_value" does not exist at character 22
2021-12-04 11:05:52.461 CET [2243563] STATEMENT:  SELECT COUNT(*) FROM key_value
2021-12-04 11:05:52.511 CET [2243567] LOG:  could not receive data from client: Connection reset by peer
2021-12-04 11:05:52.511 CET [2243566] LOG:  could not send data to client: Connection reset by peer
2021-12-04 11:05:52.511 CET [2243566] STATEMENT:  
                                SELECT (
                        SELECT MAX(rkv.id) AS id
                        FROM kine AS rkv), (
                        SELECT MAX(crkv.prev_revision) AS prev_revision
                        FROM kine AS crkv
                        WHERE crkv.name = 'compact_rev_key'), kv.id AS theid, kv.name, kv.created, kv.deleted, kv.create_revision, kv.prev_revision, kv.lease, kv.value, kv.old_value
                                FROM kine AS kv
                                WHERE
                                        kv.name LIKE $1 AND
                                        kv.id > $2
                                ORDER BY kv.id ASC LIMIT 500
2021-12-04 11:05:52.511 CET [2243565] LOG:  could not send data to client: Connection reset by peer

2021-12-04 11:05:52.511 CET [2243565] STATEMENT:  
                        SELECT (
                        SELECT MAX(rkv.id) AS id
                        FROM kine AS rkv), (
                        SELECT MAX(crkv.prev_revision) AS prev_revision
                        FROM kine AS crkv
                        WHERE crkv.name = 'compact_rev_key'), kv.id AS theid, kv.name, kv.created, kv.deleted, kv.create_revision, kv.prev_revision, kv.lease, kv.value, kv.old_value
                        FROM kine AS kv
                        JOIN (
                                SELECT MAX(mkv.id) AS id
                                FROM kine AS mkv
                                WHERE
                                        mkv.name LIKE $1
                                        
                                GROUP BY mkv.name) maxkv
                    ON maxkv.id = kv.id
                        WHERE
                                  (kv.deleted = 0 OR $2)
                        ORDER BY kv.id ASC
                         LIMIT 1000
2021-12-04 11:05:52.511 CET [2243566] FATAL:  connection to client lost
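
For what it’s worth, the bootstrap data the fatal message refers to lives in the same kine table, under a key whose name embeds a hash derived from the server token. A sketch for inspecting it, assuming direct psql access to the datastore endpoint used above:

# List the stored bootstrap key(s); the startup check compares the hash in
# the key name against the hash of the local server token.
psql 'postgres://[…]' -c "SELECT name, id FROM kine WHERE name LIKE '/bootstrap/%';"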

Backporting

  • Needs backporting to older releases

Most upvoted comments

We haven’t changed the token hash calculation since v0.11.0-alpha3.

You mentioned above that you did not specify a token when initially creating the cluster. This means that each node had a different value in /var/lib/rancher/k3s/server/token when the cluster was upgraded. I believe you need to take manual action: find the node that was upgraded first (the one whose token hash matches the bootstrap hash) and set that same token value on the other nodes.
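
A sketch of that manual action, assuming server-1 is the node that was upgraded first (i.e. the one whose token hash matches the bootstrap data) and that the hostnames are placeholders:

# On the failing server: stop k3s, back up the local token, replace it with
# the token from the matching node, then start again.
systemctl stop k3s
cp /var/lib/rancher/k3s/server/token /var/lib/rancher/k3s/server/token.bak
scp server-1:/var/lib/rancher/k3s/server/token /var/lib/rancher/k3s/server/token
systemctl start k3s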