k3s: starting kubernetes: preparing server: json: cannot unmarshal string into Go value of type bootstrap.File

Environmental Info: K3s Version: v1.22.4+k3s1 (bec170bc)

Node(s) CPU architecture, OS, and Version: Linux pi4-rack-1.local 5.10.82-v8+ #1497 SMP PREEMPT Fri Dec 3 16:30:35 GMT 2021 aarch64 GNU/Linux

Cluster Configuration: 2 Servers, 1 Agent

Describe the bug: After upgrading from Buster to Bullseye I got a cgroup error, so I then upgraded k3s from v1.18.x to v1.22.4+k3s1. After starting k3s I now get: starting kubernetes: preparing server: json: cannot unmarshal string into Go value of type bootstrap.File. For reference, the kubeconfig on the node (values redacted) looks like this:

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: xxx
    server: https://127.0.0.1:6443
  name: default
contexts:
- context:
    cluster: default
    user: default
  name: default
current-context: default
kind: Config
preferences: {}
users:
- name: default
  user:
    password: xxx
    username: xxx

Steps To Reproduce:

  • Install Buster
  • Install k3s in Version 1.18.x
  • Upgrade Buster to Bullseye
  • Upgrade k3s to Version v1.22.4+k3s1

Expected behavior: The server should start. 😃

Actual behavior: The server does not start.

Additional context / logs:

Dez 06 11:39:12 pi4-rack-1.local k3s[15556]: time="2021-12-06T11:39:12Z" level=info msg="Starting k3s v1.22.4+k3s1 (bec170bc)"
Dez 06 11:39:12 pi4-rack-1.local k3s[15556]: time="2021-12-06T11:39:12Z" level=info msg="Configuring sqlite3 database connection pooling: maxIdleConns=2, maxOpenConns=0, connMaxLifetime=0s"
Dez 06 11:39:12 pi4-rack-1.local k3s[15556]: time="2021-12-06T11:39:12Z" level=info msg="Configuring database table schema and indexes, this may take a moment..."
Dez 06 11:39:12 pi4-rack-1.local k3s[15556]: time="2021-12-06T11:39:12Z" level=info msg="Database tables and indexes are up to date"
Dez 06 11:39:12 pi4-rack-1.local k3s[15556]: time="2021-12-06T11:39:12Z" level=info msg="Kine available at unix://kine.sock"
Dez 06 11:39:12 pi4-rack-1.local k3s[15556]: time="2021-12-06T11:39:12Z" level=info msg="Reconciling bootstrap data between datastore and disk"
Dez 06 11:39:12 pi4-rack-1.local k3s[15556]: time="2021-12-06T11:39:12Z" level=fatal msg="starting kubernetes: preparing server: json: cannot unmarshal string into Go value of type bootstrap.File"
Dez 06 11:39:12 pi4-rack-1.local systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
Dez 06 11:39:12 pi4-rack-1.local systemd[1]: k3s.service: Failed with result 'exit-code'.
Dez 06 11:39:12 pi4-rack-1.local systemd[1]: Failed to start Lightweight Kubernetes.
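
The fatal message is an ordinary encoding/json type mismatch: one side of the bootstrap reconcile still carries data in the older layout, where each entry is a bare string, while the upgraded server tries to decode it into a struct. The snippet below is a minimal, self-contained reproduction of that mismatch; the File struct here is only modeled on what the k3s bootstrap package appears to expect (its exact fields are an assumption), but the error it produces has the same shape.

package main

import (
    "encoding/json"
    "fmt"
    "time"
)

// File stands in for the struct newer k3s versions expect per bootstrap entry
// (field names are assumptions, for illustration only).
type File struct {
    Timestamp time.Time `json:"timestamp"`
    Content   []byte    `json:"content"`
}

func main() {
    // Old-format bootstrap data: each entry is just the file content as a string.
    oldFormat := []byte(`{"ServerCA": "-----BEGIN CERTIFICATE-----\n..."}`)

    // Newer code decodes into a map of structs, so the string value cannot be
    // unmarshaled and encoding/json reports a type mismatch.
    var files map[string]File
    if err := json.Unmarshal(oldFormat, &files); err != nil {
        // Prints: json: cannot unmarshal string into Go value of type main.File
        // (k3s reports bootstrap.File because its struct lives in the bootstrap package.)
        fmt.Println(err)
    }
}

In other words, the failure is about the shape of the stored data rather than corruption; once the stored entries and the expected type agree again, the reconcile step can proceed.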

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 37 (21 by maintainers)

Most upvoted comments

@Oats87 Yes, we started on the init node.

I’m away from the system right now. I will provide the information regarding --cluster-init tomorrow because I’m not 100% sure off the top of my head.

@ohlol if you are replacing your nodes with new ones using an ASG, then the fix @briandowns has in #4730 will most likely fix your issue.

I’m now attempting to home in on what issue is being hit on an in-place upgrade.
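
To make the nature of such a fix concrete: the server has to accept bootstrap data written by older releases as well as the current format. The sketch below shows one way a backward-compatible decode could be structured. It is purely illustrative and is not the actual change in #4730; the File fields are the same assumptions as in the reproduction above.

package main

import (
    "encoding/json"
    "fmt"
    "time"
)

// File mirrors the struct from the earlier reproduction (fields are assumptions).
type File struct {
    Timestamp time.Time `json:"timestamp"`
    Content   []byte    `json:"content"`
}

// decodeBootstrap first tries the current struct-based encoding and, if that
// fails, falls back to the legacy map-of-strings encoding and converts it.
func decodeBootstrap(raw []byte) (map[string]File, error) {
    files := map[string]File{}
    if err := json.Unmarshal(raw, &files); err == nil {
        return files, nil
    }

    // Legacy format: each entry is the file content as a plain string.
    legacy := map[string]string{}
    if err := json.Unmarshal(raw, &legacy); err != nil {
        return nil, fmt.Errorf("bootstrap data matches neither encoding: %w", err)
    }
    for name, content := range legacy {
        files[name] = File{Content: []byte(content)}
    }
    return files, nil
}

func main() {
    old := []byte(`{"ServerCA": "-----BEGIN CERTIFICATE-----\n..."}`)
    files, err := decodeBootstrap(old)
    if err != nil {
        panic(err)
    }
    fmt.Printf("decoded %d legacy bootstrap file(s)\n", len(files))
}

Whether the real fix normalizes the stored data, the decoder, or both is a detail of the PR; the point is only that both encodings have to be handled during the reconcile on upgrade.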

@brandond: I’m working with @knweiss and we have a three-server HA setup with embedded etcd. The scenario is a rolling update, but when trying to upgrade one of the servers to v1.21.7+k3s1, the error occurs. From my understanding the server tries to rejoin the cluster after the upgrade, which fails due to the described problem. At this stage we can only get the server back into a running state by downgrading the affected server to match the version of the other two.

FWIW: We saw the same error today upgrading our three-node k3s cluster (stable channel; embedded etcd) from v1.21.5+k3s2 to v1.21.7+k3s1.

Hi @brandond, thanks for your answer. I thought it was OK to jump forward four minor versions; in SemVer, minor versions should not break anything (imho).

Thanks for the update @knweiss.

@Oats87 Today, we repeated the stable channel upgrade from v1.21.5+k3s2 to v1.21.7+k3s1 and much to our surprise this time it succeeded on all three nodes (with --node-external-ip active on all three nodes). Unfortunately, we don’t know what’s different this time. 😕 (We may have done the last test with --disable servicelb in common_options, but we’re not 100% sure anymore.)

Regarding the --node-external-ip $VIP: The idea is to use the kube-vip VIP as a LoadBalancer IP for Traefik and not only to access the control plane. We have a wildcard DNS *.domain.local that points to this VIP. Traefik is the IngressController for our services and cert-manager provides TLS certs for all DNS names (e.g. svc1.domain.local).

In the default k3s configuration with three k3s servers (also used as workers), Traefik will use the three node-local IP addresses as its LoadBalancer IPs. This works. However, if DNS resolution for an external service name points to one of those three node-local IPs, the service would not be available during maintenance of that (server) node. To prevent this situation we came up with the --node-external-ip $VIP solution. Do you think this is a bad idea?

The --node-external-ip setting is a very recent change in our setup (we have not done much testing yet). The only issue we noticed so far is that the helm-install-traefik* pods had problems starting while the VIP was not on their node.

NAME   STATUS   ROLES                       AGE   VERSION        INTERNAL-IP   EXTERNAL-IP     OS-IMAGE                           KERNEL-VERSION                CONTAINER-RUNTIME
node0   Ready    control-plane,etcd,master   47d   v1.21.7+k3s1   x.y.142.241   x.y.142.232   Rocky Linux 8.5 (Green Obsidian)   4.18.0-348.2.1.el8_5.x86_64   containerd://1.4.12-k3s1
node1   Ready    control-plane,etcd,master   47d   v1.21.7+k3s1   x.y.142.240   x.y.142.232   Rocky Linux 8.5 (Green Obsidian)   4.18.0-348.2.1.el8_5.x86_64   containerd://1.4.12-k3s1
node2   Ready    control-plane,etcd,master   48d   v1.21.7+k3s1   x.y.142.239   x.y.142.232   Rocky Linux 8.5 (Green Obsidian)   4.18.0-348.2.1.el8_5.x86_64   containerd://1.4.12-k3s1

(Note that the external IP (VIP) is shown on all three (server) nodes but is only active on one at a time.)

I think you may be trying to do things a bit backwards here. Setting --node-external-ip to the same value on all of your nodes has big implications, as it affects core K8s/K3s behavior and can lead to unexpected results.

What I would recommend in this case is to disable servicelb and then deploy kube-vip configured to fulfill services with type LoadBalancer.

Regardless, if you hit this issue again (or would like any clarification on this), please open a new issue and be sure to mention me on it; I will let the K3s QA team close out this issue when they finish validation of the edge case we identified above. /cc @k3s-io/k3s-testing

@Oats87

Great, thank you. I want to do some more testing with this. Out of curiosity, what are you running on your other controlplane/server nodes?

The other server nodes would be started like this (as mentioned, we did not reach this point with v1.21.7+k3s1 because the init node did not start successfully due to the unmarshaling issue):

common_options="--etcd-snapshot-retention 10 --selinux"
INSTALL_K3S_EXEC="--server=https://INITNODE:6443 $common_options --node-external-ip $VIP"
curl -sfL https://get.k3s.io | \
  INSTALL_K3S_CHANNEL="stable" \
  K3S_TOKEN="$K3S_TOKEN" \
  INSTALL_K3S_EXEC="$INSTALL_K3S_EXEC" sh -

(Gonna move --node-external-ip into the common_options variable…)

MySQL for HA

I’ll have detailed steps describing my process and observations later today, but briefly:

architectural context/details:

  • RKE2 leaders are in an Auto Scaling Group
  • RKE2 is installed w/Packer, so an upgrade is done by building a new AMI with the target version of RKE2, which is used to update the ASG’s Launch Template
  • I don’t upgrade live systems; they get replaced

upgrade steps:

  1. I take a snapshot (to S3) of etcd state
  2. Scale the ASG to zero, update the Launch Template, scale the ASG back to 1
  3. SSH to the new leader, stop rke2-server, and perform the upgrade as described in the snapshot backup/restore doc

As an aside, prior to 1.21.7 I was able to do this by just manually cycling ASG instances one by one.

Anyway, hope that helps clarify my process at least, for what it’s worth. I’ll get some actual log output etc later today.