k3s: starting kubernetes: preparing server: json: cannot unmarshal string into Go value of type bootstrap.File
Environmental Info: K3s Version: v1.22.4+k3s1 (bec170bc)
Node(s) CPU architecture, OS, and Version: Linux pi4-rack-1.local 5.10.82-v8+ #1497 SMP PREEMPT Fri Dec 3 16:30:35 GMT 2021 aarch64 GNU/Linux
Cluster Configuration: 2 Servers, 1 Agent
Describe the bug:
After upgrading from Buster to Bullseye I got a cgroup error.
I then upgraded k3s from 1.18.x to v1.22.4+k3s1.
After starting k3s I get:
starting kubernetes: preparing server: json: cannot unmarshal string into Go value of type bootstrap.File
The kubeconfig on the node looks like this:
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: xxx
    server: https://127.0.0.1:6443
  name: default
contexts:
- context:
    cluster: default
    user: default
  name: default
current-context: default
kind: Config
preferences: {}
users:
- name: default
  user:
    password: xxx
    username: xxx
Steps To Reproduce:
- Install Buster
- Install k3s version 1.18.x
- Upgrade Buster to Bullseye
- Upgrade k3s to v1.22.4+k3s1
Expected behavior: Server should start 😃
Actual behavior: Server does not start 😃
Additional context / logs:
Dez 06 11:39:12 pi4-rack-1.local k3s[15556]: time="2021-12-06T11:39:12Z" level=info msg="Starting k3s v1.22.4+k3s1 (bec170bc)"
Dez 06 11:39:12 pi4-rack-1.local k3s[15556]: time="2021-12-06T11:39:12Z" level=info msg="Configuring sqlite3 database connection pooling: maxIdleConns=2, maxOpenConns=0, connMaxLifetime=0s"
Dez 06 11:39:12 pi4-rack-1.local k3s[15556]: time="2021-12-06T11:39:12Z" level=info msg="Configuring database table schema and indexes, this may take a moment..."
Dez 06 11:39:12 pi4-rack-1.local k3s[15556]: time="2021-12-06T11:39:12Z" level=info msg="Database tables and indexes are up to date"
Dez 06 11:39:12 pi4-rack-1.local k3s[15556]: time="2021-12-06T11:39:12Z" level=info msg="Kine available at unix://kine.sock"
Dez 06 11:39:12 pi4-rack-1.local k3s[15556]: time="2021-12-06T11:39:12Z" level=info msg="Reconciling bootstrap data between datastore and disk"
Dez 06 11:39:12 pi4-rack-1.local k3s[15556]: time="2021-12-06T11:39:12Z" level=fatal msg="starting kubernetes: preparing server: json: cannot unmarshal string into Go value of type bootstrap.File"
Dez 06 11:39:12 pi4-rack-1.local systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
Dez 06 11:39:12 pi4-rack-1.local systemd[1]: k3s.service: Failed with result 'exit-code'.
Dez 06 11:39:12 pi4-rack-1.local systemd[1]: Failed to start Lightweight Kubernetes.
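For context on the error itself: Go's encoding/json refuses to decode a JSON string into a struct value, so the fatal message above suggests the bootstrap data read back from the datastore is still in an older shape (plain strings) while the upgraded server expects objects. Below is a minimal, hedged sketch of how that class of error surfaces; the field and key names are assumptions made purely for illustration and are not the actual k3s definitions (which is why the sketch reports main.File rather than bootstrap.File):

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// File approximates the shape of k3s's bootstrap.File type.
// The exact field set is an assumption for this illustration.
type File struct {
	Timestamp time.Time `json:"timestamp"`
	Content   []byte    `json:"content"`
}

func main() {
	// Hypothetical bootstrap data in the newer shape: each entry is an object.
	current := []byte(`{"ServerCA":{"timestamp":"2021-12-06T11:39:12Z","content":"Y2VydA=="}}`)

	// Hypothetical bootstrap data in an older shape: each entry is a bare
	// string holding the file contents.
	legacy := []byte(`{"ServerCA":"-----BEGIN CERTIFICATE-----..."}`)

	var files map[string]File

	fmt.Println("current:", json.Unmarshal(current, &files)) // <nil>
	fmt.Println("legacy: ", json.Unmarshal(legacy, &files))  // json: cannot unmarshal string into Go value of type main.File
}
```

Running this prints no error for the object form and the same "cannot unmarshal string into Go value" failure for the legacy string form.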
@Oats87 Yes, we started on the init node.
I’m away from the system right now. I will provide the information regarding --cluster-init tomorrow because I’m not 100% sure off the top of my head.

@ohlol if you are replacing your nodes with new ones using an ASG, then the fix @briandowns has in #4730 will most likely fix your issue.
I’m now attempting to hone in on what issue is being hit on an in-place upgrade.
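To make that concrete: the class of fix an in-place upgrade needs is a decoder that tolerates both the legacy and the current on-disk shapes. A hedged Go sketch of that idea follows; the field names and the legacy format are assumptions for illustration, and this is not necessarily how #4730 implements it:

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// File is an illustrative stand-in for bootstrap.File; the field names are assumptions.
type File struct {
	Timestamp time.Time `json:"timestamp"`
	Content   []byte    `json:"content"`
}

// UnmarshalJSON accepts both the legacy bare-string form and the current
// object form, so bootstrap data written by an old release can still be read.
func (f *File) UnmarshalJSON(data []byte) error {
	// Try the legacy form first: a plain JSON string holding the file contents.
	var legacy string
	if err := json.Unmarshal(data, &legacy); err == nil {
		f.Content = []byte(legacy)
		return nil
	}

	// Fall back to the current object form. The local defined type avoids
	// recursing back into this method.
	type plain File
	var p plain
	if err := json.Unmarshal(data, &p); err != nil {
		return err
	}
	*f = File(p)
	return nil
}

func main() {
	payloads := [][]byte{
		[]byte(`"-----BEGIN CERTIFICATE-----..."`),                          // legacy string
		[]byte(`{"timestamp":"2021-12-06T11:39:12Z","content":"Y2VydA=="}`), // current object
	}
	for _, payload := range payloads {
		var f File
		if err := json.Unmarshal(payload, &f); err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("ok: %q\n", f.Content)
	}
}
```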
@brandond: I’m working with @knweiss and we have a three-server HA setup with embedded etcd. The scenario is to do a rolling update, but when trying to lift one of the servers to v1.21.7+k3s1, the error occurs. From my understanding, the server tries to rejoin the cluster after the upgrade, which fails due to the described problem. At this stage we can only get the server back into a running state by downgrading the affected server to match the version of the other two.
FWIW: We saw the same error today upgrading our three-node k3s cluster (stable channel; embedded etcd) from v1.21.5+k3s2 to v1.21.7+k3s1.

Hi @brandond, thanks for your answer. I thought it was OK to jump forward four minor versions. In SemVer, minor versions should not break anything (IMHO).
Thanks for the update @knweiss.
I think you may be trying to do things a bit backwards here. Setting --node-external-ip to the same value for all of your nodes has big implications, as that affects core K8s/K3s behavior and can lead to unexpected behavior. What I would recommend in this case is to disable servicelb and then deploy kube-vip configured to fulfill services of type LoadBalancer.

Regardless, if you hit this issue again (or have any desired clarifications on this), please open a new issue and be sure to mention me on it – I will let the K3s QA team close out this issue when they finish validation of the edge case we identified above. /cc @k3s-io/k3s-testing
@Oats87 Today, we repeated the stable channel upgrade from v1.21.5+k3s2 to v1.21.7+k3s1 and, much to our surprise, this time it succeeded on all three nodes (with --node-external-ip active on all three nodes). Unfortunately, we don’t know what’s different this time. 😕 (We may have done the last test with --disable servicelb in common-options, but we’re not 100% sure anymore.)

Regarding --node-external-ip $VIP: The idea is to use the kube-vip VIP as a LoadBalancer IP for Traefik and not only to access the control plane. We have a wildcard DNS entry *.domain.local that points to this VIP. Traefik is the IngressController for our services and cert-manager provides TLS certs for all DNS names (e.g. svc1.domain.local).

In the default k3s configuration with three k3s servers (also used as workers), Traefik will use the three node-local IP addresses as Traefik’s LoadBalancer IPs. This works. However, if DNS resolution for external service names points to one of those three node-local IPs, the service would not be available during maintenance of that (server) node. To prevent this situation we came up with the --node-external-ip $VIP solution. Do you think this is a bad idea?

The --node-external-ip setting is a very recent change in our setup (we have not done much testing yet). The only issue we noticed so far is that the helm-install-traefik* pods had problems starting while the VIP was not on their node. (Note: the external IP (VIP) is shown on all three (server) nodes but is only active on one at a time.)
@Oats87
The other server nodes would be started like this (as mentioned, we did not reach this point with v1.21.7+k3s1, as the init node did not start successfully because of the unmarshaling issue):

(Gonna move --node-external-ip into the common_options variable…)

MySQL for HA
I’ll have detailed steps describing my process and observations later today, but briefly:

architectural context/details:
rke2-server, and perform upgrade as described in the snapshot backup/restore doc.

As an aside, prior to 1.21.7 I was able to do this by just manually cycling ASG instances one by one.
Anyway, hope that helps clarify my process at least, for what it’s worth. I’ll get some actual log output etc. later today.