rancher: Rancher crashes with the error [FATAL] k3s exited with: exit status 255

What kind of request is this (question/bug/enhancement/feature request): bug

Steps to reproduce (fewest steps possible):

  • run rancher:master-head ec04f78, Docker install

  • add an EC2 cluster using k8s 1.19.2-rancher1-1

  • do some operations in the cluster, like installing/uninstalling monitoring v2

Result:

  • Rancher crashes with the following error in its logs:
WARNING: 2020/09/28 22:35:50 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-09-28 22:35:50.392320 W | etcdserver: read-only range request "key:\"/registry/apiextensions.k8s.io/customresourcedefinitions\" range_end:\"/registry/apiextensions.k8s.io/customresourcedefinitiont\" count_only:true " with result "error:context canceled" took too long (1.438297145s) to execute
WARNING: 2020/09/28 22:35:50 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2020/09/28 22:35:50 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2020/09/28 22:35:50 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2020/09/28 22:35:50 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-09-28 22:35:50.392448 W | etcdserver: read-only range request "key:\"/registry/configmaps/fleet-system/gitjob\" " with result "error:context canceled" took too long (1.214651514s) to execute
WARNING: 2020/09/28 22:35:50 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-09-28 22:35:50.392494 W | etcdserver: read-only range request "key:\"/registry/namespaces/kube-system\" " with result "error:context canceled" took too long (1.248120392s) to execute
WARNING: 2020/09/28 22:35:50 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-09-28 22:35:50.392539 W | etcdserver: read-only range request "key:\"/registry/controllers\" range_end:\"/registry/controllert\" count_only:true " with result "error:context canceled" took too long (1.26574346s) to execute
WARNING: 2020/09/28 22:35:50 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
W0928 22:35:50.397496       6 reflector.go:425] pkg/mod/github.com/rancher/client-go@v1.19.0-rancher.1/tools/cache/reflector.go:157: watch of *summary.SummarizedObject ended with: very short watch: pkg/mod/github.com/rancher/client-go@v1.19.0-rancher.1/tools/cache/reflector.go:157: Unexpected watch close - watch lasted less than a second and no items received
...
...
W0928 22:35:50.404731       6 reflector.go:425] pkg/mod/github.com/rancher/client-go@v1.19.0-rancher.1/tools/cache/reflector.go:157: watch of *summary.SummarizedObject ended with: very short watch: pkg/mod/github.com/rancher/client-go@v1.19.0-rancher.1/tools/cache/reflector.go:157: Unexpected watch close - watch lasted less than a second and no items received
2020/09/28 22:35:50 [FATAL] k3s exited with: exit status 255
2020/09/28 22:35:53 [INFO] Rancher version ec04f7878 (ec04f7878) is starting

Here are the full logs of the setup: logs.txt

About this issue

  • Original URL
  • State: open
  • Created 4 years ago
  • Reactions: 3
  • Comments: 29 (4 by maintainers)

Most upvoted comments

I had the same issue on Debian Bullseye with the 5.10.70 kernel. Stranger, if you're still looking for a way to resolve this issue, try the following steps (it worked for me with the latest rancher/rancher):

Edit the GRUB config:

sudo nano /etc/default/grub

Then set/append the following to the GRUB_CMDLINE_LINUX variable: cgroup_memory=1 cgroup_enable=memory swapaccount=1 systemd.unified_cgroup_hierarchy=0

GRUB_CMDLINE_LINUX="cgroup_memory=1 cgroup_enable=memory swapaccount=1 systemd.unified_cgroup_hierarchy=0"

^ (I didn't test each option individually, so some of them may not be necessary.)

Save the file. Then update GRUB:

sudo update-grub

And reboot.

sudo reboot
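
A quick sanity check after the reboot (not part of the original comment) is to confirm the parameters actually made it onto the kernel command line:

# verify the boot parameters were applied
cat /proc/cmdline
# the output should include: cgroup_memory=1 cgroup_enable=memory swapaccount=1 systemd.unified_cgroup_hierarchy=0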

It’s now June and NO ONE at Rancher has a clue what to do about their product crashing on restart? All I did was restart the VM; there is NO reason why this should happen. 😦

I have the same problem

stable does not work for me either

Just updated to Pop!_OS 21.10 and Rancher does not work anymore. I was getting a lot of k3s exits and "cannot connect to https://127.0.0.1:6443" errors. Reinstalling Rancher failed too.

Seemed related …

Pop!_OS uses kernelstub, not GRUB, so thanks to that guy I fixed it with:

sudo kernelstub -a "systemd.unified_cgroup_hierarchy=0"  
sudo update-initramfs -c -k all
# reboot
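
If you want to confirm that the change actually switched the host off the unified cgroup v2 hierarchy (an assumption about why this fix works, not something stated in the thread), the filesystem type of /sys/fs/cgroup is a quick indicator:

# "cgroup2fs" means the unified (v2) hierarchy is still active;
# "tmpfs" indicates the legacy/hybrid (v1) layout
stat -fc %T /sys/fs/cgroup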

I am unable to run rancher/rancher:latest on Ubuntu 20.04 LTS on DigitalOcean.

I was able to run rancher/server:stable (the v1.6 version), but half of the DigitalOcean integrations don’t work. output.log

Switching to the “stable” version worked for me.

docker run -d --restart=unless-stopped \
  -p 80:80 -p 443:443 \
  -v /opt/rancher:/var/lib/rancher \
  --privileged \
  rancher/rancher:stable
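
A quick way to check whether a given image keeps hitting this crash (the <container-id> placeholder is whatever ID the first command prints):

# find the Rancher container, then follow its logs and watch for the fatal k3s exit
docker ps --filter ancestor=rancher/rancher:stable
docker logs -f <container-id> 2>&1 | grep -E 'FATAL|exit status'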

I hit the fatal error in another rancher:master-head ec04f78 Docker install setup. The only thing I did in that setup was provision and delete an RKE cluster.

one crash

Trace[1281344433]: [6.763328647s] [6.763303821s] About to write a response
I0928 22:51:29.578023      24 trace.go:116] Trace[2081261711]: "List etcd3" key:/services/specs,resourceVersion:,limit:0,continue: (started: 2020-09-28 22:51:20.809684085 +0000 UTC m=+185.686606554) (total time: 8.768318326s):
Trace[2081261711]: [8.768318326s] [8.768318326s] END
I0928 22:51:29.578625      24 trace.go:116] Trace[1951576330]: "GuaranteedUpdate etcd3" type:*core.RangeAllocation (started: 2020-09-28 22:51:28.187095209 +0000 UTC m=+193.064017678) (total time: 1.391501075s):
Trace[1951576330]: [1.39147605s] [1.39147605s] initial value restored
I0928 22:51:29.579051      24 trace.go:116] Trace[1326055750]: "Get" url:/api/v1/namespaces/kube-system,user-agent:rancher/v0.0.0 (linux/amd64) kubernetes/$Format,client:127.0.0.1 (started: 2020-09-28 22:51:27.00697806 +0000 UTC m=+191.883900530) (total time: 2.572054999s):
Trace[1326055750]: [2.571969918s] [2.571944736s] About to write a response
E0928 22:51:29.648928       6 leaderelection.go:357] Failed to update lock: Put "https://127.0.0.1:6443/api/v1/namespaces/kube-system/configmaps/cattle-controllers?timeout=15m0s": EOF
2020/09/28 22:51:29 [FATAL] k3s exited with: exit status 255

and another one

W0928 22:53:32.170009       6 reflector.go:425] pkg/mod/github.com/rancher/client-go@v1.19.0-rancher.1/tools/cache/reflector.go:157: watch of *summary.SummarizedObject ended with: very short watch: pkg/mod/github.com/rancher/client-go@v1.19.0-rancher.1/tools/cache/reflector.go:157: Unexpected watch close - watch lasted less than a second and no items received
2020/09/28 22:53:32 [FATAL] k3s exited with: exit status 255

I’m getting the same error using (2.6) rancher:latest, rancher:stable, and rancher:v2.5.16. I do NOT get this error using rancher:v2.4.18. I’m using a clean Ubuntu Server 22.04 VM on Proxmox, with Docker 20.10.17 installed using a modified version number in the Rancher Docker install script. 8 GB RAM, 4 CPUs, 100 GB disk.

I’m receiving this error with all versions of rancher v2.5+ using the stock startup script (fresh install)

I’ve tried:

sudo docker run --privileged -d --restart=unless-stopped -p 80:80 -p 443:443 rancher/rancher
sudo docker run --privileged -d --restart=unless-stopped -p 80:80 -p 443:443 rancher/rancher:stable
sudo docker run --privileged -d --restart=unless-stopped -p 80:80 -p 443:443 rancher/rancher:latest
sudo docker run --privileged -d --restart=unless-stopped -p 80:80 -p 443:443 rancher/rancher:v2.5.0
sudo docker run --privileged -d --restart=unless-stopped -p 80:80 -p 443:443 rancher/rancher:v2.5.1
sudo docker run --privileged -d --restart=unless-stopped -p 80:80 -p 443:443 rancher/rancher:v2.5.9

…etc. All of them fail out of the box. v2.4.17 seems to start up fine.

EDIT: v2.6-head works for me as well

It’s 2022, friend, and I’m going through the same thing.

I’m getting the same error using (2.6) rancher:latest, rancher:stable, and rancher:v2.5.16. I do NOT get this error using rancher:v2.4.18. I’m using a clean Ubuntu Server 22.04 VM on Proxmox, with Docker 20.10.17 installed using a modified version number in the Rancher Docker install script. 8 GB RAM, 4 CPUs, 100 GB disk.

Same here, using this command on a vagrant ubuntu/focal64 box:

docker run -d --privileged --name rancher-server --restart=unless-stopped -p 8080:80 -p 8443:443 -e CATTLE_BOOTSTRAP_PASSWORD=XXX -v /opt/rancher:/var/lib/rancher rancher/rancher:v2.6.0

Last lines of log:

...
...
...
W1007 07:03:29.075052      33 reflector.go:437] pkg/mod/github.com/rancher/client-go@v0.21.0-rancher.1/tools/cache/reflector.go:168: watch of *summary.SummarizedObject ended with: very short watch: pkg/mod/github.com/rancher/client-go@v0.21.0-rancher.1/tools/cache/reflector.go:168: Unexpected watch close - watch lasted less than a second and no items received
W1007 07:03:29.075109      33 reflector.go:437] pkg/mod/github.com/rancher/client-go@v0.21.0-rancher.1/tools/cache/reflector.go:168: watch of *summary.SummarizedObject ended with: very short watch: pkg/mod/github.com/rancher/client-go@v0.21.0-rancher.1/tools/cache/reflector.go:168: Unexpected watch close - watch lasted less than a second and no items received
W1007 07:03:29.075166      33 reflector.go:437] pkg/mod/github.com/rancher/client-go@v0.21.0-rancher.1/tools/cache/reflector.go:168: watch of *summary.SummarizedObject ended with: very short watch: pkg/mod/github.com/rancher/client-go@v0.21.0-rancher.1/tools/cache/reflector.go:168: Unexpected watch close - watch lasted less than a second and no items received
W1007 07:03:29.075212      33 reflector.go:437] pkg/mod/github.com/rancher/client-go@v0.21.0-rancher.1/tools/cache/reflector.go:168: watch of *summary.SummarizedObject ended with: very short watch: pkg/mod/github.com/rancher/client-go@v0.21.0-rancher.1/tools/cache/reflector.go:168: Unexpected watch close - watch lasted less than a second and no items received
2021/10/07 07:03:29 [FATAL] k3s exited with: exit status 255

Had the same issue and v2.6.0-rc10 worked fine

[exit status 255] How do I adjust the etcd parameters (heartbeat-interval, election-timeout)?

  • single node using Docker
    • arm64 (ARMv8 Processor rev 4 (v8l))
    • OS: Armbian (4 vCPU, 2 GB memory)
      1. v2.4.8 was successfully deployed before and used for quite a long time.
      2. After upgrading to v2.5.2: [FATAL] k3s exited with: exit status 255. Redeploying v2.4.8 now also gives: [FATAL] k3s exited with: exit status 255.

uname -a: Linux aml2 5.3.0-aml-g12 #19.11.3 SMP PREEMPT Wed Nov 27 10:39:48 MSK 2019 aarch64 aarch64 aarch64 GNU/Linux (arm64 Armbian based on Ubuntu 18.04, 4 cores + 2 GB RAM)

v2.4.8

2020/11/12 08:06:24 [INFO] Done waiting for CRD sourcecoderepositories.project.cattle.io to become available
2020-11-12 08:06:28.381181 W | wal: sync duration of 4.26296889s, expected less than 1s
I1112 08:06:34.096856 32 leaderelection.go:288] failed to renew lease kube-system/kube-controller-manager: failed to tryAcquireOrRenew context deadline exceeded
2020/11/12 08:06:34 [FATAL] k3s exited with: exit status 255
F1112 08:06:34.097544 32 controllermanager.go:279] leaderelection lost
2020-11-12 08:06:34.868760 W | etcdserver: read-only range request "key:\"/registry/replicasets\" range_end:\"/registry/replicasett\" count_only:true " with result "error:context canceled" took too long (1.600306541s) to execute
E1112 08:06:34.873351 6 reflector.go:384] github.com/rancher/norman/controller/generic_controller.go:237: Failed to watch *v3.Feature: Get https://127.0.0.1:6443/apis/management.cattle.io/v3/watch/features?allowWatchBookmarks=true&resourceVersion=237&timeout=30m0s&timeoutSeconds=582: read tcp 127.0.0.1:49660->127.0.0.1:6443: read: connection reset by peer

v2.5.2

I1112 07:52:46.268417 30 trace.go:116] Trace[954593231]: "GuaranteedUpdate etcd3" type:*core.ConfigMap (started: 2020-11-12 07:52:44.959054077 +0000 UTC m=+246.415603389) (total time: 1.309216055s):
Trace[954593231]: [1.309089428s] [1.307618564s] Transaction committed
I1112 07:52:48.181845 30 trace.go:116] Trace[353422474]: "Update" url:/api/v1/namespaces/kube-system/configmaps/cattle-controllers,user-agent:rancher/v0.0.0 (linux/arm64) kubernetes/$Format,client:127.0.0.1 (started: 2020-11-12 07:52:44.958447606 +0000 UTC m=+246.414996877) (total time: 3.223161446s):
Trace[353422474]: [3.222490099s] [3.222060174s] Object stored in database
I1112 07:52:46.293682 30 job_controller.go:156] Shutting down job controller
I1112 07:52:46.743071 30 horizontal.go:180] Shutting down HPA controller
F1112 07:52:46.221778 30 controllermanager.go:279] leaderelection lost
2020/11/12 07:52:51 [FATAL] k3s exited with: exit status 255

Attached: log.zip
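
On the etcd tuning question above: standalone etcd does expose --heartbeat-interval and --election-timeout flags (values in milliseconds), and raising them is the usual mitigation for slow storage such as small ARM boards. Nothing in this thread confirms how (or whether) those flags can be passed through to the etcd embedded in Rancher's single-node k3s, so the line below is only an illustration of etcd's own flags, not a Rancher-specific fix:

# etcd tuning flags (milliseconds); the defaults are 100 / 1000.
# Illustration only - this thread does not confirm how to pass these to Rancher's embedded k3s.
etcd --heartbeat-interval=500 --election-timeout=5000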