rancher: [BUG] Fleet-agent panics on k3s node driver cluster and doesn't recover

Information about the Cluster Rancher Server Setup

  • Rancher version: v2.8.0-alpha1, Fleet version 103.1.0+up0.9.0-rc.3
  • Installation option (Docker install/Helm Chart): Helm Chart
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): RKE1, v1.27.6-rancher1-1, 1.5.0-rc5
  • Proxy/Cert Details: Self-signed

Information about the Cluster

  • Kubernetes version: v1.27.5+k3s1
  • Cluster Type (Local/Downstream): Downstream; node driver (infrastructure provider) k3s cluster with 3 worker, 1 etcd, and 1 control plane nodes

Describe the bug Fleet-agent panics on a k3s node driver cluster and doesn’t recover. The cluster is stuck in the Updating state.

To Reproduce

  1. Create a Rancher HA server with 3 nodes on an RKE1 cluster.
  2. Create a downstream k3s node driver cluster with 5 nodes.
  3. Create 2017 users and add them to the downstream cluster as cluster owners; the crash loop is observed after this.

Result The fleet-agent pod is in CrashLoopBackOff with the following panic:

I0928 18:14:53.356518       1 leaderelection.go:245] attempting to acquire leader lease cattle-fleet-system/fleet-agent-lock...
2023-09-28T18:14:53.390312936Z I0928 18:14:53.390199       1 leaderelection.go:255] successfully acquired lease cattle-fleet-system/fleet-agent-lock
2023-09-28T18:14:53.395045943Z time="2023-09-28T18:14:53Z" level=warning msg="Cannot find fleet-agent secret, running registration"
2023-09-28T18:14:53.680689231Z panic: assignment to entry in nil map
2023-09-28T18:14:53.680732312Z 
2023-09-28T18:14:53.680736843Z goroutine 52 [running]:
2023-09-28T18:14:53.681029990Z github.com/rancher/fleet/internal/cmd/agent/register.runRegistration({0x2cca1d8, 0xc0001185f0}, {0x2cd7a08, 0xc0004d8390}, {0xc00005800a, 0x13}, {0x0, 0x0})
2023-09-28T18:14:53.681036620Z 	/go/src/github.com/rancher/fleet/internal/cmd/agent/register/register.go:169 +0x4e5
2023-09-28T18:14:53.681038880Z github.com/rancher/fleet/internal/cmd/agent/register.tryRegister({0x2cca1d8, 0xc0001185f0}, {0xc00005800a, 0x13}, {0x0, 0x0}, 0x17fa000010000?)
2023-09-28T18:14:53.681040971Z 	/go/src/github.com/rancher/fleet/internal/cmd/agent/register/register.go:79 +0x313
2023-09-28T18:14:53.681042551Z github.com/rancher/fleet/internal/cmd/agent/register.Register({0x2cca1d8, 0xc0001185f0}, {0xc00005800a, 0x13}, {0x0, 0x0}, 0x41c4de0?)
2023-09-28T18:14:53.681044060Z 	/go/src/github.com/rancher/fleet/internal/cmd/agent/register/register.go:51 +0x88
2023-09-28T18:14:53.681045571Z github.com/rancher/fleet/internal/cmd/agent.start.func1({0x2cca1d8, 0xc0001185f0})
2023-09-28T18:14:53.681047991Z 	/go/src/github.com/rancher/fleet/internal/cmd/agent/start.go:58 +0x92
2023-09-28T18:14:53.681049480Z created by github.com/rancher/wrangler/pkg/leader.run.func1 in goroutine 50
2023-09-28T18:14:53.681050980Z 	/go/pkg/mod/github.com/rancher/wrangler@v1.1.1/pkg/leader/leader.go:58 +0x90

The following logs were observed in the Rancher pods:

2023/09/28 18:25:55 [INFO] [planner] rkecluster fleet-default/anuk3snewperf - machine fleet-default/anuk3snewperf-w-585db4f7cxhsw64-5p8cc - previous join server () was not valid, using new join server (https://[2600:3c01::f03c:93ff:fe00:698a]:6443)
2023/09/28 18:25:55 [INFO] [planner] rkecluster fleet-default/anuk3snewperf: waiting for machine fleet-default/anuk3snewperf-w-585db4f7cxhsw64-5p8cc driver config to be saved
2023/09/28 18:25:55 [INFO] [planner] rkecluster fleet-default/anuk3snewperf - machine fleet-default/anuk3snewperf-w-585db4f7cxhsw64-5p8cc - previous join server () was not valid, using new join server (https://[2600:3c01::f03c:93ff:fe00:698a]:6443)
2023/09/28 18:25:55 [INFO] [planner] rkecluster fleet-default/anuk3snewperf: waiting for machine fleet-default/anuk3snewperf-w-585db4f7cxhsw64-5p8cc driver config to be saved
2023/09/28 18:25:56 [INFO] [planner] rkecluster fleet-default/anuk3snewperf - machine fleet-default/anuk3snewperf-w-585db4f7cxhsw64-5p8cc - previous join server () was not valid, using new join server (https://[2600:3c01::f03c:93ff:fe00:698a]:6443)
2023/09/28 18:25:56 [INFO] [planner] rkecluster fleet-default/anuk3snewperf: waiting for machine fleet-default/anuk3snewperf-w-585db4f7cxhsw64-5p8cc driver config to be saved


I0928 18:48:20.966148      33 trace.go:236] Trace[2097947468]: "DeltaFIFO Pop Process" ID:fleet-default/crt-anuk3snewperf-crtb-grb-44p9v-rrptdbdmsf,Depth:9968,Reason:slow event handlers blocking the queue (28-Sep-2023 18:48:20.838) (total time: 128ms):
I0928 18:48:21.683776      33 trace.go:236] Trace[1394317482]: "DeltaFIFO Pop Process" ID:fleet-default/crt-anuk3snewperf-crtb-grb-zwrvf-cn6behj35g,Depth:9967,Reason:slow event handlers blocking the queue (28-Sep-2023 18:48:20.966) (total time: 717ms):
I0928 18:48:22.447887      33 trace.go:236] Trace[516340835]: "DeltaFIFO Pop Process" ID:fleet-default/crt-anuk3snewperf-crtb-grb-4vlpx-j4vb7lcspe,Depth:9964,Reason:slow event handlers blocking the queue (28-Sep-2023 18:48:22.195) (total time: 252ms):
I0928 18:48:24.431746      33 trace.go:236] Trace[1537574149]: "DeltaFIFO Pop Process" ID:fleet-default/crt-anuk3snewperf-crtb-grb-2csml-buttzmfwgc,Depth:9948,Reason:slow event handlers blocking the queue (28-Sep-2023 18:48:24.243) (total time: 187ms):

Expected Result The fleet-agent pod does not panic and is not stuck in the Updating state with the pod in CrashLoopBackOff.

Additional context Unable to reproduce the issue by retracing the steps on another setup. Adding additional nodes to the cluster and creating a deployment both work even after the pod is stuck in CrashLoopBackOff.

About this issue

  • State: closed
  • Created 9 months ago
  • Comments: 15 (5 by maintainers)

Most upvoted comments

So how can I install Fleet 0.9 in Rancher 2.7.9?

System information:

Cluster used: k3s 
k8s version: v1.27.6+k3s1
Rancher Version: 2.8-head
Fleet version: 0.9.0-rc.4

I have followed the steps mentioned in the description. I didn’t see any warning or error logs in the fleet-agent or fleet-controller pods.

Also performed sanity testing around it by creating a GitRepo that installs an application from a Helm chart as well as a normal app deployment. Applications are deployed successfully without any errors.

Rancher 2.8-head as of today is still using rc3. Do we need a Rancher ticket to get this bumped on the Rancher side?

Fleet 0.9.0-rc.4 is now released.

Hi, we experienced the same issue on: Kubernetes 1.23.7, Rancher 2.7.6, fleet-agent 0.8.0

LOGS:

I1004 13:04:59.907282       1 leaderelection.go:248] attempting to acquire leader lease cattle-fleet-system/fleet-agent-lock...
I1004 13:04:59.942974       1 leaderelection.go:258] successfully acquired lease cattle-fleet-system/fleet-agent-lock
time="2023-10-04T13:04:59Z" level=warning msg="Cannot find fleet-agent secret, running registration"
panic: assignment to entry in nil map

goroutine 51 [running]:
github.com/rancher/fleet/internal/cmd/agent/register.createAgentSecret({0x2b21500, 0xc00035b450}, {0x0, 0x0}, {0x2b2dd90, 0xc000412300}, 0xc0009788c0)
	/go/src/github.com/rancher/fleet/internal/cmd/agent/register/register.go:174 +0x3dc
github.com/rancher/fleet/internal/cmd/agent/register.runRegistration({0x2b21500, 0xc00035b450}, {0x2b2dd90?, 0xc000412300?}, {0xc00005800a, 0x13}, {0x0, 0x0})
	/go/src/github.com/rancher/fleet/internal/cmd/agent/register/register.go:118 +0x1af
github.com/rancher/fleet/internal/cmd/agent/register.tryRegister({0x2b21500, 0xc00035b450}, {0xc00005800a, 0x13}, {0x0, 0x0}, 0x459c05?)
	/go/src/github.com/rancher/fleet/internal/cmd/agent/register/register.go:81 +0x325
github.com/rancher/fleet/internal/cmd/agent/register.Register({0x2b21500, 0xc00035b450}, {0xc00005800a, 0x13}, {0x0, 0x0}, 0x478837?)
	/go/src/github.com/rancher/fleet/internal/cmd/agent/register/register.go:53 +0x97
github.com/rancher/fleet/internal/cmd/agent.start.func1({0x2b21500, 0xc00035b450})
	/go/src/github.com/rancher/fleet/internal/cmd/agent/start.go:58 +0x9e
created by github.com/rancher/wrangler/pkg/leader.run.func1
	/go/pkg/mod/github.com/rancher/wrangler@v1.1.1/pkg/leader/leader.go:58 +0x98

Temporarily resolved by downgrading the fleet agent to 0.7.1, which runs without problems.
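
For what it’s worth, the createAgentSecret frame in this trace suggests the nil map is hit while populating the agent secret. The sketch below is only a guess at that failure mode, assuming the write happens on a corev1.Secret whose Data map was never initialized; it is not the actual Fleet implementation, and the "token" key is made up for illustration.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Build a Secret without initializing Data, as could happen if nothing
	// was copied into it earlier in the registration flow (assumption).
	secret := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{Name: "fleet-agent", Namespace: "cattle-fleet-system"},
	}

	// secret.Data["token"] = []byte("...") // would panic: assignment to entry in nil map

	// Defensive fix: make sure the map exists before the first write.
	if secret.Data == nil {
		secret.Data = map[string][]byte{}
	}
	secret.Data["token"] = []byte("example")
	fmt.Println(secret.Name, len(secret.Data))
}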