rancher: [BUG] Fleet-agent panics on k3s node driver cluster and doesn't recover
Rancher Server Setup
- Rancher version: v2.8.0-alpha1, Fleet version 103.1.0+up0.9.0-rc.3
- Installation option (Docker install/Helm Chart): Helm Chart
- If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): RKE1, v1.27.6-rancher1-1, 1.5.0-rc5
- Proxy/Cert Details: Self-signed
Information about the Cluster
- Kubernetes version: v1.27.5+k3s1
- Cluster Type (Local/Downstream): Downstream (infrastructure provider); 3 worker, 1 etcd, 1 control-plane k3s nodes
Describe the bug
Fleet-agent panics on the k3s node driver cluster and does not recover. The cluster is stuck in the Updating state.
To Reproduce
- Create a Rancher HA server with 3 nodes on an RKE1 cluster
- Create a downstream k3s node driver cluster with 5 nodes
- Create 2017 users and add them to the downstream cluster as cluster-owner; the crash loop is then observed
Result
The fleet-agent pod is in CrashLoopBackOff with the following panic:
I0928 18:14:53.356518 1 leaderelection.go:245] attempting to acquire leader lease cattle-fleet-system/fleet-agent-lock...
2023-09-28T18:14:53.390312936Z I0928 18:14:53.390199 1 leaderelection.go:255] successfully acquired lease cattle-fleet-system/fleet-agent-lock
2023-09-28T18:14:53.395045943Z time="2023-09-28T18:14:53Z" level=warning msg="Cannot find fleet-agent secret, running registration"
2023-09-28T18:14:53.680689231Z panic: assignment to entry in nil map
2023-09-28T18:14:53.680732312Z
2023-09-28T18:14:53.680736843Z goroutine 52 [running]:
2023-09-28T18:14:53.681029990Z github.com/rancher/fleet/internal/cmd/agent/register.runRegistration({0x2cca1d8, 0xc0001185f0}, {0x2cd7a08, 0xc0004d8390}, {0xc00005800a, 0x13}, {0x0, 0x0})
2023-09-28T18:14:53.681036620Z /go/src/github.com/rancher/fleet/internal/cmd/agent/register/register.go:169 +0x4e5
2023-09-28T18:14:53.681038880Z github.com/rancher/fleet/internal/cmd/agent/register.tryRegister({0x2cca1d8, 0xc0001185f0}, {0xc00005800a, 0x13}, {0x0, 0x0}, 0x17fa000010000?)
2023-09-28T18:14:53.681040971Z /go/src/github.com/rancher/fleet/internal/cmd/agent/register/register.go:79 +0x313
2023-09-28T18:14:53.681042551Z github.com/rancher/fleet/internal/cmd/agent/register.Register({0x2cca1d8, 0xc0001185f0}, {0xc00005800a, 0x13}, {0x0, 0x0}, 0x41c4de0?)
2023-09-28T18:14:53.681044060Z /go/src/github.com/rancher/fleet/internal/cmd/agent/register/register.go:51 +0x88
2023-09-28T18:14:53.681045571Z github.com/rancher/fleet/internal/cmd/agent.start.func1({0x2cca1d8, 0xc0001185f0})
2023-09-28T18:14:53.681047991Z /go/src/github.com/rancher/fleet/internal/cmd/agent/start.go:58 +0x92
2023-09-28T18:14:53.681049480Z created by github.com/rancher/wrangler/pkg/leader.run.func1 in goroutine 50
2023-09-28T18:14:53.681050980Z /go/pkg/mod/github.com/rancher/wrangler@v1.1.1/pkg/leader/leader.go:58 +0x90
The following logs were observed in the Rancher pods:
2023/09/28 18:25:55 [INFO] [planner] rkecluster fleet-default/anuk3snewperf - machine fleet-default/anuk3snewperf-w-585db4f7cxhsw64-5p8cc - previous join server () was not valid, using new join server (https://[2600:3c01::f03c:93ff:fe00:698a]:6443)
2023/09/28 18:25:55 [INFO] [planner] rkecluster fleet-default/anuk3snewperf: waiting for machine fleet-default/anuk3snewperf-w-585db4f7cxhsw64-5p8cc driver config to be saved
2023/09/28 18:25:55 [INFO] [planner] rkecluster fleet-default/anuk3snewperf - machine fleet-default/anuk3snewperf-w-585db4f7cxhsw64-5p8cc - previous join server () was not valid, using new join server (https://[2600:3c01::f03c:93ff:fe00:698a]:6443)
2023/09/28 18:25:55 [INFO] [planner] rkecluster fleet-default/anuk3snewperf: waiting for machine fleet-default/anuk3snewperf-w-585db4f7cxhsw64-5p8cc driver config to be saved
2023/09/28 18:25:56 [INFO] [planner] rkecluster fleet-default/anuk3snewperf - machine fleet-default/anuk3snewperf-w-585db4f7cxhsw64-5p8cc - previous join server () was not valid, using new join server (https://[2600:3c01::f03c:93ff:fe00:698a]:6443)
2023/09/28 18:25:56 [INFO] [planner] rkecluster fleet-default/anuk3snewperf: waiting for machine fleet-default/anuk3snewperf-w-585db4f7cxhsw64-5p8cc driver config to be saved
I0928 18:48:20.966148 33 trace.go:236] Trace[2097947468]: "DeltaFIFO Pop Process" ID:fleet-default/crt-anuk3snewperf-crtb-grb-44p9v-rrptdbdmsf,Depth:9968,Reason:slow event handlers blocking the queue (28-Sep-2023 18:48:20.838) (total time: 128ms):
I0928 18:48:21.683776 33 trace.go:236] Trace[1394317482]: "DeltaFIFO Pop Process" ID:fleet-default/crt-anuk3snewperf-crtb-grb-zwrvf-cn6behj35g,Depth:9967,Reason:slow event handlers blocking the queue (28-Sep-2023 18:48:20.966) (total time: 717ms):
I0928 18:48:22.447887 33 trace.go:236] Trace[516340835]: "DeltaFIFO Pop Process" ID:fleet-default/crt-anuk3snewperf-crtb-grb-4vlpx-j4vb7lcspe,Depth:9964,Reason:slow event handlers blocking the queue (28-Sep-2023 18:48:22.195) (total time: 252ms):
I0928 18:48:24.431746 33 trace.go:236] Trace[1537574149]: "DeltaFIFO Pop Process" ID:fleet-default/crt-anuk3snewperf-crtb-grb-2csml-buttzmfwgc,Depth:9948,Reason:slow event handlers blocking the queue (28-Sep-2023 18:48:24.243) (total time: 187ms):
Expected Result
The fleet-agent pod does not panic, and the cluster is not stuck in the Updating state with pods in CrashLoopBackOff.
Additional context
Unable to reproduce the issue by retracing the steps on another setup. Adding additional nodes to the cluster and creating deployments both work even after the pod is stuck in CrashLoopBackOff.
About this issue
- Original URL
- State: closed
- Created 9 months ago
- Comments: 15 (5 by maintainers)
So how can I install 0.9 in 2.7.9?
System information:
I have followed the steps mentioned in the description. I didn't see any warning or error logs in the fleet-agent or fleet-controller pods. I also performed sanity testing around it by creating a GitRepo that installs an application from Helm, as well as a normal app deployment. Applications are deployed successfully without any errors.
Rancher 2.8-head as of today is still using rc3. Do we need a Rancher ticket to get this bumped on the Rancher side?
Fleet 0.9.0-rc.4 is now released.

Hi, we experienced the same issue on: Kubernetes 1.23.7, Rancher 2.7.6, fleet-agent 0.8.0
LOGS:
Temporarily resolved by downgrading the fleet agent to 0.7.1, which runs without problems.
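The commenter doesn't say how the downgrade was done; one possible sketch (an assumption, not a confirmed procedure) is to pin the downstream fleet-agent Deployment's image back to v0.7.1. Note that Rancher manages this Deployment and may reconcile the image back on its own:

```shell
# Hypothetical workaround sketch: pin the fleet-agent image to v0.7.1
# in the downstream cluster. Deployment and namespace names match the
# defaults seen in the logs above (cattle-fleet-system/fleet-agent).
kubectl -n cattle-fleet-system set image deployment/fleet-agent \
  fleet-agent=rancher/fleet-agent:v0.7.1

# Watch the rollout and confirm the pod stays out of CrashLoopBackOff.
kubectl -n cattle-fleet-system rollout status deployment/fleet-agent
```

This only suppresses the symptom; the nil-map panic in the registration code still needs the fix shipped in a newer Fleet release.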