rancher: SIGSEGV segmentation violation when creating custom cluster

Rancher Server Setup

  • Rancher version: 2.6.2
  • Installation option (Docker install/Helm Chart): Helm on RKE2

Information about the Cluster

  • Kubernetes version: v1.21.4+rke2r3
  • Cluster Type (Local/Downstream): Custom RKE2

Describe the bug

When an RKE2 custom cluster node tries to join the server, Rancher panics with a SIGSEGV and crashes.

To Reproduce

  1. Create an RKE2 custom cluster (I am using the Terraform provider to do this on vSphere)
  2. Attempt to join a node to that cluster
  3. Witness a SIGSEGV

Result

The Rancher pod crashes with the following output:

2021/10/26 16:57:40 [INFO] rkecluster fleet-default/example: uncordoning bootstrap node(s) custom-d7d698863879: waiting for cluster agent to be available
E1026 16:57:40.037421      33 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 5504 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x39b9940, 0x6920620)
        /go/pkg/mod/k8s.io/apimachinery@v0.21.0/pkg/util/runtime/runtime.go:74 +0x95
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/pkg/mod/k8s.io/apimachinery@v0.21.0/pkg/util/runtime/runtime.go:48 +0x86
panic(0x39b9940, 0x6920620)
        /usr/local/go/src/runtime/panic.go:965 +0x1b9
github.com/rancher/rancher/pkg/controllers/provisioningv2/rke2/machinedrain.(*handler).undrain(0xc002481260, 0xc000211d80, 0x41bfaec, 0x16, 0xc00122f778)
        /go/src/github.com/rancher/rancher/pkg/controllers/provisioningv2/rke2/machinedrain/machinedrain.go:74 +0xc9
github.com/rancher/rancher/pkg/controllers/provisioningv2/rke2/machinedrain.(*handler).OnChange(0xc002481260, 0xc0025a68d0, 0x21, 0xc000211d80, 0x3fe3000, 0x410dc20, 0x494ade0)
        /go/src/github.com/rancher/rancher/pkg/controllers/provisioningv2/rke2/machinedrain/machinedrain.go:48 +0xd4
github.com/rancher/rancher/pkg/generated/controllers/cluster.x-k8s.io/v1alpha4.FromMachineHandlerToHandler.func1(0xc0025a68d0, 0x21, 0x48a88b0, 0xc000211d80, 0xc000211d80, 0x494ade0, 0xc000211d80, 0x1)
        /go/src/github.com/rancher/rancher/pkg/generated/controllers/cluster.x-k8s.io/v1alpha4/machine.go:105 +0x6b
github.com/rancher/lasso/pkg/controller.SharedControllerHandlerFunc.OnChange(0xc0029b4b10, 0xc0025a68d0, 0x21, 0x48a88b0, 0xc000211d80, 0x0, 0xc000211d80, 0x0, 0x0)
        /go/pkg/mod/github.com/rancher/lasso@v0.0.0-20210616224652-fc3ebd901c08/pkg/controller/sharedcontroller.go:29 +0x4e
github.com/rancher/lasso/pkg/controller.(*SharedHandler).OnChange(0xc002151a40, 0xc0025a68d0, 0x21, 0x48a88b0, 0xc000211d80, 0xc0031e2b01, 0x0)
        /go/pkg/mod/github.com/rancher/lasso@v0.0.0-20210616224652-fc3ebd901c08/pkg/controller/sharedhandler.go:69 +0x14c
github.com/rancher/lasso/pkg/controller.(*controller).syncHandler(0xc000bffb80, 0xc0025a68d0, 0x21, 0xc0031e2c58, 0x0)
        /go/pkg/mod/github.com/rancher/lasso@v0.0.0-20210616224652-fc3ebd901c08/pkg/controller/controller.go:215 +0xd1
github.com/rancher/lasso/pkg/controller.(*controller).processSingleItem(0xc000bffb80, 0x3793300, 0xc00aeffdf0, 0x0, 0x0)
        /go/pkg/mod/github.com/rancher/lasso@v0.0.0-20210616224652-fc3ebd901c08/pkg/controller/controller.go:197 +0xe7
github.com/rancher/lasso/pkg/controller.(*controller).processNextWorkItem(0xc000bffb80, 0x203001)
        /go/pkg/mod/github.com/rancher/lasso@v0.0.0-20210616224652-fc3ebd901c08/pkg/controller/controller.go:174 +0x54
github.com/rancher/lasso/pkg/controller.(*controller).runWorker(...)
        /go/pkg/mod/github.com/rancher/lasso@v0.0.0-20210616224652-fc3ebd901c08/pkg/controller/controller.go:163
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc001e75a20)
        /go/pkg/mod/k8s.io/apimachinery@v0.21.0/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc001e75a20, 0x487a680, 0xc001854960, 0x6567616e616d2201, 0xc000115f80)
        /go/pkg/mod/k8s.io/apimachinery@v0.21.0/pkg/util/wait/wait.go:156 +0x9b
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001e75a20, 0x3b9aca00, 0x0, 0x2261746164617401, 0xc000115f80)
        /go/pkg/mod/k8s.io/apimachinery@v0.21.0/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(0xc001e75a20, 0x3b9aca00, 0xc000115f80)
        /go/pkg/mod/k8s.io/apimachinery@v0.21.0/pkg/util/wait/wait.go:90 +0x4d
created by github.com/rancher/lasso/pkg/controller.(*controller).run
        /go/pkg/mod/github.com/rancher/lasso@v0.0.0-20210616224652-fc3ebd901c08/pkg/controller/controller.go:134 +0x33b
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x2d17c69]

goroutine 5504 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/pkg/mod/k8s.io/apimachinery@v0.21.0/pkg/util/runtime/runtime.go:55 +0x109
panic(0x39b9940, 0x6920620)
        /usr/local/go/src/runtime/panic.go:965 +0x1b9
github.com/rancher/rancher/pkg/controllers/provisioningv2/rke2/machinedrain.(*handler).undrain(0xc002481260, 0xc000211d80, 0x41bfaec, 0x16, 0xc00122f778)
        /go/src/github.com/rancher/rancher/pkg/controllers/provisioningv2/rke2/machinedrain/machinedrain.go:74 +0xc9
github.com/rancher/rancher/pkg/controllers/provisioningv2/rke2/machinedrain.(*handler).OnChange(0xc002481260, 0xc0025a68d0, 0x21, 0xc000211d80, 0x3fe3000, 0x410dc20, 0x494ade0)
        /go/src/github.com/rancher/rancher/pkg/controllers/provisioningv2/rke2/machinedrain/machinedrain.go:48 +0xd4
github.com/rancher/rancher/pkg/generated/controllers/cluster.x-k8s.io/v1alpha4.FromMachineHandlerToHandler.func1(0xc0025a68d0, 0x21, 0x48a88b0, 0xc000211d80, 0xc000211d80, 0x494ade0, 0xc000211d80, 0x1)
        /go/src/github.com/rancher/rancher/pkg/generated/controllers/cluster.x-k8s.io/v1alpha4/machine.go:105 +0x6b
github.com/rancher/lasso/pkg/controller.SharedControllerHandlerFunc.OnChange(0xc0029b4b10, 0xc0025a68d0, 0x21, 0x48a88b0, 0xc000211d80, 0x0, 0xc000211d80, 0x0, 0x0)
        /go/pkg/mod/github.com/rancher/lasso@v0.0.0-20210616224652-fc3ebd901c08/pkg/controller/sharedcontroller.go:29 +0x4e
github.com/rancher/lasso/pkg/controller.(*SharedHandler).OnChange(0xc002151a40, 0xc0025a68d0, 0x21, 0x48a88b0, 0xc000211d80, 0xc0031e2b01, 0x0)
        /go/pkg/mod/github.com/rancher/lasso@v0.0.0-20210616224652-fc3ebd901c08/pkg/controller/sharedhandler.go:69 +0x14c
github.com/rancher/lasso/pkg/controller.(*controller).syncHandler(0xc000bffb80, 0xc0025a68d0, 0x21, 0xc0031e2c58, 0x0)
        /go/pkg/mod/github.com/rancher/lasso@v0.0.0-20210616224652-fc3ebd901c08/pkg/controller/controller.go:215 +0xd1
github.com/rancher/lasso/pkg/controller.(*controller).processSingleItem(0xc000bffb80, 0x3793300, 0xc00aeffdf0, 0x0, 0x0)
        /go/pkg/mod/github.com/rancher/lasso@v0.0.0-20210616224652-fc3ebd901c08/pkg/controller/controller.go:197 +0xe7
github.com/rancher/lasso/pkg/controller.(*controller).processNextWorkItem(0xc000bffb80, 0x203001)
        /go/pkg/mod/github.com/rancher/lasso@v0.0.0-20210616224652-fc3ebd901c08/pkg/controller/controller.go:174 +0x54
github.com/rancher/lasso/pkg/controller.(*controller).runWorker(...)
        /go/pkg/mod/github.com/rancher/lasso@v0.0.0-20210616224652-fc3ebd901c08/pkg/controller/controller.go:163
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc001e75a20)
        /go/pkg/mod/k8s.io/apimachinery@v0.21.0/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc001e75a20, 0x487a680, 0xc001854960, 0x6567616e616d2201, 0xc000115f80)
        /go/pkg/mod/k8s.io/apimachinery@v0.21.0/pkg/util/wait/wait.go:156 +0x9b
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001e75a20, 0x3b9aca00, 0x0, 0x2261746164617401, 0xc000115f80)
        /go/pkg/mod/k8s.io/apimachinery@v0.21.0/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(0xc001e75a20, 0x3b9aca00, 0xc000115f80)
        /go/pkg/mod/k8s.io/apimachinery@v0.21.0/pkg/util/wait/wait.go:90 +0x4d
created by github.com/rancher/lasso/pkg/controller.(*controller).run
        /go/pkg/mod/github.com/rancher/lasso@v0.0.0-20210616224652-fc3ebd901c08/pkg/controller/controller.go:134 +0x33b

The rancher-system-agent log on the node shows:


Oct 26 17:06:01 example-02 rancher-system-agent[1409]: time="2021-10-26T17:06:01Z" level=debug msg="[K8s] Processing secret custom-3751b382200a-machine-plan in namespace fleet-default at generation 0 with resource version 11200452"
Oct 26 17:06:01 example-02 rancher-system-agent[1409]: W1026 17:06:01.957582    1409 reflector.go:437] pkg/mod/github.com/rancher/client-go@v0.21.0-rancher.1/tools/cache/reflector.go:168: watch of *v1.Secret ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 85; INTERNAL_ERROR") has prevented the request from succeeding

Expected Result

The node should be able to join the cluster.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 18 (11 by maintainers)

Most upvoted comments

I worked with @braunsonm and determined that this appears to happen because an upgrade strategy is defined on the cluster. With drain options enabled, Rancher/v2prov tries to drain a machine that does not yet have a noderef, and the machinedrain handler dereferences the missing reference.

I have not been able to reproduce this in my lab setup, but @braunsonm was able to provision a cluster without errors after removing the upgrade_strategy stanza from the Terraform configuration.

The upgrade strategy configuration for the cluster was:

upgradeStrategy:
  controlPlaneDrainOptions:
    enabled: true
    ignoreDaemonSets: true
    timeout: 300
  workerDrainOptions:
    enabled: true
    ignoreDaemonSets: true
    timeout: 300