longhorn: [BUG] Constant errors about io.rancher.longhorn-reg.sock

Describe the bug The kubelet log on every node shows the same group of errors, repeating every few minutes:

I0527 14:47:10.544314   10814 reconciler.go:156] operationExecutor.RegisterPlugin started for plugin at "/var/lib/kubelet/plugins/io.rancher.longhorn-reg.sock" (plugin details: &{/var/lib/kubelet/plugins/io.rancher.longhorn-reg.sock true 2020-03-21 22:46:57.40373808 +0000 UTC m=+21.619442778})
I0527 14:47:10.544409   10814 operation_generator.go:193] parsed scheme: ""
I0527 14:47:10.544427   10814 operation_generator.go:193] scheme "" not registered, fallback to default scheme
I0527 14:47:10.544452   10814 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/plugins/io.rancher.longhorn-reg.sock 0  <nil>}] <nil>}
I0527 14:47:10.544478   10814 clientconn.go:577] ClientConn switching balancer to "pick_first"
I0527 14:47:10.544581   10814 balancer_conn_wrappers.go:127] pickfirstBalancer: HandleSubConnStateChange: 0xc001438790, CONNECTING
W0527 14:47:10.544781   10814 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins/io.rancher.longhorn-reg.sock 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins/io.rancher.longhorn-reg.sock: connect: connection refused". Reconnecting...
I0527 14:47:10.544846   10814 balancer_conn_wrappers.go:127] pickfirstBalancer: HandleSubConnStateChange: 0xc001438790, TRANSIENT_FAILURE
I0527 14:47:11.545032   10814 balancer_conn_wrappers.go:127] pickfirstBalancer: HandleSubConnStateChange: 0xc001438790, CONNECTING
W0527 14:47:11.545080   10814 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins/io.rancher.longhorn-reg.sock 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins/io.rancher.longhorn-reg.sock: connect: connection refused". Reconnecting...
I0527 14:47:11.545120   10814 balancer_conn_wrappers.go:127] pickfirstBalancer: HandleSubConnStateChange: 0xc001438790, TRANSIENT_FAILURE
I0527 14:47:13.214754   10814 balancer_conn_wrappers.go:127] pickfirstBalancer: HandleSubConnStateChange: 0xc001438790, CONNECTING
W0527 14:47:13.214799   10814 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins/io.rancher.longhorn-reg.sock 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins/io.rancher.longhorn-reg.sock: connect: connection refused". Reconnecting...
I0527 14:47:13.214866   10814 balancer_conn_wrappers.go:127] pickfirstBalancer: HandleSubConnStateChange: 0xc001438790, TRANSIENT_FAILURE
I0527 14:47:15.865668   10814 balancer_conn_wrappers.go:127] pickfirstBalancer: HandleSubConnStateChange: 0xc001438790, CONNECTING
W0527 14:47:15.865765   10814 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins/io.rancher.longhorn-reg.sock 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins/io.rancher.longhorn-reg.sock: connect: connection refused". Reconnecting...
I0527 14:47:15.865828   10814 balancer_conn_wrappers.go:127] pickfirstBalancer: HandleSubConnStateChange: 0xc001438790, TRANSIENT_FAILURE
W0527 14:47:17.528690   10814 kubelet_pods.go:849] Unable to retrieve pull secret longhorn-system/ for longhorn-system/longhorn-csi-plugin-9mrj8 due to secret "" not found.  The image pull may not succeed.
E0527 14:47:20.544545   10814 goroutinemap.go:150] Operation for "/var/lib/kubelet/plugins/io.rancher.longhorn-reg.sock" failed. No retries permitted until 2020-05-27 14:49:22.544513867 +0000 UTC m=+5760166.760218627 (durationBeforeRetry 2m2s). Error: "RegisterPlugin error -- dial failed at socket /var/lib/kubelet/plugins/io.rancher.longhorn-reg.sock, err: failed to dial socket /var/lib/kubelet/plugins/io.rancher.longhorn-reg.sock, err: context deadline exceeded"
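
For reference, the failing operation is the kubelet's plugin watcher dialing a plain unix socket over gRPC; "connection refused" means the socket file exists but nothing is listening behind it. A minimal Go sketch (an assumption on my part, not part of the original report: run it directly on an affected node) performs the same kind of dial the kubelet attempts:

```go
// probe.go — a hedged sketch: dials the same registration socket the
// kubelet's plugin watcher keeps retrying. If this prints "connection
// refused", the socket file is present but no plugin is serving it.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	const sock = "/var/lib/kubelet/plugins/io.rancher.longhorn-reg.sock"
	conn, err := net.DialTimeout("unix", sock, 2*time.Second)
	if err != nil {
		// Matches the "dial unix ... connect: connection refused" in the log.
		fmt.Printf("dial %s failed: %v\n", sock, err)
		return
	}
	defer conn.Close()
	fmt.Println("socket is being served; the registration failure lies elsewhere")
}
```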

To Reproduce I have two clusters, but this error happens on only one of them: an RKE cluster running on 3 bare-metal servers. The error is logged on all nodes. I already tried removing Longhorn, including manually removing the CRDs, but the error PERSISTS even after Longhorn is gone. I then reinstalled the latest version, and the error continues.

Expected behavior No errors in kubelet logs.

Environment:

  • Longhorn version: 0.8.1 (the same with 0.7.0)
  • Kubernetes version: 1.16.6
  • Node OS type and version: Ubuntu 18.04

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 19 (16 by maintainers)

Most upvoted comments

After following the instructions at https://longhorn.io/docs/1.0.0/deploy/upgrade/longhorn-manager/#cleanup-for-compatible-csi-plugin, the error messages related to io.rancher.longhorn-reg.sock disappeared.
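
The per-node effect of that cleanup amounts to getting rid of the stale registration socket so the kubelet stops retrying it. A hedged sketch of that step only (this assumes the leftover artifact is just the socket file and that the kubelet is restarted afterwards; it is not the linked instructions verbatim):

```go
// cleanup_sketch.go — assumption: the lingering artifact after uninstall is
// the stale io.rancher.longhorn registration socket under the kubelet
// plugins directory. Removes matching socket files on the current node.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	matches, err := filepath.Glob("/var/lib/kubelet/plugins/io.rancher.longhorn*.sock")
	if err != nil {
		fmt.Fprintln(os.Stderr, "glob:", err)
		os.Exit(1)
	}
	for _, m := range matches {
		if err := os.Remove(m); err != nil {
			fmt.Fprintf(os.Stderr, "remove %s: %v\n", m, err)
			continue
		}
		fmt.Println("removed stale socket:", m)
	}
}
```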

To reproduce the issue, I tried the steps from https://github.com/longhorn/longhorn/issues/1412#issuecomment-634927383

@Leen15, thanks for confirming, and good to hear the issue is gone for you.