longhorn: [BUG] CSI components CrashLoopBackOff, failed to connect to unix://csi/csi.sock after cluster restart

Describe the bug (🐛 if you encounter this issue)

After a cluster restart (rebooting all Kubernetes nodes, including the control plane node), several CSI component pods got stuck in CrashLoopBackOff:

╰─$ kl get pods
NAME                                                     READY   STATUS             RESTARTS          AGE
longhorn-manager-pxlcx                                   1/1     Running            16 (10h ago)      15h
longhorn-manager-gr26g                                   1/1     Running            16 (10h ago)      15h
engine-image-ei-b907910b-m6gr4                           1/1     Running            16 (10h ago)      15h
instance-manager-bb7e4a2035b50ec3896800262c56800c        1/1     Running            0                 10h
engine-image-ei-b907910b-7vfnt                           1/1     Running            16 (10h ago)      15h
instance-manager-dca4e951d12ab6ab6f714f42f1a2416e        1/1     Running            0                 10h
csi-attacher-67f8c99bcd-9fqwj                            1/1     Running            20 (10h ago)      15h
csi-snapshotter-6f6bf9c757-t68d7                         1/1     Running            19 (10h ago)      15h
csi-resizer-545b7c64f5-7l2hv                             1/1     Running            19 (10h ago)      15h
csi-attacher-67f8c99bcd-2chq8                            1/1     Running            19 (10h ago)      15h
csi-provisioner-557c7f7c44-g2x77                         1/1     Running            20 (10h ago)      15h
longhorn-driver-deployer-7f94bb668f-s7zwg                1/1     Running            16 (10h ago)      15h
longhorn-csi-plugin-h5xqw                                3/3     Running            72 (10h ago)      15h
longhorn-ui-646f4bc8df-tbmtt                             1/1     Running            29 (10h ago)      15h
csi-resizer-545b7c64f5-gzm2z                             1/1     Running            19 (10h ago)      15h
csi-provisioner-557c7f7c44-2p6lc                         1/1     Running            19 (10h ago)      15h
engine-image-ei-b907910b-9ztgm                           1/1     Running            16 (10h ago)      15h
csi-snapshotter-6f6bf9c757-jqqp9                         1/1     Running            19 (10h ago)      15h
longhorn-csi-plugin-4sqkv                                3/3     Running            69 (10h ago)      15h
longhorn-manager-dfn6d                                   1/1     Running            16 (10h ago)      15h
instance-manager-0435193fed60526c87fd3a53fb03ba39        1/1     Running            0                 10h
share-manager-pvc-911b2b67-1e45-4f26-914f-23f2090af998   1/1     Running            0                 10h
share-manager-pvc-81c99bd1-b424-4be2-a075-80e8b17334ab   1/1     Running            0                 10h
longhorn-ui-646f4bc8df-c9tqj                             1/1     Running            35 (10h ago)      15h
csi-attacher-67f8c99bcd-jsj54                            0/1     CrashLoopBackOff   141 (2m49s ago)   15h
longhorn-csi-plugin-kxtqn                                0/3     CrashLoopBackOff   535 (2m18s ago)   15h
csi-snapshotter-6f6bf9c757-xwmmn                         0/1     CrashLoopBackOff   141 (2m14s ago)   15h
csi-provisioner-557c7f7c44-wd57k                         0/1     CrashLoopBackOff   141 (2m5s ago)    15h
csi-resizer-545b7c64f5-8jxk4                             0/1     CrashLoopBackOff   141 (61s ago)     15h

They are all unable to connect to unix:///csi/csi.sock:

╰─$ kl logs csi-attacher-67f8c99bcd-jsj54
I1116 01:28:03.328201       1 main.go:97] Version: v4.4.0
W1116 01:28:13.329951       1 connection.go:183] Still connecting to unix:///csi/csi.sock
W1116 01:28:23.329624       1 connection.go:183] Still connecting to unix:///csi/csi.sock
W1116 01:28:33.330048       1 connection.go:183] Still connecting to unix:///csi/csi.sock
E1116 01:28:33.330124       1 main.go:136] context deadline exceeded
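
For anyone triaging a similar state, a few diagnostic commands may help. The pod name below is taken from the listing above; the container name longhorn-csi-plugin and the host plugin directory /var/lib/kubelet/plugins/driver.longhorn.io/ are assumptions based on a default Longhorn install and may differ in your environment:

# Previous logs of the main CSI container (container name is an assumption)
kubectl -n longhorn-system logs longhorn-csi-plugin-kxtqn -c longhorn-csi-plugin --previous

# Last state and restart reasons for all three containers in the pod
kubectl -n longhorn-system describe pod longhorn-csi-plugin-kxtqn

# On the affected node: check whether the CSI socket is actually being served
# (path assumes the default kubelet root dir and the driver.longhorn.io driver name)
ls -l /var/lib/kubelet/plugins/driver.longhorn.io/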

To Reproduce

Run the negative test case Restart Cluster While Workload Heavy Writing repeatedly.

Expected behavior

All CSI components should recover and reconnect to the CSI socket once the cluster is back up.

Support bundle for troubleshooting

supportbundle_ae1d1892-8da7-4733-97d7-16326976bb0e_2023-11-16T03-17-09Z.zip

worker nodes logs: worker_nodes_log.txt

Environment

  • Longhorn version: v1.5.x-head
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.27.1+k3s1
    • Number of management node in the cluster:
    • Number of worker node in the cluster:
  • Node config
    • OS type and version:
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type(e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:
  • Impacted Longhorn resources:
    • Volume names:

Additional context

https://suse.slack.com/archives/C02DR3N5T24/p1700095454722879?thread_ts=1699951700.806219&cid=C02DR3N5T24

About this issue

  • Original URL
  • State: closed
  • Created 8 months ago
  • Comments: 16 (13 by maintainers)

Most upvoted comments

I kicked off https://ci.longhorn.io/job/private/job/longhorn-e2e-test/129/ to try to reproduce it for live debugging. If that doesn’t work, we will likely need to wait for another occurrence.

This didn’t help to reproduce the issue.

I hit this one when restarting a one-node cluster. I was planning to create a ticket, but it looks like one already exists here. The issue in my case is that longhorn-csi-plugin stays in a crash loop for about 4-5 minutes and then recovers. Not sure if this is just a slowness issue.

I’ll look into this a bit when I have time to see if there is an improvement we can make. It seems different from the original event, since, in that case, the CSI components never recovered.

I reviewed #6916 and it looks like we really do need to go to livenessprobe v2.11.0 if we want to mitigate the CVEs it mentions.

The upstream bug has some traction, but it’s unlikely we’ll see a version we can grab and test before v1.6.0 releases. I’ll move forward with my mitigation PR.

* [x]  Create an issue to revert the mitigation when we upgrade livenessprobe if the mitigation changes are approved and merged.

ref: https://github.com/longhorn/longhorn/issues/7428

Reproduce steps (a scripted sketch follows the list):

  • kubectl edit -n longhorn-system service longhorn-backend and change the port from 9500 to 9501. This effectively breaks the longhorn-backend service temporarily.
  • kubectl delete -n longhorn-system pod longhorn-csi-plugin-<xxxxxx>.
  • Wait some length of time. All three containers go into CrashLoopBackOff.
  • kubectl edit -n longhorn-system service longhorn-backend and restore the correct port.
  • Monitor longhorn-csi-plugin. It never recovers (or at least not for a long time) and exhibits the behavior described above.
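
The same steps as a rough shell sketch, assuming the default longhorn-system namespace and that the service's first port entry is the 9500 one (the pod name stays a placeholder):

# Break the longhorn-backend service by pointing it at the wrong port
kubectl -n longhorn-system patch service longhorn-backend --type=json \
  -p='[{"op":"replace","path":"/spec/ports/0/port","value":9501}]'

# Recreate the CSI plugin pod so it starts up against the broken service
kubectl -n longhorn-system delete pod longhorn-csi-plugin-<xxxxxx>

# Wait until all three containers are in CrashLoopBackOff, then restore the port
kubectl -n longhorn-system patch service longhorn-backend --type=json \
  -p='[{"op":"replace","path":"/spec/ports/0/port","value":9500}]'

# Watch the pod; it does not recover even though the service is healthy again
kubectl -n longhorn-system get pods -w | grep longhorn-csi-plugin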

In the example below, I kept the longhorn-backend service broken for 6m30s. The situation was not improved 7m30s after that.

longhorn-csi-plugin-6zxtn                           0/3     CrashLoopBackOff   16 (1s ago)     6m36s
longhorn-csi-plugin-6zxtn                           2/3     CrashLoopBackOff   18 (2m11s ago)   8m46s
longhorn-csi-plugin-6zxtn                           0/3     CrashLoopBackOff   18 (2m41s ago)   9m16s
longhorn-csi-plugin-6zxtn                           0/3     CrashLoopBackOff   18 (1s ago)      9m17s
longhorn-csi-plugin-6zxtn                           1/3     CrashLoopBackOff   19 (2m27s ago)   11m
longhorn-csi-plugin-6zxtn                           1/3     CrashLoopBackOff   20 (2s ago)      11m
longhorn-csi-plugin-6zxtn                           0/3     CrashLoopBackOff   20 (1s ago)      12m
longhorn-csi-plugin-6zxtn                           2/3     CrashLoopBackOff   22 (2m16s ago)   14m
longhorn-csi-plugin-6zxtn                           0/3     CrashLoopBackOff   22 (2m46s ago)   14m
longhorn-csi-plugin-6zxtn                           0/3     CrashLoopBackOff   22 (2s ago)      14m

Deleting longhorn-csi-plugin-6zxtn immediately resolved the issue.
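
For reference, the manual workaround is simply deleting the stuck DaemonSet pod so it is recreated:

kubectl -n longhorn-system delete pod longhorn-csi-plugin-6zxtn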