longhorn: [BUG] CSI components CrashLoopBackOff, failed to connect to unix://csi/csi.sock after cluster restart
Describe the bug (🐛 if you encounter this issue)
After a cluster restart (rebooting all k8s nodes, including the control plane node), one pod of each CSI component got stuck in CrashLoopBackOff:
╰─$ kl get pods
NAME                                                     READY   STATUS             RESTARTS          AGE
longhorn-manager-pxlcx                                   1/1     Running            16 (10h ago)      15h
longhorn-manager-gr26g                                   1/1     Running            16 (10h ago)      15h
engine-image-ei-b907910b-m6gr4                           1/1     Running            16 (10h ago)      15h
instance-manager-bb7e4a2035b50ec3896800262c56800c        1/1     Running            0                 10h
engine-image-ei-b907910b-7vfnt                           1/1     Running            16 (10h ago)      15h
instance-manager-dca4e951d12ab6ab6f714f42f1a2416e        1/1     Running            0                 10h
csi-attacher-67f8c99bcd-9fqwj                            1/1     Running            20 (10h ago)      15h
csi-snapshotter-6f6bf9c757-t68d7                         1/1     Running            19 (10h ago)      15h
csi-resizer-545b7c64f5-7l2hv                             1/1     Running            19 (10h ago)      15h
csi-attacher-67f8c99bcd-2chq8                            1/1     Running            19 (10h ago)      15h
csi-provisioner-557c7f7c44-g2x77                         1/1     Running            20 (10h ago)      15h
longhorn-driver-deployer-7f94bb668f-s7zwg                1/1     Running            16 (10h ago)      15h
longhorn-csi-plugin-h5xqw                                3/3     Running            72 (10h ago)      15h
longhorn-ui-646f4bc8df-tbmtt                             1/1     Running            29 (10h ago)      15h
csi-resizer-545b7c64f5-gzm2z                             1/1     Running            19 (10h ago)      15h
csi-provisioner-557c7f7c44-2p6lc                         1/1     Running            19 (10h ago)      15h
engine-image-ei-b907910b-9ztgm                           1/1     Running            16 (10h ago)      15h
csi-snapshotter-6f6bf9c757-jqqp9                         1/1     Running            19 (10h ago)      15h
longhorn-csi-plugin-4sqkv                                3/3     Running            69 (10h ago)      15h
longhorn-manager-dfn6d                                   1/1     Running            16 (10h ago)      15h
instance-manager-0435193fed60526c87fd3a53fb03ba39        1/1     Running            0                 10h
share-manager-pvc-911b2b67-1e45-4f26-914f-23f2090af998   1/1     Running            0                 10h
share-manager-pvc-81c99bd1-b424-4be2-a075-80e8b17334ab   1/1     Running            0                 10h
longhorn-ui-646f4bc8df-c9tqj                             1/1     Running            35 (10h ago)      15h
csi-attacher-67f8c99bcd-jsj54                            0/1     CrashLoopBackOff   141 (2m49s ago)   15h
longhorn-csi-plugin-kxtqn                                0/3     CrashLoopBackOff   535 (2m18s ago)   15h
csi-snapshotter-6f6bf9c757-xwmmn                         0/1     CrashLoopBackOff   141 (2m14s ago)   15h
csi-provisioner-557c7f7c44-wd57k                         0/1     CrashLoopBackOff   141 (2m5s ago)    15h
csi-resizer-545b7c64f5-8jxk4                             0/1     CrashLoopBackOff   141 (61s ago)     15h
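In a listing this long, the failing pods are easier to spot when filtered by status. The following is a minimal sketch, assuming the default `kubectl get pods` column order shown above (NAME, READY, STATUS, ...). The function name `crashlooping_pods` is made up for illustration and is not part of Longhorn or kubectl.

```shell
# Print the names of pods whose STATUS column is CrashLoopBackOff.
# Reads `kubectl get pods` table output on stdin and assumes STATUS is the
# third whitespace-separated field, as in the listing above.
crashlooping_pods() {
    # NR > 1 skips the header row.
    awk 'NR > 1 && $3 == "CrashLoopBackOff" { print $1 }'
}

# Example (needs cluster access, so it is left commented out):
# kubectl -n longhorn-system get pods | crashlooping_pods
```

Run against the listing above, this would print the five pod names from the last five rows.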
They are all unable to connect to unix:///csi/csi.sock:
╰─$ kl logs csi-attacher-67f8c99bcd-jsj54
I1116 01:28:03.328201 1 main.go:97] Version: v4.4.0
W1116 01:28:13.329951 1 connection.go:183] Still connecting to unix:///csi/csi.sock
W1116 01:28:23.329624 1 connection.go:183] Still connecting to unix:///csi/csi.sock
W1116 01:28:33.330048 1 connection.go:183] Still connecting to unix:///csi/csi.sock
E1116 01:28:33.330124 1 main.go:136] context deadline exceeded
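To check whether the other crashing sidecars fail the same way, the log pattern above can be matched mechanically. This is a rough sketch rather than an official diagnostic: the function name `csi_connect_timed_out` is hypothetical, and it only looks for the two message strings that appear in the log excerpt above.

```shell
# Succeeds (exit status 0) if the log on stdin shows the pattern above:
# at least one "Still connecting to unix://" warning, followed later by
# "context deadline exceeded". Fails (exit status 1) otherwise.
csi_connect_timed_out() {
    awk '
        /Still connecting to unix:\/\// { waiting = 1 }
        waiting && /context deadline exceeded/ { timed_out = 1 }
        END { exit (timed_out ? 0 : 1) }
    '
}

# Example (needs cluster access, so it is left commented out):
# kubectl -n longhorn-system logs csi-resizer-545b7c64f5-8jxk4 \
#     | csi_connect_timed_out && echo "sidecar could not reach the CSI socket"
```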
To Reproduce
Run the negative test case "Restart Cluster While Workload Heavy Writing" repeatedly.
Expected behavior
Support bundle for troubleshooting
supportbundle_ae1d1892-8da7-4733-97d7-16326976bb0e_2023-11-16T03-17-09Z.zip
worker nodes logs: worker_nodes_log.txt
Environment
- Longhorn version: v1.5.x-head
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.27.1+k3s1
- Number of management node in the cluster:
- Number of worker node in the cluster:
- Node config
- OS type and version:
- Kernel version:
- CPU per node:
- Memory per node:
- Disk type(e.g. SSD/NVMe/HDD):
- Network bandwidth between the nodes:
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
- Number of Longhorn volumes in the cluster:
- Impacted Longhorn resources:
- Volume names:
Additional context
About this issue
- State: closed
- Created 8 months ago
- Comments: 16 (13 by maintainers)
This didn’t help to reproduce the issue.
I’ll look into this a bit when I have time to see if there is an improvement we can make. It seems different from the original event, since, in that case, the CSI components never recovered.
ref: https://github.com/longhorn/longhorn/issues/7428
Reproduce steps:
1. `kubectl edit -n longhorn-system service longhorn-backend` and change the port from `9500` to `9501`. This effectively breaks the `longhorn-backend` service temporarily.
2. `kubectl delete -n longhorn-system pod longhorn-csi-plugin-<xxxxxx>`.
3. `kubectl edit -n longhorn-system service longhorn-backend` and restore the correct port.
4. Observe the recreated `longhorn-csi-plugin` pod. It can never recover (or at least not for a long time) and exhibits the behavior described above.

In the example below, I kept the `longhorn-backend` service broken for `6m30s`. The situation had not improved `7m30s` after that.

Deleting `longhorn-csi-plugin-6zxtn` immediately resolved the issue.
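For repeated runs, the reproduce steps above could be scripted roughly as follows. This is a sketch under assumptions that are not in the original comment: `kubectl patch` stands in for the interactive `kubectl edit`, the port to change is assumed to be the first entry in the Service's `spec.ports`, and the function names are made up for illustration.

```shell
# Sketch of the reproduce steps as shell functions.
# Assumptions: the affected port is spec.ports[0] of the longhorn-backend
# Service, and `kubectl patch` replaces the interactive `kubectl edit`.
NS=longhorn-system

# Set the longhorn-backend Service port to the given value
# (9501 to break it, 9500 to restore it).
set_backend_port() {
    kubectl -n "$NS" patch service longhorn-backend --type=json \
        -p "[{\"op\":\"replace\",\"path\":\"/spec/ports/0/port\",\"value\":$1}]"
}

# Delete one longhorn-csi-plugin pod by its full name so that it is recreated.
delete_csi_plugin_pod() {
    kubectl -n "$NS" delete pod "$1"
}

# Break the Service, recreate the plugin pod while it is broken, wait,
# then restore the port. Stops at the first failing step.
reproduce() {
    pod="$1"
    broken_for="${2:-390}"   # seconds; roughly the 6m30s from the comment
    set_backend_port 9501 &&
    delete_csi_plugin_pod "$pod" &&
    sleep "$broken_for" &&
    set_backend_port 9500
}
```

Calling `reproduce longhorn-csi-plugin-<xxxxxx>` with a real pod name runs the sequence. Watching `kubectl -n longhorn-system get pods -w` afterwards shows whether the recreated plugin pod recovers.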