longhorn: [BUG] "failed to cleanup service csi-attacher: Foreground deletion of service csi-attacher timed out"
Describe the bug
After upgrading to v1.1.1 I am now having the longhorn-driver-deployer pod continually crash, never finishing.
To Reproduce
Sadly, I don’t know of a specific way to reproduce this. I was having issues with my cluster when I upgraded that didn’t show up until after the upgrade started: my Calico configuration was using IP addresses from the wrong interface, which caused some really odd issues throughout the cluster, and I ended up force-killing some of the pods. I suspect that contributed to getting into this state, but I have no idea how to resolve the issue at this point.
Expected behavior
Obviously the driver deployer should not crash and should complete as expected =] Since I don’t actually know what it does, I’m not sure what that means, really.
Log
2021/05/03 18:27:36 proto: duplicate proto type registered: VersionResponse
W0503 18:27:36.570867 1 client_config.go:541] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time="2021-05-03T18:27:36Z" level=debug msg="Deploying CSI driver"
time="2021-05-03T18:27:36Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2021-05-03T18:27:37Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2021-05-03T18:27:38Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2021-05-03T18:27:39Z" level=info msg="Proc found: kubelet"
time="2021-05-03T18:27:39Z" level=info msg="Try to find arg [--root-dir] in cmdline: [/usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime=remote --container-runtime-endpoint=/run/containerd/containerd.sock ]"
time="2021-05-03T18:27:39Z" level=warning msg="Cmdline of proc kubelet found: \"/usr/bin/kubelet\x00--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf\x00--kubeconfig=/etc/kubernetes/kubelet.conf\x00--config=/var/lib/kubelet/config.yaml\x00--container-runtime=remote\x00--container-runtime-endpoint=/run/containerd/containerd.sock\x00\". But arg \"--root-dir\" not found. Hence default value will be used: \"/var/lib/kubelet\""
time="2021-05-03T18:27:39Z" level=info msg="Detected root dir path: /var/lib/kubelet"
time="2021-05-03T18:27:39Z" level=info msg="Upgrading Longhorn related components for CSI v1.1.0"
time="2021-05-03T18:27:39Z" level=debug msg="Deleting existing CSI Driver driver.longhorn.io"
time="2021-05-03T18:27:39Z" level=debug msg="Deleted CSI Driver driver.longhorn.io"
time="2021-05-03T18:27:39Z" level=debug msg="Waiting for foreground deletion of CSI Driver driver.longhorn.io"
time="2021-05-03T18:27:39Z" level=debug msg="Deleted CSI Driver driver.longhorn.io in foreground"
time="2021-05-03T18:27:39Z" level=debug msg="Creating CSI Driver driver.longhorn.io"
time="2021-05-03T18:27:39Z" level=debug msg="Created CSI Driver driver.longhorn.io"
time="2021-05-03T18:27:39Z" level=debug msg="Waiting for foreground deletion of service csi-attacher"
time="2021-05-03T18:29:40Z" level=fatal msg="Error deploying driver: failed to start CSI driver: failed to deploy service csi-attacher: failed to cleanup service csi-attacher: Foreground deletion of service csi-attacher timed out"
Environment:
- Longhorn version: v1.1.1 (coming from v1.1.1-rc1)
- Installation method: kubectl
- Kubernetes distro: kubeadm, v1.21.0
- Number of management nodes in the cluster: 3
- Number of worker nodes in the cluster: 4 (plus the 3 management nodes, which are also worker nodes)
- Node config
- OS type and version: Ubuntu 20.04.2 LTS
- CPU per node: varies
- Memory per node: varies
- Disk type (e.g. SSD/NVMe): SSD/NVMe, some magnetic disks in RAID 0 (which work well, btw)
- Network bandwidth between the nodes: 10GbE
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
- Number of Longhorn volumes in the cluster: ~15
Additional context
Not sure what else would be useful; I am available in the Slack channel for discussion if that would help. Things are finally stable now, but I would really like to get this fixed, as I don’t know what issues it will cause.
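For what it’s worth, this is the kind of check that shows whether the csi-attacher service is actually stuck in that foreground deletion (a minimal sketch, assuming Longhorn’s default longhorn-system namespace):

# A service stuck mid-deletion has a non-empty deletionTimestamp and, typically,
# a foregroundDeletion finalizer that the garbage collector has not cleared yet.
kubectl -n longhorn-system get service csi-attacher \
  -o jsonpath='{.metadata.deletionTimestamp}{"  "}{.metadata.finalizers}{"\n"}'

# The full object, finalizers included, can be inspected with:
kubectl -n longhorn-system get service csi-attacher -o yaml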
About this issue
- Original URL
- State: open
- Created 3 years ago
- Reactions: 1
- Comments: 21 (10 by maintainers)
I encountered a similar issue with v1.3.1 on k3s 1.24. It’s possible that having FluxCD set to automatically upgrade the Helm chart broke something (I’d originally started with 1.2.3).
Eventually I deleted all of the CSI resources one by one using kubectl until the deployer managed to finish. I ended up having to delete the deployments, replicasets and pods of the csi-attacher, csi-provisioner, csi-resizer and csi-snapshotter individually, since the deletion wasn’t cascading for some reason. Then I deleted the associated services and finally the longhorn-csi-plugin daemonset and pods (again, the cascading delete was somehow broken); roughly the sequence sketched below.
Ultimately I had too many issues with Longhorn; there’s a lot I like about it, but it wasn’t reliable enough for my needs, so I’ve switched to Rook/Ceph.
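The cleanup described above, expressed as kubectl commands (a sketch only: it assumes Longhorn’s default longhorn-system namespace and the component names mentioned in this thread, so check what actually exists in your cluster before deleting anything; the driver deployer recreates these objects once it can finish):

# Delete the CSI sidecar deployments; --ignore-not-found tolerates components
# that a given Longhorn version doesn't ship.
kubectl -n longhorn-system delete deployment --ignore-not-found \
  csi-attacher csi-provisioner csi-resizer csi-snapshotter

# If the cascade is broken, their replicasets and pods linger; list them and
# delete the leftovers by name.
kubectl -n longhorn-system get replicasets,pods | grep csi-

# Then the associated services.
kubectl -n longhorn-system delete service --ignore-not-found \
  csi-attacher csi-provisioner csi-resizer csi-snapshotter

# Finally the CSI plugin daemonset (and its pods, if they also fail to cascade).
kubectl -n longhorn-system delete daemonset longhorn-csi-plugin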
I did an upgrade test on microk8s v1.21.0 (v1.21.0-3+121713cef81e03) with commands:
However, I can’t reproduce it. 🤔
I removed those, then let the longhorn-driver-deployer pod run again and it gave that error, presumably putting the deletionTimestamp and finalizers back. Is there something I can do to “kick” it while it’s “Waiting for foreground deletion of service csi-provisioner”? I tried actually deleting the csi-provisioner service, but that didn’t help – and I didn’t see it get recreated, so I put it back.
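One possible way to “kick” a service the deployer is stuck waiting on: deletionTimestamp cannot be cleared through the API, but stripping the finalizers lets the already-pending foreground deletion complete, after which the deployer can recreate the service on its next run. A minimal sketch, assuming the default longhorn-system namespace:

# Clearing the finalizers allows the pending deletion to finish; the service is
# then actually removed, which is exactly what the deployer is waiting for.
kubectl -n longhorn-system patch service csi-provisioner \
  --type=merge -p '{"metadata":{"finalizers":null}}'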
It didn’t finish and is now back doing the same as it originally did; unfortunately I didn’t catch the logs before the container restarted, so I’ll need to try it again when I can keep my terminal open until it dies.
I removed the deletionTimestamp and finalizers; it seems to have made it further, so we’ll see if it finishes this time =]