longhorn: [BUG] "failed to cleanup service csi-attacher: Foreground deletion of service csi-attacher timed out"

Describe the bug

After upgrading to v1.1.1, the longhorn-driver-deployer pod crashes continually and never finishes.

To Reproduce

Sadly, I don’t know of a specific way to reproduce this. I was having issues with my cluster when I upgraded, which didn’t show up until after I started the upgrade: my Calico configuration was using IP addresses from the wrong interface, which caused some really odd issues throughout the cluster, and I ended up force-killing some of the pods. I suspect that contributed to getting into this state, but I have no idea how to resolve the issue at this point.

Expected behavior

Obviously the driver deployer should not crash and should complete as expected =] Since I don’t actually know what it does, I’m not sure what that means in practice, really.

Log

2021/05/03 18:27:36 proto: duplicate proto type registered: VersionResponse
W0503 18:27:36.570867       1 client_config.go:541] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2021-05-03T18:27:36Z" level=debug msg="Deploying CSI driver"
time="2021-05-03T18:27:36Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2021-05-03T18:27:37Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2021-05-03T18:27:38Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2021-05-03T18:27:39Z" level=info msg="Proc found: kubelet"
time="2021-05-03T18:27:39Z" level=info msg="Try to find arg [--root-dir] in cmdline: [/usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime=remote --container-runtime-endpoint=/run/containerd/containerd.sock ]"
time="2021-05-03T18:27:39Z" level=warning msg="Cmdline of proc kubelet found: \"/usr/bin/kubelet\x00--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf\x00--kubeconfig=/etc/kubernetes/kubelet.conf\x00--config=/var/lib/kubelet/config.yaml\x00--container-runtime=remote\x00--container-runtime-endpoint=/run/containerd/containerd.sock\x00\". But arg \"--root-dir\" not found. Hence default value will be used: \"/var/lib/kubelet\""
time="2021-05-03T18:27:39Z" level=info msg="Detected root dir path: /var/lib/kubelet"
time="2021-05-03T18:27:39Z" level=info msg="Upgrading Longhorn related components for CSI v1.1.0"
time="2021-05-03T18:27:39Z" level=debug msg="Deleting existing CSI Driver driver.longhorn.io"
time="2021-05-03T18:27:39Z" level=debug msg="Deleted CSI Driver driver.longhorn.io"
time="2021-05-03T18:27:39Z" level=debug msg="Waiting for foreground deletion of CSI Driver driver.longhorn.io"
time="2021-05-03T18:27:39Z" level=debug msg="Deleted CSI Driver driver.longhorn.io in foreground"
time="2021-05-03T18:27:39Z" level=debug msg="Creating CSI Driver driver.longhorn.io"
time="2021-05-03T18:27:39Z" level=debug msg="Created CSI Driver driver.longhorn.io"
time="2021-05-03T18:27:39Z" level=debug msg="Waiting for foreground deletion of service csi-attacher"
time="2021-05-03T18:29:40Z" level=fatal msg="Error deploying driver: failed to start CSI driver: failed to deploy service csi-attacher: failed to cleanup service csi-attacher: Foreground deletion of service csi-attacher timed out"

Environment:

  • Longhorn version: v1.1.1 (coming from v1.1.1-rc1)
  • Installation method: kubectl
  • Kubernetes distro: kubeadm, v1.21.0
    • Number of management nodes in the cluster: 3
    • Number of worker nodes in the cluster: 4 (plus the 3 management nodes, which are also worker nodes)
  • Node config
    • OS type and version: Ubuntu 20.04.2 LTS
    • CPU per node: varies
    • Memory per node: varies
    • Disk type (e.g. SSD/NVMe): SSD/NVMe, some magnetic disks in RAID 0 (which work well, btw)
    • Network bandwidth between the nodes: 10GbE
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
  • Number of Longhorn volumes in the cluster: ~15

Additional context

Not sure what else would be useful; I am available in the Slack channel for discussion if that would help. I finally have things stable now, but would really like to have this fixed, as I don’t know what issues it will cause.

About this issue

  • Original URL
  • State: open
  • Created 3 years ago
  • Reactions: 1
  • Comments: 21 (10 by maintainers)

Most upvoted comments

I encountered a similar issue with v1.3.1 on k3s 1.24. It’s possible that having FluxCD set to automatically upgrade the Helm chart broke something (I’d originally started with 1.2.3).

Eventually I deleted all of the CSI resources one by one using kubectl until the deployer managed to finish. I ended up having to delete the deployments, ReplicaSets and pods of csi-attacher, csi-provisioner, csi-resizer and csi-snapshotter individually, since the deletion wasn’t cascading for some reason. Then I deleted the associated services, and finally the longhorn-csi-plugin daemonset and pods (again, the cascading delete was somehow broken).
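For anyone hitting the same thing, the cleanup looked roughly like the following. This is only a sketch assuming the default resource names and labels in the longhorn-system namespace; adjust to match your install:

# CSI sidecar workloads (ReplicaSets/pods removed explicitly because the cascading delete was broken)
kubectl -n longhorn-system delete deployment csi-attacher csi-provisioner csi-resizer csi-snapshotter
kubectl -n longhorn-system delete replicaset,pod -l app=csi-attacher
kubectl -n longhorn-system delete replicaset,pod -l app=csi-provisioner
kubectl -n longhorn-system delete replicaset,pod -l app=csi-resizer
kubectl -n longhorn-system delete replicaset,pod -l app=csi-snapshotter
# then the associated services, and finally the plugin daemonset and its pods
kubectl -n longhorn-system delete service csi-attacher csi-provisioner csi-resizer csi-snapshotter
kubectl -n longhorn-system delete daemonset longhorn-csi-plugin
kubectl -n longhorn-system delete pod -l app=longhorn-csi-plugin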

Ultimately I had too many issues with Longhorn; there’s a lot I like about it, but it wasn’t reliable enough for my needs, so I’ve switched to Rook/Ceph.

I did an upgrade test on MicroK8s v1.21.0 (v1.21.0-3+121713cef81e03) with these commands:

kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/v1.1.1-rc1/deploy/longhorn.yaml
kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/v1.1.1/deploy/longhorn.yaml

However, I can’t reproduce it. 🤔

I removed those (the deletionTimestamp and finalizers), then let the longhorn-driver-deployer pod run again and it gave that error, presumably putting the deletionTimestamp and finalizers back.

Is there something I can do to “kick” it while it’s “Waiting for foreground deletion of service csi-provisioner”? I tried actually deleting the csi-provisioner service, but that didn’t help – and I didn’t see it get recreated, so I put it back.
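In case it helps anyone else in the same spot: one way to unstick an object that is hanging in foreground deletion is to clear its foregroundDeletion finalizer by hand, which lets the pending delete complete so the deployer can recreate it. A minimal sketch for the csi-provisioner service (use with care; it skips the normal dependent cleanup):

kubectl -n longhorn-system patch service csi-provisioner --type=json -p='[{"op":"remove","path":"/metadata/finalizers"}]'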

richard@nebrask:~$ k -n longhorn-system logs -f longhorn-driver-deployer-6c945db7f6-gtf2g
2021/05/06 16:09:09 proto: duplicate proto type registered: VersionResponse
W0506 16:09:09.454949       1 client_config.go:541] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2021-05-06T16:09:09Z" level=debug msg="Deploying CSI driver"
time="2021-05-06T16:09:09Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2021-05-06T16:09:10Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2021-05-06T16:09:11Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2021-05-06T16:09:12Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2021-05-06T16:09:13Z" level=info msg="Proc found: kubelet"
time="2021-05-06T16:09:13Z" level=info msg="Try to find arg [--root-dir] in cmdline: [/usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime=remote --container-runtime-endpoint=/run/containerd/containerd.sock ]"
time="2021-05-06T16:09:13Z" level=warning msg="Cmdline of proc kubelet found: \"/usr/bin/kubelet\x00--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf\x00--kubeconfig=/etc/kubernetes/kubelet.conf\x00--config=/var/lib/kubelet/config.yaml\x00--container-runtime=remote\x00--container-runtime-endpoint=/run/containerd/containerd.sock\x00\". But arg \"--root-dir\" not found. Hence default value will be used: \"/var/lib/kubelet\""
time="2021-05-06T16:09:13Z" level=info msg="Detected root dir path: /var/lib/kubelet"
time="2021-05-06T16:09:13Z" level=info msg="Upgrading Longhorn related components for CSI v1.1.0"
time="2021-05-06T16:09:13Z" level=debug msg="Deleting existing CSI Driver driver.longhorn.io"
time="2021-05-06T16:09:13Z" level=debug msg="Deleted CSI Driver driver.longhorn.io"
time="2021-05-06T16:09:13Z" level=debug msg="Waiting for foreground deletion of CSI Driver driver.longhorn.io"
time="2021-05-06T16:09:13Z" level=debug msg="Deleted CSI Driver driver.longhorn.io in foreground"
time="2021-05-06T16:09:13Z" level=debug msg="Creating CSI Driver driver.longhorn.io"
time="2021-05-06T16:09:13Z" level=debug msg="Created CSI Driver driver.longhorn.io"
time="2021-05-06T16:09:13Z" level=debug msg="Deleting existing service csi-attacher"
time="2021-05-06T16:09:13Z" level=debug msg="Deleted service csi-attacher"
time="2021-05-06T16:09:13Z" level=debug msg="Waiting for foreground deletion of service csi-attacher"
time="2021-05-06T16:09:53Z" level=debug msg="Deleted service csi-attacher in foreground"
time="2021-05-06T16:09:53Z" level=debug msg="Creating service csi-attacher"
time="2021-05-06T16:09:53Z" level=debug msg="Created service csi-attacher"
time="2021-05-06T16:09:53Z" level=debug msg="Deleting existing deployment csi-attacher"
time="2021-05-06T16:09:53Z" level=debug msg="Deleted deployment csi-attacher"
time="2021-05-06T16:09:53Z" level=debug msg="Waiting for foreground deletion of deployment csi-attacher"
time="2021-05-06T16:10:05Z" level=debug msg="Deleted deployment csi-attacher in foreground"
time="2021-05-06T16:10:05Z" level=debug msg="Creating deployment csi-attacher"
time="2021-05-06T16:10:05Z" level=debug msg="Created deployment csi-attacher"
time="2021-05-06T16:10:05Z" level=debug msg="Deleting existing service csi-provisioner"
time="2021-05-06T16:10:05Z" level=debug msg="Deleted service csi-provisioner"
time="2021-05-06T16:10:05Z" level=debug msg="Waiting for foreground deletion of service csi-provisioner"
time="2021-05-06T16:12:06Z" level=fatal msg="Error deploying driver: failed to start CSI driver: failed to deploy service csi-provisioner: failed to cleanup service csi-provisioner: Foreground deletion of service csi-provisioner timed out"

It didn’t finish and is now back doing the same as it originally did; unfortunately I didn’t catch the logs before the container restarted, so I’ll need to try it again when I can keep my terminal open until it dies.

I removed the deletionTimestamp and finalizers; it seems to have made it further, so we’ll see if it finishes this time =] For reference, this is what the stuck service looked like before editing:

apiVersion: v1
kind: Service
metadata:
  annotations:
    driver.longhorn.io/kubernetes-version: v1.20.5
    driver.longhorn.io/version: v1.1.1
  creationTimestamp: "2021-05-02T09:27:02Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2021-05-02T10:55:18Z"
  finalizers:
  - foregroundDeletion
  labels:
    app: csi-attacher
    longhorn.io/managed-by: longhorn-manager
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:driver.longhorn.io/kubernetes-version: {}
          f:driver.longhorn.io/version: {}
        f:labels:
          .: {}
          f:app: {}
          f:longhorn.io/managed-by: {}
      f:spec:
        f:ports:
          .: {}
          k:{"port":12345,"protocol":"TCP"}:
            .: {}
            f:name: {}
            f:port: {}
            f:protocol: {}
            f:targetPort: {}
        f:selector:
          .: {}
          f:app: {}
        f:sessionAffinity: {}
        f:type: {}
    manager: longhorn-manager
    operation: Update
    time: "2021-05-02T09:27:01Z"
  name: csi-attacher
  namespace: longhorn-system
  resourceVersion: "108540171"
  uid: f5ad7a41-8e0f-4e7f-a3d8-7b2aeb8d043b
spec:
  clusterIP: 10.109.227.49
  clusterIPs:
  - 10.109.227.49
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: dummy
    port: 12345
    protocol: TCP
    targetPort: 12345
  selector:
    app: csi-attacher
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
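The tell-tale bits above are the deletionTimestamp combined with the foregroundDeletion finalizer that never clears. A quick way to see which of the Longhorn-managed services are stuck like this (sketch, assuming the default longhorn-system namespace and the longhorn.io/managed-by label shown above):

kubectl -n longhorn-system get services -l longhorn.io/managed-by=longhorn-manager \
  -o custom-columns=NAME:.metadata.name,DELETION:.metadata.deletionTimestamp,FINALIZERS:.metadata.finalizers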