longhorn: [BUG] Longhorn 1.2.0 unable to handle instance-manager failure

Describe the bug A couple of days ago I started observing some Pod failures. Only Pods using Longhorn block storage were affected. After some debugging, it looks like k8s/Longhorn is not able to handle an instance-manager failure properly, even when the StorageClass has more than 1 replica.

When an instance-manager dies, the Pods attached to it are not able to recover.

The Pod stays in CrashLoopBackOff because of I/O errors from the volume.

To Reproduce Steps to reproduce the behavior:

  1. Deploy a Pod that needs a PV to start properly (e.g. a config file on the PV)
  2. Kill the instance-manager serving the Pod's PV (kubectl delete pod -n longhorn-system --force --grace-period=0 instance-manager-…); see the reproduction sketch after this list
  3. Wait until a new instance-manager is spawned
  4. Check whether the Pod recovered
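
A minimal reproduction sketch, assuming the "debugpod" Deployment and the Longhorn-backed "test-data" PVC shown later in this issue (names are illustrative):

# 1. Find the workload Pod and the Longhorn volume backing it
kubectl get pod -l app.kubernetes.io/name=debugpod -o wide
kubectl get pv | grep test-data

# 2. Force-kill the instance-manager Pod serving that volume
kubectl get pods -n longhorn-system -o wide | grep instance-manager
kubectl delete pod -n longhorn-system --force --grace-period=0 instance-manager-<id>

# 3. Wait until a replacement instance-manager Pod is Running
kubectl get pods -n longhorn-system -w | grep instance-manager

# 4. Check whether the workload Pod recovered (in the failing case it ends up in CrashLoopBackOff)
kubectl get pod -l app.kubernetes.io/name=debugpod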

Expected behavior After an instance-manager failure, the Pod should be restarted and able to use the volume.

Environment:

  • Longhorn version: 1.2.0
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: k3s v1.21.4+k3s1
    • Number of management nodes in the cluster: 3
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: Ubuntu 20.04
    • CPU per node: Intel i3/i5
    • Memory per node: 8GB
    • Disk type (e.g. SSD/NVMe): SSD
    • Network bandwidth between the nodes: 1Gbit
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
  • Number of Longhorn volumes in the cluster: ~10

Relevant Longhorn settings:

  • Automatically Delete Workload Pod when The Volume Is Detached Unexpectedly: true
  • Pod Deletion Policy When Node is Down: delete-both-statefulset-and-deployment-pod
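
To confirm how these are set on a live cluster, the values can be read from the Longhorn settings CRD (a hedged sketch; the exact setting resource names, assumed here to be auto-delete-pod-when-volume-detached-unexpectedly and node-down-pod-deletion-policy, may differ between Longhorn versions):

# List Longhorn settings and filter for the two settings above
kubectl -n longhorn-system get settings.longhorn.io | grep -E 'detach|deletion'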

Additional context It looks like some race condition. On the HV I can see these I/O errors, so it looks like the OS still has the volume mounted from the failed instance-manager.

root@k3s-delta:~# crictl inspect 22644f59333cd |grep volumes
        "hostPath": "/var/lib/kubelet/pods/37667d1d-522d-4dd2-8dc7-38506cda3347/volumes/kubernetes.io~csi/pvc-eb67d196-a204-442f-a4a4-2bb0c127572b/mount",
        "hostPath": "/var/lib/kubelet/pods/37667d1d-522d-4dd2-8dc7-38506cda3347/volumes/kubernetes.io~projected/kube-api-access-qvrbx",
          "host_path": "/var/lib/kubelet/pods/37667d1d-522d-4dd2-8dc7-38506cda3347/volumes/kubernetes.io~csi/pvc-eb67d196-a204-442f-a4a4-2bb0c127572b/mount"
          "host_path": "/var/lib/kubelet/pods/37667d1d-522d-4dd2-8dc7-38506cda3347/volumes/kubernetes.io~projected/kube-api-access-qvrbx",
          "source": "/var/lib/kubelet/pods/37667d1d-522d-4dd2-8dc7-38506cda3347/volumes/kubernetes.io~csi/pvc-eb67d196-a204-442f-a4a4-2bb0c127572b/mount",
          "source": "/var/lib/kubelet/pods/37667d1d-522d-4dd2-8dc7-38506cda3347/volumes/kubernetes.io~projected/kube-api-access-qvrbx",

root@k3s-delta:~# ls /var/lib/kubelet/pods/37667d1d-522d-4dd2-8dc7-38506cda3347/volumes/kubernetes.io~csi/pvc-eb67d196-a204-442f-a4a4-2bb0c127572b/mount
ls: reading directory '/var/lib/kubelet/pods/37667d1d-522d-4dd2-8dc7-38506cda3347/volumes/kubernetes.io~csi/pvc-eb67d196-a204-442f-a4a4-2bb0c127572b/mount': Input/output error

root@k3s-delta:~# df -h |grep eb67d196
/dev/longhorn/pvc-eb67d196-a204-442f-a4a4-2bb0c127572b  2.0G  351M  1.6G  19% /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-eb67d196-a204-442f-a4a4-2bb0c127572b/globalmount

If you delete the failed Pod, nothing changes. To fix the problem I need to scale the deployment to 0 and after that scale it back to the previous number. I would guess that scaling performs some cleanup on the HV.
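
A sketch of that workaround, with a placeholder deployment name (replace "debugpod" with the affected Deployment):

# Scale the affected workload down so kubelet tears down its mounts and Longhorn detaches the volume
kubectl scale deployment debugpod --replicas=0
# Wait until the Pod is gone and Longhorn shows the volume as detached
kubectl get pods -w
# Scale back up; the fresh attach/stage/mount replaces the stale mount on the host
kubectl scale deployment debugpod --replicas=1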

Volume status after deployment scale=0 -> scale=1

root@k3s-delta:~# crictl inspect c6ccfe69eab33 |grep volume
        "hostPath": "/var/lib/kubelet/pods/95e77a5e-e0a2-4c38-bfb2-23e2dea9c1dc/volumes/kubernetes.io~csi/pvc-eb67d196-a204-442f-a4a4-2bb0c127572b/mount",
        "hostPath": "/var/lib/kubelet/pods/95e77a5e-e0a2-4c38-bfb2-23e2dea9c1dc/volumes/kubernetes.io~projected/kube-api-access-6jzb7",
          "host_path": "/var/lib/kubelet/pods/95e77a5e-e0a2-4c38-bfb2-23e2dea9c1dc/volumes/kubernetes.io~csi/pvc-eb67d196-a204-442f-a4a4-2bb0c127572b/mount"
          "host_path": "/var/lib/kubelet/pods/95e77a5e-e0a2-4c38-bfb2-23e2dea9c1dc/volumes/kubernetes.io~projected/kube-api-access-6jzb7",
          "source": "/var/lib/kubelet/pods/95e77a5e-e0a2-4c38-bfb2-23e2dea9c1dc/volumes/kubernetes.io~csi/pvc-eb67d196-a204-442f-a4a4-2bb0c127572b/mount",
          "source": "/var/lib/kubelet/pods/95e77a5e-e0a2-4c38-bfb2-23e2dea9c1dc/volumes/kubernetes.io~projected/kube-api-access-6jzb7",

root@k3s-delta:~# ls /var/lib/kubelet/pods/95e77a5e-e0a2-4c38-bfb2-23e2dea9c1dc/volumes/kubernetes.io~csi/pvc-eb67d196-a204-442f-a4a4-2bb0c127572b/mount
[some files from volume, without IO errors]

root@k3s-delta:~# df -h |grep eb67d196
/dev/longhorn/pvc-eb67d196-a204-442f-a4a4-2bb0c127572b  2.0G  351M  1.6G  19% /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-eb67d196-a204-442f-a4a4-2bb0c127572b/globalmount

Most upvoted comments

This issue is probably a side effect of introducing the CSI Node Staging Capability: when Longhorn detects a volume crash caused by instance-manager pod deletion, it restarts the workload pod as long as the corresponding setting is enabled. The deployment pod is probably restarted on the same node, hence Kubernetes would not issue any unstage or detach call for the corresponding volume during the old pod cleanup. Instead, it directly umounts the old path and issues mount calls only for the new pod. But here the staging path, which is the mount point, is already corrupted, so the mount calls (between the staging path and the volume mount point) for the new pod are meaningless/corrupted as well.
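
To see the two layers involved on the node (the shared CSI staging mount at …/globalmount and the per-pod publish mount created from it), something like the following can be used; the PVC name is taken from the output earlier in this issue:

# Show both mounts of the Longhorn block device: the staging path (…/globalmount)
# and the per-pod publish path (…/volumes/kubernetes.io~csi/…/mount)
findmnt -o TARGET,SOURCE,FSTYPE | grep pvc-eb67d196
# Per the diagnosis above, in the broken state the staging mount itself is already corrupted,
# so reads there should fail as well
ls /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-eb67d196-a204-442f-a4a4-2bb0c127572b/globalmount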

I tried to add validation for the staging path in the Longhorn CSI NodePublishVolume API based on the spec doc, but kubelet does not behave as described in the doc (re-sending NodeStageVolume calls after receiving the specific error code). We need to find a way to ask kubelet to re-do the staging. Currently I cannot find a way to make kubelet issue the unstage and then stage calls in this case.

cc @joshimoo

Close this issue. Verified on master head 20220503:

  1. This case passed in the e2e master branch pipeline
  2. This case passed in the e2e v1.2.x branch pipeline
  3. Manually verified as passing in a local environment

I could reproduce it even without --force --grace-period=0.

The deployment Pods are restarted by longhorn-manager, but even after the Pod restart, the Pod is still in an I/O error state.
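
A quick way to confirm this from inside the restarted Pod, assuming the debugpod example from this issue:

# Any read on the mounted path should return "Input/output error" in the broken state
kubectl exec -it $(kubectl get pod -l app.kubernetes.io/name=debugpod -o name | head -n 1) -- ls /data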

My env is v1.2.1-rc2.

Update: it's not reproducible on v1.1.2.

@jenting

  1. Not really, as I have already migrated my storage from Longhorn to Ceph
  2. I had this issue with any Pod that required a PV to start successfully. This manifest should work for you:
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: debugpod
  labels:
    app.kubernetes.io/name: debugpod
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app.kubernetes.io/name: debugpod
  template:
    metadata:
      labels:
        app.kubernetes.io/name: debugpod
    spec:
      containers:
      - name: debugpod
        image: ubuntu:latest
        command:
          - "/bin/sh"
          - "-ec"
          - |
            touch /data/test
            tail -f /data/test
        volumeMounts:
          - mountPath: /data
            name: data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: test-data
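
The manifest above references a PVC named test-data that isn't shown; a minimal claim it could be paired with might look like the following (the 2Gi size and the "longhorn" storageClassName are assumptions, adjust to your Longhorn StorageClass):

# Hypothetical PVC backing the debugpod Deployment above
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 2Gi
EOF
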
  1. HV = hypervisor, the bare-metal node on which k8s is running
  2. No

@joshimoo To be honest, I didn't try to gracefully delete the instance-manager Pod. I was testing what happens to my workloads when the instance-manager is unexpectedly killed. All Deployments with this issue had 1 Pod.

When I find free time I will try to bring up a second k8s cluster with Longhorn and provide more data; I can't promise when exactly. Right now I would suggest just spawning one Pod as described in this ticket and killing the instance-manager - you should be able to reproduce it.

If interested, my whole k8s setup is available as an Ansible playbook: https://github.com/bkupidura/home-k8s. I would expect that if you deploy vanilla Ubuntu, provision k3s, and render the Longhorn manifest, you will end up with an exact copy of my environment.