longhorn: [BUG] Restarting kubelet and upgrading Longhorn: pod gets stuck in Terminating

Describe the bug


The pod with a mounted PVC gets stuck in Terminating after the kubelet is restarted, Longhorn is upgraded from 1.2.2 to 1.3.2, and a deployment mounting a PVC is changed.

Some background: this cluster has a single node, so 2 of 3 replicas are unschedulable.

To Reproduce

Steps to reproduce the behavior:

  1. Restart the kubelet.
  2. Upgrade Longhorn from 1.2.2 to 1.3.2.
  3. Change the deployment that mounts the PVC (see the command sketch below).
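
A hedged command sketch of the reproduction flow on the single-node cluster; the Longhorn manifest URL follows the usual kubectl install pattern, and the namespace/deployment names are taken from the minio example further down, so treat them as illustrative rather than the exact commands used.

# 1. Restart the kubelet
sudo systemctl restart kubelet
# 2. Upgrade Longhorn via kubectl (manifest URL assumed from the standard install docs)
kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/v1.3.2/deploy/longhorn.yaml
# 3. Change the deployment mounting the PVC; any change that creates a new ReplicaSet will do
kubectl -n minio rollout restart deployment/minio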

Expected behavior


The pod should not get stuck in Terminating.

Log or Support bundle


longhorn-support-bundle_fb082841-3071-4192-9730-c69ffc6589d7_2022-10-25T20-04-44Z.zip

$ kubectl -n minio get pods
NAME                     READY   STATUS        RESTARTS   AGE
minio-6b85575bf4-s5vwv   0/1     Terminating   0          20m
$ kubectl -n minio describe pod minio-6b85575bf4-s5vwv
Name:                      minio-6b85575bf4-s5vwv
Namespace:                 minio
Events:
  Type     Reason                  Age                From                     Message
  ----     ------                  ----               ----                     -------
  Warning  FailedScheduling        20m (x3 over 20m)  default-scheduler        0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
  Normal   Scheduled               20m                default-scheduler        Successfully assigned minio/minio-6b85575bf4-s5vwv to ethanm-longhorn-2
  Normal   SuccessfulAttachVolume  20m                attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c"
  Normal   Pulled                  20m                kubelet                  Container image "minio/minio:RELEASE.2020-01-25T02-50-51Z" already present on machine
  Normal   Created                 20m                kubelet                  Created container minio
  Normal   Started                 20m                kubelet                  Started container minio
  Warning  FailedMount             14m                kubelet                  MountVolume.MountDevice failed for volume "pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c" : rpc error: code = Internal desc = Get "http://longhorn-backend:9500/v1/volumes/pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c": dial tcp 10.96.0.199:9500: connect: connection refused
  Warning  FailedMount             13m (x6 over 14m)  kubelet                  MountVolume.MountDevice failed for volume "pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c" : rpc error: code = InvalidArgument desc = volume pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c hasn't been attached yet
  Warning  FailedMount             13m                kubelet                  MountVolume.SetUp failed for volume "pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c" : rpc error: code = Internal desc = NodePublishVolume: failed to prepare mount point for volume pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c error unmounted existing corrupt mount point /var/lib/kubelet/pods/2c774bfe-7dca-470d-888f-5fab54af6b3a/volumes/kubernetes.io~csi/pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c/mount
  Warning  FailedMount             12m                kubelet                  Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[kube-api-access-xhx2m data]: timed out waiting for the condition
  Normal   Killing                 12m                kubelet                  Stopping container minio

kubelet.log

Oct 25 20:01:08 ethanm-longhorn-2 kubelet[40684]: E1025 20:01:08.145253   40684 reconciler.go:198] "operationExecutor.UnmountVolume failed (controllerAttachDetachEnabled true) for volume \"pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c\" (UniqueName: \"kubernetes.io/csi/driver.longhorn.io^pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c\") pod \"2c774bfe-7dca-470d-888f-5fab54af6b3a\" (UID: \"2c774bfe-7dca-470d-888f-5fab54af6b3a\") : UnmountVolume.NewUnmounter failed for volume \"pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c\" (UniqueName: \"kubernetes.io/csi/driver.longhorn.io^pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c\") pod \"2c774bfe-7dca-470d-888f-5fab54af6b3a\" (UID: \"2c774bfe-7dca-470d-888f-5fab54af6b3a\") : kubernetes.io/csi: unmounter failed to load volume data file [/var/lib/kubelet/pods/2c774bfe-7dca-470d-888f-5fab54af6b3a/volumes/kubernetes.io~csi/pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c/mount]: kubernetes.io/csi: failed to open volume data file [/var/lib/kubelet/pods/2c774bfe-7dca-470d-888f-5fab54af6b3a/volumes/kubernetes.io~csi/pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c/vol_data.json]: open /var/lib/kubelet/pods/2c774bfe-7dca-470d-888f-5fab54af6b3a/volumes/kubernetes.io~csi/pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c/vol_data.json: no such file or directory" err="UnmountVolume.NewUnmounter failed for volume \"pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c\" (UniqueName: \"kubernetes.io/csi/driver.longhorn.io^pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c\") pod \"2c774bfe-7dca-470d-888f-5fab54af6b3a\" (UID: \"2c774bfe-7dca-470d-888f-5fab54af6b3a\") : kubernetes.io/csi: unmounter failed to load volume data file [/var/lib/kubelet/pods/2c774bfe-7dca-470d-888f-5fab54af6b3a/volumes/kubernetes.io~csi/pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c/mount]: kubernetes.io/csi: failed to open volume data file [/var/lib/kubelet/pods/2c774bfe-7dca-470d-888f-5fab54af6b3a/volumes/kubernetes.io~csi/pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c/vol_data.json]: open /var/lib/kubelet/pods/2c774bfe-7dca-470d-888f-5fab54af6b3a/volumes/kubernetes.io~csi/pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c/vol_data.json: no such file or directory"
$ sudo ls /var/lib/kubelet/pods/2c774bfe-7dca-470d-888f-5fab54af6b3a/volumes/kubernetes.io~csi/
$
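
A small diagnostic sketch, assuming the default kubelet root dir, that lists per-pod CSI volume directories missing the vol_data.json file the unmounter fails to open above. In this instance the entire PVC directory under kubernetes.io~csi/ is already gone (see the empty ls output), so for this pod the loop finds nothing to report.

# Flag CSI volume directories that exist but lack vol_data.json
for d in /var/lib/kubelet/pods/*/volumes/kubernetes.io~csi/*/; do
  [ -d "$d" ] || continue   # skip the literal glob when nothing matches
  [ -f "${d}vol_data.json" ] || echo "missing vol_data.json: ${d}"
done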

Environment

  • Longhorn version: 1.2.2 to 1.3.2
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: kubeadm
    • Number of management node in the cluster: 1
    • Number of worker node in the cluster: 0
  • Node config
    • OS type and version:
    • CPU per node:
    • Memory per node:
    • Disk type(e.g. SSD/NVMe):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster: 1

Additional context


About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Comments: 20 (8 by maintainers)

Most upvoted comments

I am able to repro 2/3 times with the attached gist.

https://gist.github.com/emosbaugh/d1a50c2f749d9ec30c2d4ab085a52b68

Perhaps the test case is invalid, since I can only reproduce this by re-running kubeadm init and I'm not sure that is supported, although the cluster is not recreated and continues to work as expected apart from Longhorn. I cannot reproduce it by simply restarting the kubelet.

sudo kubeadm init --config=kubeadm.conf --ignore-preflight-errors=all
ethan@ethanm-longhorn-8:~$ kubectl get pod
NAME                               READY   STATUS        RESTARTS   AGE
nginx-deployment-9978d657c-s4w25   0/1     Terminating   0          16m
ethan@ethanm-longhorn-8:~$ uname -r
5.15.0-1021-gcp
ethan@ethanm-longhorn-8:~$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.5 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.5 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

Ran into this vol_data.json missing issue again, for the third time now. This is a critical issue. Please note it has nothing to do with upgrading Longhorn as the ticket title suggests; it happens randomly over time.
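
A possible temporary unblocking step (my assumption, not something confirmed in this thread): force-remove the stuck pod object from the API server. This only clears the pod record; it does not repair the kubelet's mount state or explain the missing vol_data.json.

# Force-delete the stuck pod; the underlying vol_data.json problem remains
kubectl -n minio delete pod minio-6b85575bf4-s5vwv --grace-period=0 --force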

Hello, I also encountered this issue twice in a week: pods not being deleted and logs flooded with the vol_data.json missing message. So it’s still an active issue.

3 Ubuntu nodes, MicroK8s 1.27.5, Longhorn 1.5.1 installed via Helm.

I’ll point to kubernetes/kubernetes#116847; if you look near the end (under my name) I attached a 5-minute window of the kubelite process log from when the error occurred, and user @gnufied pointed out two lines in the logs that seem to point to bad behavior coming from Longhorn.
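
A hedged sketch for pulling a similar log window locally; the unit names are assumptions (a plain kubelet systemd unit vs. the MicroK8s kubelite snap service) and may differ per install.

# Plain kubelet install (unit name assumed to be "kubelet")
journalctl -u kubelet --since "1 hour ago" | grep vol_data.json
# MicroK8s kubelite (snap unit name is an assumption)
journalctl -u snap.microk8s.daemon-kubelite --since "1 hour ago" | grep vol_data.json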