longhorn: [BUG] After restarting kubelet and upgrading Longhorn, pod gets stuck in Terminating
Describe the bug
A pod with a mounted PVC gets stuck in Terminating after the kubelet is restarted, Longhorn is upgraded from 1.2.2 to 1.3.2, and the deployment mounting the PVC is changed.
Some background: there is only one node in this cluster, so 2 of the 3 volume replicas are unschedulable.
To Reproduce
Steps to reproduce the behavior:
- Restart the kubelet
- Upgrade Longhorn from 1.2.2 to 1.3.2
- Change the deployment that mounts the PVC (a rough command sketch follows below)
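For reference, a command sketch of these steps, assuming a systemd-managed kubelet and the kubectl installation method; the manifest URL and the particular deployment change shown here are illustrative assumptions, not copied verbatim from what I ran:
$ sudo systemctl restart kubelet
$ kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/v1.3.2/deploy/longhorn.yaml    # upgrade 1.2.2 -> 1.3.2
$ kubectl -n minio patch deployment minio -p '{"spec":{"template":{"metadata":{"annotations":{"redeploy":"1"}}}}}'    # any spec change that triggers a rollout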
Expected behavior
The pod should not get stuck in Terminating.
Log or Support bundle
longhorn-support-bundle_fb082841-3071-4192-9730-c69ffc6589d7_2022-10-25T20-04-44Z.zip
$ kubectl -n minio get pods
NAME READY STATUS RESTARTS AGE
minio-6b85575bf4-s5vwv 0/1 Terminating 0 20m
$ kubectl -n minio describe pod minio-6b85575bf4-s5vwv
Name: minio-6b85575bf4-s5vwv
Namespace: minio
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 20m (x3 over 20m) default-scheduler 0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
Normal Scheduled 20m default-scheduler Successfully assigned minio/minio-6b85575bf4-s5vwv to ethanm-longhorn-2
Normal SuccessfulAttachVolume 20m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c"
Normal Pulled 20m kubelet Container image "minio/minio:RELEASE.2020-01-25T02-50-51Z" already present on machine
Normal Created 20m kubelet Created container minio
Normal Started 20m kubelet Started container minio
Warning FailedMount 14m kubelet MountVolume.MountDevice failed for volume "pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c" : rpc error: code = Internal desc = Get "http://longhorn-backend:9500/v1/volumes/pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c": dial tcp 10.96.0.199:9500: connect: connection refused
Warning FailedMount 13m (x6 over 14m) kubelet MountVolume.MountDevice failed for volume "pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c" : rpc error: code = InvalidArgument desc = volume pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c hasn't been attached yet
Warning FailedMount 13m kubelet MountVolume.SetUp failed for volume "pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c" : rpc error: code = Internal desc = NodePublishVolume: failed to prepare mount point for volume pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c error unmounted existing corrupt mount point /var/lib/kubelet/pods/2c774bfe-7dca-470d-888f-5fab54af6b3a/volumes/kubernetes.io~csi/pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c/mount
Warning FailedMount 12m kubelet Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[kube-api-access-xhx2m data]: timed out waiting for the condition
Normal Killing 12m kubelet Stopping container minio
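For completeness, these are the kinds of checks one could run to confirm the attach state from the Kubernetes and Longhorn side; the commands below are a suggestion (assuming the default longhorn-system namespace), not output captured here:
$ kubectl get volumeattachments | grep pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c
$ kubectl -n longhorn-system get volumes.longhorn.io pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c -o wide
$ kubectl -n longhorn-system get engines.longhorn.io,replicas.longhorn.io | grep pvc-75cd9679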
kubelet.log
Oct 25 20:01:08 ethanm-longhorn-2 kubelet[40684]: E1025 20:01:08.145253 40684 reconciler.go:198] "operationExecutor.UnmountVolume failed (controllerAttachDetachEnabled true) for volume \"pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c\" (UniqueName: \"kubernetes.io/csi/driver.longhorn.io^pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c\") pod \"2c774bfe-7dca-470d-888f-5fab54af6b3a\" (UID: \"2c774bfe-7dca-470d-888f-5fab54af6b3a\") : UnmountVolume.NewUnmounter failed for volume \"pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c\" (UniqueName: \"kubernetes.io/csi/driver.longhorn.io^pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c\") pod \"2c774bfe-7dca-470d-888f-5fab54af6b3a\" (UID: \"2c774bfe-7dca-470d-888f-5fab54af6b3a\") : kubernetes.io/csi: unmounter failed to load volume data file [/var/lib/kubelet/pods/2c774bfe-7dca-470d-888f-5fab54af6b3a/volumes/kubernetes.io~csi/pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c/mount]: kubernetes.io/csi: failed to open volume data file [/var/lib/kubelet/pods/2c774bfe-7dca-470d-888f-5fab54af6b3a/volumes/kubernetes.io~csi/pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c/vol_data.json]: open /var/lib/kubelet/pods/2c774bfe-7dca-470d-888f-5fab54af6b3a/volumes/kubernetes.io~csi/pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c/vol_data.json: no such file or directory" err="UnmountVolume.NewUnmounter failed for volume \"pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c\" (UniqueName: \"kubernetes.io/csi/driver.longhorn.io^pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c\") pod \"2c774bfe-7dca-470d-888f-5fab54af6b3a\" (UID: \"2c774bfe-7dca-470d-888f-5fab54af6b3a\") : kubernetes.io/csi: unmounter failed to load volume data file [/var/lib/kubelet/pods/2c774bfe-7dca-470d-888f-5fab54af6b3a/volumes/kubernetes.io~csi/pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c/mount]: kubernetes.io/csi: failed to open volume data file [/var/lib/kubelet/pods/2c774bfe-7dca-470d-888f-5fab54af6b3a/volumes/kubernetes.io~csi/pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c/vol_data.json]: open /var/lib/kubelet/pods/2c774bfe-7dca-470d-888f-5fab54af6b3a/volumes/kubernetes.io~csi/pvc-75cd9679-a0ea-4801-8ff6-f53faae2676c/vol_data.json: no such file or directory"
$ sudo ls /var/lib/kubelet/pods/2c774bfe-7dca-470d-888f-5fab54af6b3a/volumes/kubernetes.io~csi/
$
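The empty listing above shows that the per-pod CSI volume directory (and its vol_data.json) has been removed entirely, which matches the UnmountVolume.NewUnmounter error in the kubelet log. As a rough way to spot other volumes drifting into the same state (my own suggestion, not taken from the support bundle), one could flag any surviving per-volume directory that has lost its vol_data.json:
$ sudo sh -c 'for d in /var/lib/kubelet/pods/*/volumes/kubernetes.io~csi/*/; do
    [ -f "${d}vol_data.json" ] || echo "missing vol_data.json: ${d}"
  done'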
Environment
- Longhorn version: 1.2.2, upgraded to 1.3.2
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: kubeadm
- Number of management node in the cluster: 1
- Number of worker node in the cluster: 0
- Node config
- OS type and version:
- CPU per node:
- Memory per node:
- Disk type(e.g. SSD/NVMe):
- Network bandwidth between the nodes:
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
- Number of Longhorn volumes in the cluster: 1
Additional context
About this issue
- State: open
- Created 2 years ago
- Comments: 20 (8 by maintainers)
I am able to reproduce it roughly 2 out of 3 times with the gist linked below.
https://gist.github.com/emosbaugh/d1a50c2f749d9ec30c2d4ab085a52b68
Perhaps the test case is invalid, since I can only reproduce by re-running kubeadm init, and I'm not sure whether that is supported; the cluster is not recreated, though, and continues to work as expected apart from Longhorn. I cannot reproduce by simply restarting the kubelet.
Ran into this vol_data.json missing issue again, for the third time now. This is a critical issue. Please note it has nothing to do with upgrading Longhorn as the ticket title suggests; it happens randomly over time.
Hello, also encountered this issue (twice) in one week: pods not being deleted and logs flooded with the vol_data.json missing message. So it's still an active issue. 3 Ubuntu nodes, microk8s 1.27.5, Longhorn 1.5.1 installed via Helm.
I'll point to kubernetes/kubernetes#116847; if you look near the end (my name), I attached a 5-minute window of logs from the kubelite process when the error occurred, and user @gnufied pointed out two lines in the logs that seem to indicate bad behavior coming from Longhorn.