longhorn: [BUG] Unable to attach or mount volumes: unmounted volumes=[volv], unattached volumes=[volv kube-api-access-4tqrk]: timed out waiting for the condition (duplicated default IM-R)

Describe the bug

A pod (volume-test) using a Longhorn-backed PVC never starts: the volume fails to attach to node release-worker01 and the kubelet times out waiting for it to be mounted. The maintainers later traced this to duplicated default replica instance managers on that node (see the comments below).

To Reproduce

Steps to reproduce the behavior:

  1. Created a PersistentVolumeClaim and a test pod using the Longhorn StorageClass (a sketch of the manifests follows the events below).
  2. The pod fails to start with the following events:
Events:
  Type     Reason              Age               From                     Message
  ----     ------              ----              ----                     -------
  Normal   Scheduled           2m45s             default-scheduler        Successfully assigned default/volume-test to release-worker01
  Warning  FailedMount         43s               kubelet                  Unable to attach or mount volumes: unmounted volumes=[volv], unattached volumes=[volv kube-api-access-4tqrk]: timed out waiting for the condition
  Warning  FailedAttachVolume  5s (x8 over 73s)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-de440081-ecb8-4ab2-b5da-9aea75b19003" : rpc error: code = DeadlineExceeded desc = volume pvc-de440081-ecb8-4ab2-b5da-9aea75b19003 failed to attach to node release-worker01
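A minimal sketch of the manifests presumably used, reconstructed from the names in the events and the kubectl output below (they match Longhorn's pod_with_pvc example; the container image and mount path are assumptions):

  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: longhorn-volv-pvc
  spec:
    accessModes:
      - ReadWriteOnce           # matches the RWO PVC shown below
    storageClassName: longhorn  # Longhorn StorageClass from the PV listing below
    resources:
      requests:
        storage: 2Gi
  ---
  apiVersion: v1
  kind: Pod
  metadata:
    name: volume-test
  spec:
    containers:
      - name: volume-test
        image: nginx:stable-alpine     # assumption; any image that mounts the volume reproduces the problem
        volumeMounts:
          - name: volv                 # volume name from the FailedMount event
            mountPath: /data           # assumed mount path
    volumes:
      - name: volv
        persistentVolumeClaim:
          claimName: longhorn-volv-pvc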

Expected behavior

The pod should attach and mount the Longhorn volume and start successfully.

Log or Support bundle

kubectl describe pod volume-test
(events identical to those shown in the reproduction steps above)
---
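For reference, the environment check output below comes from Longhorn's environment_check.sh script, which can be run roughly like this (repository path and release tag are assumptions):

  # download and run Longhorn's environment check against the current kubeconfig
  curl -sSfL https://raw.githubusercontent.com/longhorn/longhorn/v1.2.3/scripts/environment_check.sh | bash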
daemonset.apps/longhorn-environment-check created
waiting for pods to become ready (0/3)
waiting for pods to become ready (0/3)
waiting for pods to become ready (0/3)
waiting for pods to become ready (0/3)
waiting for pods to become ready (0/3)
all pods ready (3/3)

  MountPropagation is enabled!

cleaning up...
daemonset.apps "longhorn-environment-check" deleted
clean up complete
---
[root@release-master engine-binaries]# kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                       STORAGECLASS   REASON   AGE
sopei-log                                  4Gi        RWX            Retain           Bound    sopei-biz/sopei-log         sopei-log               8d
pvc-de440081-ecb8-4ab2-b5da-9aea75b19003   2Gi        RWO            Delete           Bound    default/longhorn-volv-pvc   longhorn                7m53s
[root@release-master engine-binaries]# kubectl get pvc
NAME                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
longhorn-volv-pvc   Bound    pvc-de440081-ecb8-4ab2-b5da-9aea75b19003   2Gi        RWO            longhorn       7m58s


Environment

  • Longhorn version:
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Rancher Catalog App
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K3s v1.21.7+k3s1
    • Number of management nodes in the cluster:
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: CentOS 7.9
    • CPU per node: 4
    • Memory per node: 8 GB
    • Disk type(e.g. SSD/NVMe): NVMe
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Tencent Cloud
  • Number of Longhorn volumes in the cluster: 3

Additional context

A support bundle was attached to the original issue: longhorn-support-bundle.

About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Reactions: 1
  • Comments: 43 (17 by maintainers)

Most upvoted comments

The mkfs.xfs in the Longhorn CSI plugin is a newer version, mkfs.xfs 5.3.0. Filesystems created by this version will not work on RHEL 7 by default.

There is a manual workaround for this, but the best solution would be to ask users to upgrade to CentOS 8. Another benefit of that approach is that it avoids the slowness seen with the older kernel version: https://github.com/longhorn/longhorn/issues/2640

Ref: https://github.com/ceph/ceph-csi/issues/966#issuecomment-620655796
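One hedged sketch of the manual workaround mentioned above (the comment does not spell it out, so the exact flags are an assumption): pre-format the attached Longhorn volume with XFS features the RHEL/CentOS 7 (3.10) kernel can actually mount, e.g. reflink disabled.

  # On the node where the volume is attached; Longhorn exposes it under /dev/longhorn/<volume-name>.
  # Destructive: only run this on an empty volume. Disabling reflink keeps the filesystem
  # mountable by the older RHEL 7 kernel, which newer mkfs.xfs defaults break.
  mkfs.xfs -m reflink=0 /dev/longhorn/pvc-de440081-ecb8-4ab2-b5da-9aea75b19003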

I was on CentOS 7.9; I switched to Debian 11.4 and added the nfs-common package, which resolved it for me.

Node release-worker01 doesn’t have enough available space:

          Schedulable:
            type: Schedulable
            status: "False"
            lastprobetime: ""
            lasttransitiontime: "2022-03-24T03:49:02Z"
            reason: DiskPressure
            message: the disk default-disk-47d795d8889d00d3(/var/lib/longhorn/) on
              the node release-worker01 has 21915238400 available, but requires reserved
              31666128076, minimal 25% to schedule more replicas

Please do (a CLI sketch of these steps follows the list):

  1. Using the Longhorn UI, go to the Nodes tab
  2. Disable scheduling for release-worker01
  3. SSH into the node release-worker01
  4. Go to /var/lib/longhorn/replicas
  5. Clean up all the directories in that folder (there are no active replicas that Longhorn is aware of on this node)
  6. Using the Longhorn UI, go to the Nodes tab again
  7. Enable scheduling for release-worker01
  8. Longhorn will rebuild the replicas
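The node condition above reports roughly 20.4 GiB available against roughly 29.5 GiB of required reserved space on /var/lib/longhorn, hence the scheduling failure. A hedged CLI equivalent of the steps above (assuming the nodes.longhorn.io CRD exposes spec.allowScheduling, which mirrors the UI toggle):

  # Steps 1-2: disable scheduling for the node
  kubectl -n longhorn-system patch nodes.longhorn.io release-worker01 \
    --type merge -p '{"spec":{"allowScheduling":false}}'

  # Steps 3-5: remove the orphaned replica data on the node
  # (only safe because Longhorn has no active replicas it is aware of on this node)
  ssh release-worker01 'rm -rf /var/lib/longhorn/replicas/*'

  # Steps 6-8: re-enable scheduling; Longhorn will rebuild the replicas
  kubectl -n longhorn-system patch nodes.longhorn.io release-worker01 \
    --type merge -p '{"spec":{"allowScheduling":true}}'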

After checking the support bundle, we see that:

  • The volume cannot finish attaching because one or more replicas cannot be started.
  • A replica cannot be started because Longhorn cannot find the only instance manager for it. There are a lot of errors like: 2022-03-18T14:01:30.381993257+08:00 E0318 06:01:30.381849 1 replica_controller.go:201] fail to sync replica for longhorn-system/pvc-fd6f5ab7-0f18-4c4f-b099-8c7afd23ccba-r-4a7b6a1b: failed to get instance manager for instance pvc-fd6f5ab7-0f18-4c4f-b099-8c7afd23ccba-r-4a7b6a1b: can not find the only available instance manager for instance pvc-fd6f5ab7-0f18-4c4f-b099-8c7afd23ccba-r-4a7b6a1b, node release-worker01, instance manager image rancher/mirrored-longhornio-longhorn-instance-manager:v1_20211210, type replica
  • Longhorn cannot find the only instance manager because there are multiple duplicated default instance managers on node release-worker01.

Workaround: delete one of the duplicated instance managers: kubectl delete instancemanagers instance-manager-r-84356b81 -n longhorn-system

Ref: This is related to the issue https://github.com/longhorn/longhorn/issues/3000
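To see the duplicated replica instance managers on the node before deleting one, something like this should work (the longhorn.io/node label selector on the InstanceManager CRs is an assumption):

  # list replica instance managers scheduled on the affected node; duplicates show up here
  kubectl -n longhorn-system get instancemanagers -l longhorn.io/node=release-worker01

  # delete one of the duplicated default instance managers (from the workaround above)
  kubectl -n longhorn-system delete instancemanagers instance-manager-r-84356b81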

I ran into the same issue. I resolved it by deleting the associated VolumeAttachment and then restarting the pod.
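A hedged sketch of that workaround (the VolumeAttachment name is hypothetical; look it up from the stuck PV first):

  # find the VolumeAttachment that references the stuck PV
  kubectl get volumeattachments | grep pvc-de440081-ecb8-4ab2-b5da-9aea75b19003

  # delete it (name below is a placeholder), then recreate/restart the pod
  kubectl delete volumeattachment csi-<hash>
  kubectl delete pod volume-test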

I resolved my problem. Thanks @PhanLe1010!