longhorn: [QUESTION] How to rescue Faulted volumes

Question

After a node crash, I ended up with two faulted volumes that cannot be attached; they can't be mounted under /dev/longhorn/, but the Longhorn pods work fine, and the image files look fine to me.

time="2022-07-22T09:01:17Z" level=info msg="All replicas are failed, set engine salvageRequested to true" 
accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=node1 owner=node1 
state=detached volume=pvc-3cc715b2-aaa2-4c1d-a788-ffc71905874c                                                                                                                                                                                                                                           

time="2022-07-22T09:01:17Z" level=info msg="All replicas are failed, set engine salvageRequested to true" 
accessMode=rwx controller=longhorn-volume frontend=blockdev migratable=false node=node1 owner=node1 
shareEndpoint= shareState=stopped state=detached volume=pvc-04e953eb-5411-4433-82a4-e6e54aa7fb92
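(For reference, a sketch of inspecting the volume and replica state behind these logs; volumes.longhorn.io and replicas.longhorn.io are Longhorn's CRDs, and longhorn-system is the default install namespace:)

  # dump the full volume spec/status, including robustness and salvage flags
  kubectl -n longhorn-system get volumes.longhorn.io pvc-3cc715b2-aaa2-4c1d-a788-ffc71905874c -o yaml
  # list this volume's replicas and their failure state
  kubectl -n longhorn-system get replicas.longhorn.io | grep pvc-3cc715b2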

I wonder if I can run fsck or somehow repair the filesystem?
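(For reference, a minimal sketch of what such a repair could look like once the volume can be attached again, e.g. via Attach in maintenance mode from the Longhorn UI; the volume name is taken from the logs above, and ext4 is an assumed filesystem type:)

  # run on the node the volume is attached to
  VOLUME=pvc-3cc715b2-aaa2-4c1d-a788-ffc71905874c
  fsck.ext4 -n /dev/longhorn/$VOLUME   # check only, make no changes
  fsck.ext4 -y /dev/longhorn/$VOLUME   # repair, only after taking a backup or snapshot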

Environment

  • Longhorn version: 1.2.4 (recently upgraded from 1.2.3)
  • Kubernetes version: v1.24.2+k3s2
  • Node config
    • OS type and version: Debian Bullseye
    • CPU per node: 20
    • Memory per node: 128G
    • Disk type: RAID10 HDD
    • Network bandwidth and latency between the nodes: singleton node
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): K3S on debian

Additional context

I have another volume that can be mounted without errors.

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 16 (7 by maintainers)

Most upvoted comments

After lowering the minimal storage percentage, the node is back to schedulable, and the faulted volumes are back to degraded but attachable.
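(A sketch of making the same setting change with kubectl instead of the UI; storage-minimal-available-percentage is the Longhorn setting behind "Storage Minimal Available Percentage", and longhorn-system is the default install namespace:)

  # lower the threshold, e.g. from the default 25 to 10
  kubectl -n longhorn-system patch settings.longhorn.io storage-minimal-available-percentage \
    --type merge -p '{"value": "10"}'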


Now I can see it is attached under /dev/longhorn, but I cannot use it in my pods; the pod events are:

 Normal   SuccessfulAttachVolume  3m26s                attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-3cc715b2-aaa2-4c1d-a788-ffc71905874c"
 Warning  FailedMount             85s                  kubelet                  Unable to attach or mount volumes: unmounted volumes=[vol-01vd4], unattached volumes=[volume-localtime vol-xqhgx vol-vzmve vol-iwxcc vol-01vd4 vol-bhyma vol-zqby7 vol-gzwz volume-devmem vol-z7i0r vol-6rzut vol-5v1jh kube-api-access-9dfc8]: timed out waiting for the condition
 Warning  FailedMount             70s (x9 over 3m20s)  kubelet                  MountVolume.Setup failed while expanding volume for volume "pvc-3cc715b2-aaa2-4c1d-a788-ffc71905874c" : Expander.NodeExpand found CSI plugin kubernetes.io/csi/driver.longhorn.io to not support node expansion
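(The last event suggests a pending filesystem expansion is blocking the mount. A sketch of how to confirm that; the claim name and namespace come out of the first command:)

  # find the PVC bound to this PV
  kubectl get pv pvc-3cc715b2-aaa2-4c1d-a788-ffc71905874c \
    -o jsonpath='{.spec.claimRef.namespace}/{.spec.claimRef.name}'
  # look for a FileSystemResizePending condition on that PVC
  kubectl -n <namespace> describe pvc <claim-name>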

@PhanLe1010 Oh yes, I have a lot of other data using the local-path provisioner; I am evaluating switching to Longhorn.

Can you manually try to salvage the volume by:

  1. Make sure the node node1 and the disk /data/longhorn are schedulable
  2. Scale down the statefulset lab-shihs to 0
  3. Go to Longhorn UI -> click on the volume to go to the volume detail page -> click on the top right menu bar -> select salvage replica -> select the replica
  4. Scale the statefulset lab-shihs back up (a kubectl sketch follows this list)

Also, can you check the setting Automatic salvage in Longhorn UI -> Setting -> General setting to see if it is ON?
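(A minimal sketch of steps 2 and 4 with kubectl; the statefulset name is from this thread, while the namespace and original replica count are placeholders:)

  kubectl -n <namespace> scale statefulset lab-shihs --replicas=0
  # ... salvage the replica in the Longhorn UI ...
  kubectl -n <namespace> scale statefulset lab-shihs --replicas=1   # restore the original count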

The disk is not schedulable; it shows this message:

Last Transition Time: 17 days ago
Message: the disk default-disk-7f5c98b5a858e751(/data/longhorn) on the node node1 has 4531630899200 available, but requires reserved 322122547200, minimal 25% to schedule more replicas
Reason: DiskPressure
Status: False

However, 322122547200 / 4531630899200 = 0.071, i.e. the reserved space is only about 7% of the available space, which should be enough to satisfy the minimum.

When I try to salvage the replica, the output is:

unable to salvage volume pvc-04e953eb-5411-4433-82a4-e6e54aa7fb92: Disk with UUID 10540b0a-c68e-43f9-8dd0-91dc4cca4e1c on node node1 is unschedulable for replica pvc-04e953eb-5411-4433-82a4-e6e54aa7fb92-r-0eeb2c9e

And the Automatic salvage setting is ON.
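(A sketch of inspecting the disk's scheduling state directly; nodes.longhorn.io is Longhorn's Node CRD, and longhorn-system is the default install namespace:)

  kubectl -n longhorn-system get nodes.longhorn.io node1 -o yaml
  # check .spec.disks[...].allowScheduling and the Schedulable condition under
  # .status.diskStatus for the disk UUID in the error above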

Thank you for your help.
