longhorn: [QUESTION] How to rescue Faulted volumes

Question

After a node crash, I ended up with two faulted volumes that cannot be attached; they can't be mounted under /dev/longhorn/, but the Longhorn pods work fine, and the image files look fine to me.

time="2022-07-22T09:01:17Z" level=info msg="All replicas are failed, set engine salvageRequested to true" 
accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=node1 owner=node1 
state=detached volume=pvc-3cc715b2-aaa2-4c1d-a788-ffc71905874c                                                                                                                                                                                                                                           

time="2022-07-22T09:01:17Z" level=info msg="All replicas are failed, set engine salvageRequested to true" 
accessMode=rwx controller=longhorn-volume frontend=blockdev migratable=false node=node1 owner=node1 
shareEndpoint= shareState=stopped state=detached volume=pvc-04e953eb-5411-4433-82a4-e6e54aa7fb92
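(For reference, a sketch of inspecting the volume and replica state behind these logs; volumes.longhorn.io and replicas.longhorn.io are Longhorn's CRDs, and longhorn-system is the default install namespace:)

  # dump the full volume spec/status, including robustness and salvage flags
  kubectl -n longhorn-system get volumes.longhorn.io pvc-3cc715b2-aaa2-4c1d-a788-ffc71905874c -o yaml
  # list this volume's replicas and their failure state
  kubectl -n longhorn-system get replicas.longhorn.io | grep pvc-3cc715b2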

I wonder if I can run fsck or somehow repair the filesystem?
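(For reference, a minimal sketch of what such a repair could look like once the volume can be attached again, e.g. via Attach in maintenance mode from the Longhorn UI; the volume name is taken from the logs above, and ext4 is an assumed filesystem type:)

  # run on the node the volume is attached to
  VOLUME=pvc-3cc715b2-aaa2-4c1d-a788-ffc71905874c
  fsck.ext4 -n /dev/longhorn/$VOLUME   # check only, make no changes
  fsck.ext4 -y /dev/longhorn/$VOLUME   # repair, only after taking a backup or snapshot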

Environment

  • Longhorn version: 1.2.4 (recently upgraded from 1.2.3)
  • Kubernetes version: v1.24.2+k3s2
  • Node config
    • OS type and version: Debian Bullseye
    • CPU per node: 20
    • Memory per node: 128G
    • Disk type: RAID10 HDD
    • Network bandwidth and latency between the nodes: singleton node
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): K3S on debian

Additional context

I have another volume that can be mounted without errors.

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 16 (7 by maintainers)

Most upvoted comments

After lowering the minimal storage percentage, the node is back to schedulable, and the faulted volumes are back to degraded but attachable.
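(A sketch of making the same setting change with kubectl instead of the UI; storage-minimal-available-percentage is the Longhorn setting behind "Storage Minimal Available Percentage", and longhorn-system is the default install namespace:)

  # lower the threshold, e.g. from the default 25 to 10
  kubectl -n longhorn-system patch settings.longhorn.io storage-minimal-available-percentage \
    --type merge -p '{"value": "10"}'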


Now I can see it is attached under /dev/longhorn, but I cannot use it in my pods; the pod events are:

 Normal   SuccessfulAttachVolume  3m26s                attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-3cc715b2-aaa2-4c1d-a788-ffc71905874c"
 Warning  FailedMount             85s                  kubelet                  Unable to attach or mount volumes: unmounted volumes=[vol-01vd4], unattached volumes=[volume-localtime vol-xqhgx vol-vzmve vol-iwxcc vol-01vd4 vol-bhyma vol-zqby7 vol-gzwz volume-devmem vol-z7i0r vol-6rzut vol-5v1jh kube-api-access-9dfc8]: timed out waiting for the condition
 Warning  FailedMount             70s (x9 over 3m20s)  kubelet                  MountVolume.Setup failed while expanding volume for volume "pvc-3cc715b2-aaa2-4c1d-a788-ffc71905874c" : Expander.NodeExpand found CSI plugin kubernetes.io/csi/driver.longhorn.io to not support node expansion
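(The last event suggests a pending filesystem expansion is blocking the mount. A sketch of how to confirm that; the claim name and namespace come out of the first command:)

  # find the PVC bound to this PV
  kubectl get pv pvc-3cc715b2-aaa2-4c1d-a788-ffc71905874c \
    -o jsonpath='{.spec.claimRef.namespace}/{.spec.claimRef.name}'
  # look for a FileSystemResizePending condition on that PVC
  kubectl -n <namespace> describe pvc <claim-name>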

@PhanLe1010 Oh yes, I have a lot of other data using the local-path provisioner; I am evaluating switching to Longhorn.

Can you manually try to salvage the volume by:

  1. Make sure the node node1 and the disk /data/longhorn are schedulable
  2. Scale down the statefulset lab-shihs to 0
  3. Go to Longhorn UI -> click on the volume to go to the volume detail page -> click on the top right menu bar -> select salvage replica -> select the replica
  4. Scale the statefulset lab-shihs back up (a kubectl sketch follows this list)

Also, can you check the setting Automatic salvage in Longhorn UI -> Setting -> General setting to see if it is ON?
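(A minimal sketch of steps 2 and 4 with kubectl; the statefulset name is from this thread, while the namespace and original replica count are placeholders:)

  kubectl -n <namespace> scale statefulset lab-shihs --replicas=0
  # ... salvage the replica in the Longhorn UI ...
  kubectl -n <namespace> scale statefulset lab-shihs --replicas=1   # restore the original count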

The disk is not schedulable; it shows this message:

Last Transition Time: 17 days ago
Message: the disk default-disk-7f5c98b5a858e751(/data/longhorn) on the node node1 has 4531630899200 available, but requires reserved 322122547200, minimal 25% to schedule more replicas
Reason: DiskPressure
Status: False

However, 322122547200 / 4531630899200 = 0.071, i.e. the reserved space is only about 7% of the available space, which should be enough to satisfy the minimum.

When I try to salvage the replica, the output is:

unable to salvage volume pvc-04e953eb-5411-4433-82a4-e6e54aa7fb92: Disk with UUID 10540b0a-c68e-43f9-8dd0-91dc4cca4e1c on node node1 is unschedulable for replica pvc-04e953eb-5411-4433-82a4-e6e54aa7fb92-r-0eeb2c9e

And the Automatic salvage setting is ON.
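(A sketch of inspecting the disk's scheduling state directly; nodes.longhorn.io is Longhorn's Node CRD, and longhorn-system is the default install namespace:)

  kubectl -n longhorn-system get nodes.longhorn.io node1 -o yaml
  # check .spec.disks[...].allowScheduling and the Schedulable condition under
  # .status.diskStatus for the disk UUID in the error above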

Thank you for your help.
