longhorn: [BUG] Facing data loss in longhorn volume

Describe the bug (🐛 if you encounter this issue)

I am using Longhorn 1.3.1. I created one volume with two Longhorn replicas and uploaded data to the volume. After some time, the data disappeared from the volume.

To Reproduce

Expected behavior

Data should not be lost.

Log or Support bundle

In the instance manager logs, I am seeing a volume-head-xxx missing error:

instance-manager-r-cf5b560f/replica-manager.log:19037:2023-03-16T05:46:11.836582701Z time="2023-03-16T05:46:11Z" level=warning msg="Failed to open server: 10.42.3.117:10005, Retrying..."
instance-manager-r-cf5b560f/replica-manager.log:19060:2023-03-16T05:48:56.277460168Z [pvc-bb5493c3-85ba-44e8-b3b9-a30531e29e32-r-44b09aef] time="2023-03-16T05:48:56Z" level=error msg="Fail to get size of file /host/datadisk/replicas/pvc-bb5493c3-85ba-44e8-b3b9-a30531e29e32-db63bf7c/volume-head-002.img: no such file or directory"
instance-manager-r-cf5b560f/replica-manager.log:19061:2023-03-16T05:48:56.277518561Z time="2023-03-16T05:48:56Z" level=error msg="Fail to get file /host/datadisk/replicas/pvc-bb5493c3-85ba-44e8-b3b9-a30531e29e32-db63bf7c/volume-head-002.img size"
instance-manager-r-cf5b560f/replica-manager.log:19062:2023-03-16T05:48:56.277526947Z time="2023-03-16T05:48:56Z" level=error msg="Fail to get size of file /host/datadisk/replicas/pvc-bb5493c3-85ba-44e8-b3b9-a30531e29e32-db63bf7c/volume-snap-s-13f4817e-7e50-49e6-b0ca-3f505e50d890.img: no such file or directory"
instance-manager-r-cf5b560f/replica-manager.log:19063:2023-03-16T05:48:56.285144257Z [pvc-bb5493c3-85ba-44e8-b3b9-a30531e29e32-r-44b09aef] time="2023-03-16T05:48:56Z" level=error msg="Fail to get size of file /host/datadisk/replicas/pvc-bb5493c3-85ba-44e8-b3b9-a30531e29e32-db63bf7c/volume-snap-046e9650-f84c-4f9c-9d2e-77ac9471b9b0.img: no such file or directory"
instance-manager-r-cf5b560f/replica-manager.log:19064:2023-03-16T05:48:56.285163525Z time="2023-03-16T05:48:56Z" level=error msg="Fail to head file /host/datadisk/replicas/pvc-bb5493c3-85ba-44e8-b3b9-a30531e29e32-db63bf7c/volume-head-002.img stat, err no such file or directory"

Environment

  • Longhorn version: 1.3.1
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE 1.22
    • Number of management nodes in the cluster: 3
    • Number of worker nodes in the cluster: 0
  • Node config
    • OS type and version: RHEL 8.7
    • CPU per node: 15
    • Memory per node: 64G
    • Disk type(e.g. SSD/NVMe): SSD
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
  • Number of Longhorn volumes in the cluster: 1

About this issue

  • State: closed
  • Created a year ago
  • Comments: 22 (10 by maintainers)

Most upvoted comments

@derekbit one last confirmation: is it actually the jq issue, or has the daemonset become irrelevant because of this PR? https://github.com/longhorn/longhorn-manager/pull/1413/files

The issue is caused by the missing jq and is not related to https://github.com/longhorn/longhorn-manager/pull/1413/files .

Thanks @derekbit for looking into this.

Thanks a lot @derekbit for looking into it at short notice. We have deleted that daemonset and are hydrating the container registry again to validate whether it indeed solves the issue.

We will look into the orphaned replica folder cleanup feature as well.

Thanks to @innobead as well

BTW, v1.3.1 supports orphaned replica directory cleanup. Why not try that feature instead?
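
(For reference, a minimal kubectl sketch of how the orphan cleanup feature is typically driven; the orphans.longhorn.io resource and the orphan-auto-deletion setting names follow the Longhorn v1.3 documentation, so verify them against your installed version.)

    # List the orphaned replica data directories Longhorn has detected.
    kubectl -n longhorn-system get orphans.longhorn.io

    # Deleting an Orphan CR tells Longhorn to remove the on-disk directory.
    kubectl -n longhorn-system delete orphan <orphan-name>

    # Or let Longhorn clean up orphaned replica directories automatically.
    kubectl -n longhorn-system patch settings.longhorn.io orphan-auto-deletion \
      --type merge -p '{"value":"true"}'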

/bin/bash: line 8: jq: command not found
Replica not found but Volume found with a valid status (robust status healthy). Data directory /datadisk//replicas/pvc-b56daf56-0332-47de-9b00-1855853b58c6-a7754531/ can be deleted

Since jq is not found on the system, replica_name will always be empty. Then, it hits:

                    echo "Replica not found but Volume found with a valid status (robust status ${robust_status}). Data directory $dir can be deleted"
                    rm -rf $dir

That means the replica directory is always deleted even though the volume is healthy and the replica is in use.
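
To make the failure mode concrete, here is a minimal bash sketch (not the actual longhorn-manager script; the variable names and JSON shape are hypothetical stand-ins) showing how a missing jq silently yields an empty replica_name and routes a healthy, in-use replica into the deletion branch:

    #!/bin/bash
    # Hypothetical stand-ins for values the real script derives elsewhere.
    dir="/host/datadisk/replicas/pvc-example-0000"
    robust_status="healthy"
    volume_json='{"replica":"pvc-example-r-0000"}'

    # If jq is not installed, this prints "jq: command not found" to stderr
    # and the command substitution returns an empty string.
    replica_name=$(echo "$volume_json" | jq -r '.replica')

    if [ -z "$replica_name" ]; then
        # An empty replica_name is indistinguishable from "replica not found",
        # so the data directory is deleted even though the volume is healthy.
        echo "Replica not found but Volume found with a valid status (robust status ${robust_status}). Data directory $dir can be deleted"
        rm -rf "$dir"
    fi

    # A defensive preflight check would avoid this entirely:
    # command -v jq >/dev/null 2>&1 || { echo "jq not installed; aborting"; exit 1; }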

@derekbit it looks like this PR has made that script irrelevant? https://github.com/longhorn/longhorn-manager/pull/1413/files

Can you please confirm?

This is irrelevant to this issue.

@rajivml feel free to reopen this if the issue still persists.

BTW, I suggest following each release note so you will better understand what new features, improvements, and bug fixes are introduced.

@derekbit I am not seeing the issue right now in pvc-a15c8d7a-c52f-4e71-8a8c-a5465d2da719-f554eeca. This PVC is new; I recreated it after the data loss.

Can you reproduce the issue easily? If yes, can you provide the latest support bundle and the dmesg output of each node?

Cool. Thank you.

I will try to reproduce it. I have dmesg logs from the previous run. If you want, I can share them.

@mynktl please share the dmesg logs so we can check them together. Thank you.