longhorn: [BUG] Facing data loss in longhorn volume
Describe the bug (🐛 if you encounter this issue)
I am using Longhorn 1.3.1. I created one volume with two Longhorn replicas and uploaded data to it. After some time, the data disappeared from the volume.
To Reproduce
Expected behavior
Data should not be lost.
Log or Support bundle
In the instance manager log, I am seeing errors saying that volume-head-xxx is missing:
instance-manager-r-cf5b560f/replica-manager.log:19037:2023-03-16T05:46:11.836582701Z time="2023-03-16T05:46:11Z" level=warning msg="Failed to open server: 10.42.3.117:10005, Retrying..."
instance-manager-r-cf5b560f/replica-manager.log:19060:2023-03-16T05:48:56.277460168Z [pvc-bb5493c3-85ba-44e8-b3b9-a30531e29e32-r-44b09aef] time="2023-03-16T05:48:56Z" level=error msg="Fail to get size of file /host/datadisk/replicas/pvc-bb5493c3-85ba-44e8-b3b9-a30531e29e32-db63bf7c/volume-head-002.img: no such file or directory"
instance-manager-r-cf5b560f/replica-manager.log:19061:2023-03-16T05:48:56.277518561Z time="2023-03-16T05:48:56Z" level=error msg="Fail to get file /host/datadisk/replicas/pvc-bb5493c3-85ba-44e8-b3b9-a30531e29e32-db63bf7c/volume-head-002.img size"
instance-manager-r-cf5b560f/replica-manager.log:19062:2023-03-16T05:48:56.277526947Z time="2023-03-16T05:48:56Z" level=error msg="Fail to get size of file /host/datadisk/replicas/pvc-bb5493c3-85ba-44e8-b3b9-a30531e29e32-db63bf7c/volume-snap-s-13f4817e-7e50-49e6-b0ca-3f505e50d890.img: no such file or directory"
instance-manager-r-cf5b560f/replica-manager.log:19063:2023-03-16T05:48:56.285144257Z [pvc-bb5493c3-85ba-44e8-b3b9-a30531e29e32-r-44b09aef] time="2023-03-16T05:48:56Z" level=error msg="Fail to get size of file /host/datadisk/replicas/pvc-bb5493c3-85ba-44e8-b3b9-a30531e29e32-db63bf7c/volume-snap-046e9650-f84c-4f9c-9d2e-77ac9471b9b0.img: no such file or directory"
instance-manager-r-cf5b560f/replica-manager.log:19064:2023-03-16T05:48:56.285163525Z time="2023-03-16T05:48:56Z" level=error msg="Fail to head file /host/datadisk/replicas/pvc-bb5493c3-85ba-44e8-b3b9-a30531e29e32-db63bf7c/volume-head-002.img stat, err no such file or directory"
Environment
- Longhorn version: 1.3.1
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE 1.22
- Number of management node in the cluster: 3
- Number of worker node in the cluster: 0
- Node config
- OS type and version: RHEL 8.7
- CPU per node: 15
- Memory per node: 64G
- Disk type(e.g. SSD/NVMe): SSD
- Network bandwidth between the nodes:
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
- Number of Longhorn volumes in the cluster: 1
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 22 (10 by maintainers)
The issue is caused by a missing jq and is not related to https://github.com/longhorn/longhorn-manager/pull/1413/files.
Thanks @derekbit for looking into this.
Thanks a lot @derekbit for looking into it at short notice. We have deleted that daemonset and are hydrating the container registry again to validate whether that indeed solves the issue.
We will look into the orphaned replica folder cleanup feature as well.
Thanks to @innobead as well
BTW, v1.3.1 supports the orphaned replica directories cleanup. Why not try this feature instead?
jq is not found in the system, so replica_name will always be empty. Then, it hits the deletion path. That means the replica directory is always deleted even though the volume is healthy and the replica is in use.
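To illustrate the failure mode above, here is a minimal Python sketch of a guard for this kind of cleanup. It is not the script from this issue (which is not shown here); `require_tool`, `is_safe_to_delete`, and `LookupUnavailable` are hypothetical names. The idea is only that a missing tool or a failed lookup should stop the cleanup, rather than make every replica directory look orphaned.

```python
import shutil
from typing import Optional, Set


class LookupUnavailable(Exception):
    """Raised when a tool the cleanup depends on is missing (hypothetical)."""


def require_tool(name: str) -> str:
    """Return the tool's path, or abort instead of continuing with empty output."""
    path = shutil.which(name)
    if path is None:
        raise LookupUnavailable(f"{name} not found on PATH; refusing to clean up")
    return path


def is_safe_to_delete(dir_name: str, in_use_dirs: Optional[Set[str]]) -> bool:
    """Allow deletion only if the lookup succeeded and the directory is unused.

    in_use_dirs is None when the lookup failed (for example, jq was missing).
    """
    if in_use_dirs is None:
        return False  # unknown state: never treat it as "orphaned"
    return dir_name not in in_use_dirs
```

With a check like this, calling `require_tool("jq")` before any lookup would fail loudly on a node without jq, and a failed lookup passed as `None` would never result in a deletion.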
This is irrelevant to this issue.
@rajivml feel free to reopen this if the issue still persists.
BTW, I suggest following each release note, so you will better understand which new features/improvements/bugfixes were introduced.
Cool. Thank you.
@mynktl I would like the dmesg output as well so we can check them together. Thank you.