longhorn: [BUG] Failed rebuilding a large replica of 3TB
Describe the bug I have a large PV (2 replicas) of 3TB into which I just rsynced data from another storage system; the size of the data in the PV is 2.4TB. When I rebooted one node, the replica on the rebooted node always failed, and another replica was created but never managed to finish rebuilding. So I tested reducing the replica count to 1 and increasing it again to 2, but rebuilding also failed. The CPU load on the source node was around 1 with a longhorn process running, but no data was actually transferred over the network during this time. Since the data was originally rsynced from another storage system, I assume the snapshot is 2.4TB with no holes in the sparse file. Does rebuilding fail because the volume is too large? I have other PVs of 50GB and have no issue rebuilding those.
To Reproduce Steps to reproduce the behavior:
- Create a PV with 2 replicas of 3TB
- Rsync 2.4TB of data into it
- Reduce replica count to 1, increase it again to 2
- Rebuilding ran for less than 1 hour, then failed
Expected behavior Replica should rebuild successfully.
Log
time="2021-04-18T06:00:54Z" level=error msg="Failed rebuilding of replica 10.42.1.207:10075" controller=longhorn-engine error="failed to add replica address='tcp://10.42.1.207:10075' to controller 'pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0': failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.1.0/longhorn [--url 10.42.2.65:10000 add tcp://10.42.1.207:10075], output , stderr, time=\"2021-04-18T05:13:17Z\" level=info msg=\"Adding replica tcp://10.42.1.207:10075 in WO mode\"\ntime=\"2021-04-18T05:13:17Z\" level=info msg=\"Using replica tcp://10.42.2.127:10075 as the source for rebuild \"\ntime=\"2021-04-18T05:13:17Z\" level=info msg=\"Using replica tcp://10.42.1.207:10075 as the target for rebuild \"\ntime=\"2021-04-18T06:00:54Z\" level=fatal msg=\"Error running add replica command: failed to sync files [{FromFileName:volume-snap-1dfe2ff5-eb10-41be-a686-1661013710be.img ToFileName:volume-snap-1dfe2ff5-eb10-41be-a686-1661013710be.img ActualSize:2593904701440} {FromFileName:volume-snap-1dfe2ff5-eb10-41be-a686-1661013710be.img.meta ToFileName:volume-snap-1dfe2ff5-eb10-41be-a686-1661013710be.img.meta ActualSize:0} {FromFileName:volume-snap-51bf61cc-0f68-4a03-848c-07b8c35a51d5.img ToFileName:volume-snap-51bf61cc-0f68-4a03-848c-07b8c35a51d5.img ActualSize:18315870208} {FromFileName:volume-snap-51bf61cc-0f68-4a03-848c-07b8c35a51d5.img.meta ToFileName:volume-snap-51bf61cc-0f68-4a03-848c-07b8c35a51d5.img.meta ActualSize:0} {FromFileName:volume-snap-5e304d7a-bec7-4ec1-9ec0-598443e46306.img ToFileName:volume-snap-5e304d7a-bec7-4ec1-9ec0-598443e46306.img ActualSize:34347081728} {FromFileName:volume-snap-5e304d7a-bec7-4ec1-9ec0-598443e46306.img.meta ToFileName:volume-snap-5e304d7a-bec7-4ec1-9ec0-598443e46306.img.meta ActualSize:0} {FromFileName:volume-snap-6dcea017-1fe4-45be-9f3f-88207a3de8cb.img ToFileName:volume-snap-6dcea017-1fe4-45be-9f3f-88207a3de8cb.img ActualSize:0} {FromFileName:volume-snap-6dcea017-1fe4-45be-9f3f-88207a3de8cb.img.meta ToFileName:volume-snap-6dcea017-1fe4-45be-9f3f-88207a3de8cb.img.meta ActualSize:0}] from tcp://10.42.2.127:10075: rpc error: code = Unavailable desc = transport is closing\"\n, error exit status 1" node=slave-sys2 volume=pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0
Support bundle: longhorn-support-bundle_a34f5b3a-be96-4afd-8271-6de8b0ebaa01_2021-04-18T06-33-04Z.zip
Environment:
- Longhorn version: 1.1.0
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Rancher Catalog App
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE2 1.20.5+rke2r1
- Number of management nodes in the cluster: 1
- Number of worker nodes in the cluster: 2
- Node config
- OS type and version: Ubuntu 20.04.2
- CPU per node: 32
- Memory per node: 256GB
- Disk type (e.g. SSD/NVMe): NVMe
- Network bandwidth between the nodes: 10Gbps
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
- Number of Longhorn volumes in the cluster: ~20
About this issue
- State: closed
- Created 3 years ago
- Comments: 32 (21 by maintainers)
Rebuilding and snapshot merge showed different behaviour. For the snapshot merge, I saw progress at 20%, then 40% hours later, etc. For rebuilding, the UI shows “Rebuilding …” as in the picture in the previous comment, but there is no network transfer for this snapshot, so it is effectively at 0% for 8 hours.
The issue is in the GetFiemapExtents function in client.go. I downloaded this fiemap.c and ran it against the source snapshot. It really took 7.5h with the load at 1 CPU, and the process is hard to kill (the same behaviour as when Longhorn was running during the initial 8h):
File volume-snap-7432db6c-6aef-4bd7-901d-a4cd5ccea236.img has 1216060 extents: 0.00user 26854.01system 7:28:05elapsed 99%CPU (0avgtext+0avgdata 134160maxresident)k 50368inputs+0outputs (0major+33621minor)pagefaults 0swaps
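For reference, a minimal Go sketch of the same kind of extent walk: it drives the FS_IOC_FIEMAP ioctl directly, like the fiemap.c test, and it assumes golang.org/x/sys/unix. It is not Longhorn's actual GetFiemapExtents code, just an illustration of what the 7.5h walk is doing:

```go
// fiemap_count.go — sketch: walk a file's extents with the FS_IOC_FIEMAP ioctl
// and report how many there are and how long the walk took, similar in spirit
// to the fiemap.c test above. This is NOT Longhorn's GetFiemapExtents code.
package main

import (
	"fmt"
	"os"
	"time"
	"unsafe"

	"golang.org/x/sys/unix"
)

// Mirrors struct fiemap_extent from <linux/fiemap.h> (56 bytes).
type fiemapExtent struct {
	Logical    uint64
	Physical   uint64
	Length     uint64
	Reserved64 [2]uint64
	Flags      uint32
	Reserved   [3]uint32
}

// Mirrors the fixed header of struct fiemap from <linux/fiemap.h> (32 bytes);
// the kernel writes the extent records directly after it in the same buffer.
type fiemapHdr struct {
	Start         uint64
	Length        uint64
	Flags         uint32
	MappedExtents uint32
	ExtentCount   uint32
	Reserved      uint32
}

const (
	fsIocFiemap      = 0xc020660b // FS_IOC_FIEMAP = _IOWR('f', 11, struct fiemap)
	fiemapExtentLast = 0x00000001 // FIEMAP_EXTENT_LAST
	batch            = 4096       // extents requested per ioctl call
)

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: fiemap_count <file>")
		os.Exit(1)
	}
	f, err := os.Open(os.Args[1])
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// One 8-byte-aligned buffer holding the header followed by `batch` extents.
	hdrSize := int(unsafe.Sizeof(fiemapHdr{}))
	extSize := int(unsafe.Sizeof(fiemapExtent{}))
	buf := make([]uint64, (hdrSize+batch*extSize)/8)
	hdr := (*fiemapHdr)(unsafe.Pointer(&buf[0]))
	exts := (*[batch]fiemapExtent)(unsafe.Pointer(&buf[hdrSize/8]))

	total, start, began := 0, uint64(0), time.Now()
	for {
		hdr.Start = start
		hdr.Length = ^uint64(0) // to EOF; the kernel clamps the range
		hdr.Flags = 0
		hdr.ExtentCount = batch

		_, _, errno := unix.Syscall(unix.SYS_IOCTL, f.Fd(), fsIocFiemap, uintptr(unsafe.Pointer(hdr)))
		if errno != 0 {
			panic(errno)
		}
		n := int(hdr.MappedExtents)
		total += n
		if n == 0 || exts[n-1].Flags&fiemapExtentLast != 0 {
			break
		}
		// Resume the walk right after the last extent returned in this batch.
		start = exts[n-1].Logical + exts[n-1].Length
	}
	fmt.Printf("%s has %d extents, walked in %s\n", os.Args[1], total, time.Since(began))
}
```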
NVMe disk rating with dbench: Random Read/Write IOPS: 260k/233k. BW: 3210MiB/s / 2675MiB/s. Average Latency (usec) Read/Write: 99.53/25.30. Sequential Read/Write: 3394MiB/s / 2660MiB/s. Mixed Random Read/Write IOPS: 166k/55.4k.
To show it's not the structure of the extents that is the issue, I created a sparse file using your method above (but with a 4k data size), alternating between data and holes: 4k data, 4k hole, 4k data, 4k hole, etc. Then I ran fiemap on it; it took 6h.
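A minimal sketch of how such an alternating 4k-data/4k-hole file can be generated (the file name and pair count below are illustrative, not the exact values from my test):

```go
// makeholes.go — sketch: build a sparse file that alternates 4k of data with a
// 4k hole, to stress the filesystem's extent metadata like the test described
// above. The file name and the number of pairs are illustrative only.
package main

import "os"

func main() {
	const block = 4096
	const pairs = 1 << 20 // ~1M data extents: 4GiB of data, 8GiB apparent size

	f, err := os.Create("alternating.img")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	data := make([]byte, block)
	for i := range data {
		data[i] = byte(i) // non-zero payload so the filesystem must allocate it
	}

	for i := int64(0); i < pairs; i++ {
		// Write 4k at every other 4k offset; the untouched 4k in between is
		// never written, so it remains a hole in the sparse file.
		if _, err := f.WriteAt(data, i*2*block); err != nil {
			panic(err)
		}
	}
	// Extend the file so the final gap is also a hole rather than EOF.
	if err := f.Truncate(pairs * 2 * block); err != nil {
		panic(err)
	}
}
```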
To show it's not an OS/disk issue, I did the same on CentOS 8.2 with a different SSD disk (the above ran on Ubuntu 20.04.2 and an NVMe disk); it took 7h.
Then I did the same on XFS filesystem. It took 1s!!
So, the ext4-based sparse file probably has to traverse all the extent trees, while XFS seems to keep this info readily available. I don't know if mkfs.ext4 has options to make it behave like XFS. But I have now retested the 1.9T production data using XFS: the initial write from the outside source is 30% slower, but snapshot merging now takes less than 2h, and rebuilding doesn't have that 8h lead time. Normal read/write performance is similar to ext4. Parallel checksums for the source and target snapshots would also help reduce rebuilding time when the target snapshot already exists.
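For what it's worth, the parallel-checksum idea could look roughly like the sketch below (assuming whole-file SHA-256 and placeholder snapshot paths; this is not how Longhorn's rebuild actually computes or compares checksums):

```go
// parallel_checksum.go — sketch of checksumming the source and target snapshot
// files concurrently, as suggested above. Paths are placeholders; this is not
// Longhorn's implementation.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"sync"
)

// checksumFile streams a file through SHA-256 and returns the hex digest.
func checksumFile(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	paths := []string{
		"/source/replica/volume-snap-example.img", // placeholder paths
		"/target/replica/volume-snap-example.img",
	}
	sums := make([]string, len(paths))
	errs := make([]error, len(paths))

	var wg sync.WaitGroup
	for i, p := range paths {
		wg.Add(1)
		go func(i int, p string) { // hash both snapshots at the same time
			defer wg.Done()
			sums[i], errs[i] = checksumFile(p)
		}(i, p)
	}
	wg.Wait()

	for i, p := range paths {
		if errs[i] != nil {
			fmt.Printf("%s: %v\n", p, errs[i])
			continue
		}
		fmt.Printf("%s: %s\n", p, sums[i])
	}
	if errs[0] == nil && errs[1] == nil && sums[0] == sums[1] {
		fmt.Println("snapshots match, full resync can be skipped")
	}
}
```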
Thanks for the help with this troubleshooting; the rebuilding steps you mentioned above are really helpful.
Verified with v1.1.2-rc1
Validation - Pass
Cluster config: 1 control plane, 3 worker nodes (4 vCPUs, 8GB RAM). Host OS: Ubuntu 20.04
File system: ext4
- Test with files created on the workload fs: Created test files inside the workload filesystem, using the dd command with the skip parameter to generate the data, e.g. dd if=/dev/urandom of=file1.txt count=100 bs=1M conv=fsync skip=10k
- Test with the above fio jobs: Used the above fio jobs to populate the data in the volume
File system: XFS. Data size: 10 Gi, with 10k holes introduced after each block
Note: The checksum of data is compared after each rebuild across all the replicas.
Creation of the Test Volume.
FIO jobs: You can use Job10 as a template if you only want to populate a single volume, then update the engines in between the rebuild tests, i.e. test a single rebuild at a time. This way you only need to run fio once to populate the volume. To use this, attach the <volume-name> (for job 10 it's rebuild-test) to a random cluster node, then run fio fio.job. This populates the volume and creates the appropriate sparse file on the replicas.
@joshimoo Thanks for explaining the expected behavior and testing steps in detail. The result looks good as per the validation comment above: https://github.com/longhorn/longhorn/issues/2507#issuecomment-866550145
Hi. They're degraded because the default replica count is 3, but I only have 2 nodes at the moment (slave-sys2/3). Most of the PVs are on SSD disks and have no issues because most PVs are smaller than 50G. The one that fails rebuilding is on NVMe. The NVMe disk is 6TB with the default overprovisioning of 200%.
There were no issues creating 2 replicas the first time the PV was created; I only realized it failed to resync and then to rebuild after 1 node was rebooted. Even recreating a new PV with 2 replicas and then changing the replica count to 1 and back to 2 failed to rebuild this PV.