longhorn: [BUG] Failed rebuilding a large replica of 3TB

Describe the bug
I have a large PV (2 replicas) of 3TB whose data I just rsynced from another storage system; the size of the data in the PV is 2.4TB. When I rebooted one node, the replica on the rebooted node always failed, and another replica was created but never managed to finish rebuilding. So I tested reducing the replica count to 1 and increasing it again to 2, but rebuilding failed as well. The CPU load on the source node was around 1 with a longhorn process running, but no data was actually transferred over the network during this time. Since the data was originally rsynced from another storage system, I assume this snapshot is 2.4TB with no holes in the sparse file. Is the reason it fails that it's too large? I have other PVs of 50GB and have no issue rebuilding them.

To Reproduce
Steps to reproduce the behavior:

  1. Create a PV with 2 replicas of 3TB
  2. Rsync 2.4TB of data into it
  3. Reduce the replica count to 1, then increase it again to 2
  4. Rebuilding ran for less than 1 hour, then failed

Expected behavior
Replica should rebuild successfully.

Log
Longhorn manager log from when the issue happened:

time="2021-04-18T06:00:54Z" level=error msg="Failed rebuilding of replica 10.42.1.207:10075" controller=longhorn-engine error="failed to add replica address='tcp://10.42.1.207:10075' to controller 'pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0': failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.1.0/longhorn [--url 10.42.2.65:10000 add tcp://10.42.1.207:10075], output , stderr, time=\"2021-04-18T05:13:17Z\" level=info msg=\"Adding replica tcp://10.42.1.207:10075 in WO mode\"\ntime=\"2021-04-18T05:13:17Z\" level=info msg=\"Using replica tcp://10.42.2.127:10075 as the source for rebuild \"\ntime=\"2021-04-18T05:13:17Z\" level=info msg=\"Using replica tcp://10.42.1.207:10075 as the target for rebuild \"\ntime=\"2021-04-18T06:00:54Z\" level=fatal msg=\"Error running add replica command: failed to sync files [{FromFileName:volume-snap-1dfe2ff5-eb10-41be-a686-1661013710be.img ToFileName:volume-snap-1dfe2ff5-eb10-41be-a686-1661013710be.img ActualSize:2593904701440} {FromFileName:volume-snap-1dfe2ff5-eb10-41be-a686-1661013710be.img.meta ToFileName:volume-snap-1dfe2ff5-eb10-41be-a686-1661013710be.img.meta ActualSize:0} {FromFileName:volume-snap-51bf61cc-0f68-4a03-848c-07b8c35a51d5.img ToFileName:volume-snap-51bf61cc-0f68-4a03-848c-07b8c35a51d5.img ActualSize:18315870208} {FromFileName:volume-snap-51bf61cc-0f68-4a03-848c-07b8c35a51d5.img.meta ToFileName:volume-snap-51bf61cc-0f68-4a03-848c-07b8c35a51d5.img.meta ActualSize:0} {FromFileName:volume-snap-5e304d7a-bec7-4ec1-9ec0-598443e46306.img ToFileName:volume-snap-5e304d7a-bec7-4ec1-9ec0-598443e46306.img ActualSize:34347081728} {FromFileName:volume-snap-5e304d7a-bec7-4ec1-9ec0-598443e46306.img.meta ToFileName:volume-snap-5e304d7a-bec7-4ec1-9ec0-598443e46306.img.meta ActualSize:0} {FromFileName:volume-snap-6dcea017-1fe4-45be-9f3f-88207a3de8cb.img ToFileName:volume-snap-6dcea017-1fe4-45be-9f3f-88207a3de8cb.img ActualSize:0} {FromFileName:volume-snap-6dcea017-1fe4-45be-9f3f-88207a3de8cb.img.meta ToFileName:volume-snap-6dcea017-1fe4-45be-9f3f-88207a3de8cb.img.meta ActualSize:0}] from tcp://10.42.2.127:10075: rpc error: code = Unavailable desc = transport is closing\"\n, error exit status 1" node=slave-sys2 volume=pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0

Support Bundle: longhorn-support-bundle_a34f5b3a-be96-4afd-8271-6de8b0ebaa01_2021-04-18T06-33-04Z.zip

Environment:

  • Longhorn version: 1.1.0
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Rancher Catalog App
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE2 1.20.5+rke2r1
    • Number of management nodes in the cluster: 1
    • Number of worker nodes in the cluster: 2
  • Node config
    • OS type and version: Ubuntu 20.04.2
    • CPU per node: 32
    • Memory per node: 256GB
    • Disk type (e.g. SSD/NVMe): NVMe
    • Network bandwidth between the nodes: 10Gbps
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
  • Number of Longhorn volumes in the cluster: ~20


About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 32 (21 by maintainers)

Most upvoted comments

Rebuilding and snapshot merge showed different behaviour. For the snapshot merge, I saw progress at 20%, hours later at 40%, etc. For rebuilding, the UI shows "Rebuilding …" as in the picture in the previous comment, but there is no network transfer for this snapshot, so it is effectively at 0% for 8 hours.

The issue is in the GetFiemapExtents function in client.go. I downloaded this fiemap.c and ran it against the source snapshot. It really did take 7.5h at a load of 1 CPU, and the process was hard to kill (the same behaviour as when longhorn was running during the initial 8h):

File volume-snap-7432db6c-6aef-4bd7-901d-a4cd5ccea236.img has 1216060 extents: 0.00user 26854.01system 7:28:05elapsed 99%CPU (0avgtext+0avgdata 134160maxresident)k 50368inputs+0outputs (0major+33621minor)pagefaults 0swaps
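For anyone who wants to reproduce this measurement without compiling fiemap.c, below is a minimal Go sketch of the same kind of FIEMAP extent walk (my own illustration, not Longhorn's actual GetFiemapExtents code). It batches FS_IOC_FIEMAP ioctl calls and only counts extents; the batch size is an arbitrary choice.

// fiemap_count.go - count a file's extents via the FS_IOC_FIEMAP ioctl (Linux).
// Minimal sketch; not Longhorn's GetFiemapExtents implementation.
package main

import (
	"fmt"
	"os"
	"syscall"
	"unsafe"
)

const (
	fsIocFiemap      = 0xC020660B // _IOWR('f', 11, struct fiemap) on 64-bit Linux
	fiemapFlagSync   = 0x1        // sync the file before mapping
	fiemapExtentLast = 0x1        // set on the last extent of the file
	batchSize        = 1024       // extents requested per ioctl
)

// Mirrors struct fiemap_extent from <linux/fiemap.h> (56 bytes).
type fiemapExtent struct {
	Logical    uint64
	Physical   uint64
	Length     uint64
	Reserved64 [2]uint64
	Flags      uint32
	Reserved   [3]uint32
}

// Mirrors the 32-byte struct fiemap header followed by a fixed extent array.
type fiemapArg struct {
	Start         uint64
	Length        uint64
	Flags         uint32
	MappedExtents uint32
	ExtentCount   uint32
	Reserved      uint32
	Extents       [batchSize]fiemapExtent
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: fiemap_count <file>")
		os.Exit(1)
	}
	f, err := os.Open(os.Args[1])
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var total, start uint64
	for {
		arg := fiemapArg{
			Start:       start,
			Length:      ^uint64(0) - start, // map from start to EOF
			Flags:       fiemapFlagSync,
			ExtentCount: batchSize,
		}
		_, _, errno := syscall.Syscall(syscall.SYS_IOCTL, f.Fd(),
			fsIocFiemap, uintptr(unsafe.Pointer(&arg)))
		if errno != 0 {
			panic(errno)
		}
		if arg.MappedExtents == 0 {
			break
		}
		total += uint64(arg.MappedExtents)
		last := arg.Extents[arg.MappedExtents-1]
		if last.Flags&fiemapExtentLast != 0 {
			break
		}
		start = last.Logical + last.Length // continue after the last mapped extent
	}
	fmt.Printf("%s has %d extents\n", os.Args[1], total)
}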

NVMe disk rating with dbench:
Random Read/Write IOPS: 260k/233k
BW: 3210MiB/s / 2675MiB/s
Average Latency (usec) Read/Write: 99.53/25.30
Sequential Read/Write: 3394MiB/s / 2660MiB/s
Mixed Random Read/Write IOPS: 166k/55.4k

To show it is not the structure of the extents that is the issue, I created a sparse file using your method above (but with a 4k data size), alternating between data and holes: 4k data, 4k hole, 4k data, 4k hole, and so on. Then I ran fiemap on it; it took 6h.
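As an aside, a file like that can also be generated with a few lines of Go. Here is a sketch (not the method referenced above; the output name and the 16 GiB size are arbitrary placeholders) that writes 4k of data every 8k, so the untouched gaps remain holes.

// make_alternating.go - create a sparse file that alternates 4k of data
// with a 4k hole, similar to the worst case discussed above.
package main

import (
	"fmt"
	"os"
)

const (
	blockSize  = 4096            // 4k data followed by a 4k hole
	totalSize  = int64(16) << 30 // 16 GiB apparent size; adjust as needed
	outputFile = "alternating-4k.img"
)

func main() {
	f, err := os.Create(outputFile)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	data := make([]byte, blockSize)
	for i := range data {
		data[i] = 0xAB // non-zero payload so the block is real data, not a hole
	}

	// Write 4k of data every 8k; the untouched 4k gaps stay as holes.
	for off := int64(0); off < totalSize; off += 2 * blockSize {
		if _, err := f.WriteAt(data, off); err != nil {
			panic(err)
		}
	}
	// Extend to the full apparent size so the file ends on a trailing hole.
	if err := f.Truncate(totalSize); err != nil {
		panic(err)
	}
	fmt.Printf("created %s: %d bytes apparent, alternating 4k data / 4k holes\n",
		outputFile, totalSize)
}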

To show it is not an OS/disk issue, I did the same on CentOS 8.2 with a different SSD disk (the above ran on Ubuntu 20.04.2 with an NVMe disk); it took 7h.

Then I did the same on an XFS filesystem. It took 1s!!

So the ext4-based sparse file probably has to traverse all of the extent trees, while XFS seems to keep this information readily available somewhere. I don't know if mkfs.ext4 has options to make it behave like XFS. But I have now retested the 1.9T production data using XFS: the initial write from the outside source is 30% slower, but snapshot merging now takes less than 2h, and rebuilding doesn't have that 8h lead time. Normal read/write performance is similar to ext4. A parallel checksum of the source and target snapshots would also help reduce rebuilding time when the target snapshot already exists.
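To illustrate the parallel-checksum idea, here is a hedged Go sketch (not Longhorn code) that hashes a source and a target snapshot file concurrently and compares the digests; the file paths are placeholders.

// parallel_checksum.go - hash two snapshot files concurrently and compare.
// Illustration of the "parallel checksum" suggestion above; not Longhorn code.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// checksumFile streams a file through SHA-256 and returns the hex digest.
func checksumFile(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

type result struct {
	path string
	sum  string
	err  error
}

func main() {
	paths := []string{
		"source/volume-snap-example.img", // placeholder paths
		"target/volume-snap-example.img",
	}

	ch := make(chan result, len(paths))
	for _, p := range paths {
		go func(p string) {
			sum, err := checksumFile(p)
			ch <- result{path: p, sum: sum, err: err}
		}(p)
	}

	sums := map[string]string{}
	for range paths {
		r := <-ch
		if r.err != nil {
			panic(r.err)
		}
		sums[r.path] = r.sum
		fmt.Printf("%s  %s\n", r.sum, r.path)
	}

	if sums[paths[0]] == sums[paths[1]] {
		fmt.Println("checksums match: target snapshot can be reused")
	} else {
		fmt.Println("checksums differ: target snapshot needs a full resync")
	}
}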

Thanks for the help with this troubleshooting; the rebuilding steps you mentioned above are really helpful.

Verified with v1.1.2-rc1

Validation - Pass

Cluster Config : 1 control plane, 3 worker nodes (4 vCPUs, 8GB RAM)
Host OS : Ubuntu 20.04

cat /proc/version
Linux version 5.4.0-73-generic (buildd@lcy01-amd64-019) (gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04))

File System : ext4


Test with files created on the workload fs : Created test files inside the workload filesystem, using the dd command with the skip parameter to generate the data, e.g. dd if=/dev/urandom of=file1.txt count=100 bs=1M conv=fsync skip=10k

Data / Version | v1.1.1 | v1.1.2-rc1
2 Gi (2 files of 1 Gi) with 10K holes after each block | ~1 min 55 sec | ~1 min 42 sec
2 Gi (2000 files of 1M) with 10K holes after each block | ~2 min 30 sec | ~1 min 9 sec
10 Gi (10000 files of 1M) with 100K holes after each block | ~12 min | ~3 min 40 sec
50 Gi (50000 files of 1M) with 10k holes after each block | ~39 min | ~21 min
50 Gi (50 files of 1G) | ~1 hr 1 min 15 sec | ~25 min

Test with above fio jobs : Used the above fio jobs to populate the data in the volume

  • Best Case - 128k data, 4k holes - 5G
    v1.1.1: 5 min 4 sec | v1.1.2-rc1: 2 min 36 sec
  • Worst Case - 4k data, 4k holes - 5G
    v1.1.1: 11 hr 58 min | v1.1.2-rc1: 2 hr 57 min

File System : XFS
Data Size : 10 Gi with 10k holes introduced after each block

v1.1.1: 9 min 26 sec | v1.1.2-rc1: 3 min 11 sec

Note: The checksum of data is compared after each rebuild across all the replicas.

Creation of the Test Volume.

FIO jobs: You can use Job10 as a template if you only want to populate a single volume, then update the engines in between the rebuild tests, i.e. test a single rebuild at a time. This way you only need to run fio once to populate the volume. To use this, attach the <volume-name> (for job 10 it's rebuild-test) to a random cluster node, then run fio fio.job. This populates the volume and creates the appropriate sparse file on the replicas.

[global]
bs=4k
iodepth=128
direct=1
sync=0
end_fsync=1
ioengine=libaio
randseed=1
randrepeat=0
# buffer_pattern="X"
group_reporting

# 4k bs
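# Note: rw=write:<N> does sequential writes and skips N bytes after each write,
# so e.g. bs=4k with rw=write:4k lays down 4k of data followed by a 4k hole.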
[job1]
bs=4k
rw=write:4k
filename=/dev/longhorn/rebuild-4k-old
name=data4k-holes4k-old
io_size=5G

[job2]
bs=4k
rw=write:4k
filename=/dev/longhorn/rebuild-4k-new
name=data4k-holes4k-new
io_size=5G


# 128k bs
[job4]
bs=128k
rw=write:4k
filename=/dev/longhorn/rebuild-128k-old
name=data128k-holes4k-old
io_size=5G

[job5]
bs=128k
rw=write:4k
filename=/dev/longhorn/rebuild-128k-new
name=data128k-holes4k-new
io_size=5G


# 512k bs
[job7]
bs=512k
rw=write:4k
filename=/dev/longhorn/rebuild-512k-old
name=data512k-holes4k-old
io_size=5G

[job8]
bs=512k
rw=write:4k
filename=/dev/longhorn/rebuild-512k-new
name=data512k-holes4k-new
io_size=5G

# 50GB test, need to switch engines between tests
[job10]
bs=128k
rw=write:4k
filename=/dev/longhorn/rebuild-test # specify the <volume-name>
name=test-total50gb-data128k-holes4k
io_size=50G # you can specify io_size to limit the volume to only be half written for example

@joshimoo Thanks for explaining the expected behavior and testing steps in detail. The result looks good as per the validation comment above: https://github.com/longhorn/longhorn/issues/2507#issuecomment-866550145

Hi. They're degraded because the default replica count is 3, but I only have 2 nodes at the moment (slave-sys2/3). Most of the PVs are on SSD disks and have no issues because most PVs are smaller than 50G. The one that fails rebuilding is on NVMe. The NVMe disk is 6TB with the default over-provisioning of 200%.

There were no issues creating 2 replicas the first time the PV was created; I only realized it failed to resync and then to rebuild after 1 node was rebooted. Even recreating a new PV with 2 replicas and then changing the replica count to 1 and back to 2 failed to rebuild this PV.