longhorn: [BUG] data corruption due to COW and block size not being aligned during rebuilding replicas

Describe the bug

Data in the cloned volume sometimes has a checksum that does not match the data in the source volume

To Reproduce

Clone a volume 100 times; sometimes the problem shows up. Below are the steps to clone a volume 100 times (a minimal kubectl sketch of the clone-and-verify loop follows the error output)

  1. Deploy this test pod, which contains a function that repeatedly clones a volume 100 times and compares the checksums of the data
  2. Exec into the test pod and run pytest -svvl test_csi_snapshotter.py::test_csi_snapshot_snap_create_volume_from_snapshot_2
  3. See error
    test_csi_snapshotter.py::test_csi_snapshot_snap_create_volume_from_snapshot_2 run number: 0
    run number: 1
    run number: 2
    run number: 3
    run number: 4
    run number: 5
    run number: 6
    run number: 7
    run number: 8
    run number: 9
    run number: 10
    run number: 11
    run number: 12
    run number: 13
    run number: 14
    run number: 15
    FAILED
    >           assert expected_md5sum == created_md5sum
    E           AssertionError: assert '4f4451eb7ed3915e9e7cecabdcb84b90\n' == 'e281b2d2b6de757c34bf5460de97ebb7\n'
    E             - 4f4451eb7ed3915e9e7cecabdcb84b90
    E             + e281b2d2b6de757c34bf5460de97ebb7
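
For reference, this is roughly what the clone-and-verify loop boils down to when driven with plain kubectl instead of the pytest harness. All object names (source-pvc, snap-$i, clone-$i, the longhorn StorageClass and the longhorn-snapclass VolumeSnapshotClass) are assumptions for illustration; the actual test logic lives in test_csi_snapshotter.py:

for i in $(seq 1 100); do
  # Take a CSI snapshot of the source volume.
  cat <<EOF | kubectl apply -f -
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: snap-$i
spec:
  volumeSnapshotClassName: longhorn-snapclass
  source:
    persistentVolumeClaimName: source-pvc
EOF

  # (wait for the VolumeSnapshot to become readyToUse)

  # Create a new PVC from that snapshot. Longhorn clones the data into the first
  # replica of the new volume and then rebuilds the remaining replicas from it.
  cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: clone-$i
spec:
  storageClassName: longhorn
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: snap-$i
EOF

  # Mount both PVCs in pods (omitted here) and compare the checksums of the test file:
  #   md5sum /mnt/source/test   vs   md5sum /mnt/clone/test
done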
    

Expected behavior

The checksums should always match

Environment

The environment was created using this Terraform script with customized values:

use_hdd = true
arch = "amd64"
lh_aws_instance_type_controlplane = "t2.xlarge"
lh_aws_instance_type_worker = "t2.xlarge"
  • Longhorn version: master-head Aug 02, 2022

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 15 (14 by maintainers)

Most upvoted comments

OS       Disk Type   Res   Additional Info
Ubuntu   SSD               Cloning 200 times
Ubuntu   HDD               Cloning 50 times since it takes a very long time on HDD
Sles     SSD               Cloning 200 times
Sles     HDD               Cloning 2-16 times

Test case status

I paused the test when I hit the checksum mismatch. Here is the support bundle longhorn-support-bundle_9b928743-a394-4beb-8260-888016cb4427_2022-09-13T21-52-37Z.zip. The test run's output:

platform linux -- Python 3.9.13, pytest-5.3.1, py-1.11.0, pluggy-0.13.1 -- /usr/bin/python3.9
cachedir: .pytest_cache
rootdir: /integration, inifile: pytest.ini
plugins: order-1.0.1, repeat-0.9.1
collected 1 item                                                                                                                                  

test_csi_snapshotter.py::test_csi_snapshot_snap_create_volume_from_snapshot_2 run number: 0
run number: 1
run number: 2
run number: 3
run number: 4
run number: 5
run number: 6
run number: 7
run number: 8
run number: 9
expected_md5sum: b2a2e36a43f9e334a07f85d75c3129da

created_md5sum: 88a468c5900c9d43a1609c8dc8b51f1c

Press c to continue: 

The source volume is: longhorn-testvol-p37e1k. The target volume is: pvc-2052eefc-c0a4-4cee-b637-baf243130130

Investigation steps

  1. Checking the support bundle logs for any error related to the clone: there is none. Events and logs from the target volume look normal

  2. Checking the checksums of the snapshot files of the target volume's replicas. I do see a checksum mismatch of volume-head between the replicas:

    • This is the checksum of the first replica, pvc-2052eefc-c0a4-4cee-b637-baf243130130-r-8fcce01b, of the target volume (i.e., this replica got its data cloned from the source volume):
      ip-10-0-2-11:/var/lib/longhorn/replicas/pvc-2052eefc-c0a4-4cee-b637-baf243130130-f1d332bd # md5sum *
      4a4f668e07a97d86991410b53454a9f1  revision.counter
      5fecbd5c9cc86e988ca8bfc08b3258bb  volume-head-003.img
      366f6c06cd7c4347ff7885ceb2f16f71  volume-head-003.img.meta
      914b9e65cf0d573b77b2b3b995616fc6  volume-snap-a263ad02-fddc-42c0-ab67-31a3668beaff.img
      8d44e39875a262f59417050f281e73d5  volume-snap-a263ad02-fddc-42c0-ab67-31a3668beaff.img.meta
      97560a9b0b8c2d74d812cfddbb68a318  volume.meta
      
    • This is the checksum of the second replica, pvc-2052eefc-c0a4-4cee-b637-baf243130130-r-17536a2e (this replica got rebuilt after the first replica pvc-2052eefc-c0a4-4cee-b637-baf243130130-r-8fcce01b cloned data from the source volume):
      ip-10-0-2-153:/var/lib/longhorn/replicas/pvc-2052eefc-c0a4-4cee-b637-baf243130130-32ee9cec # md5sum *
      4a4f668e07a97d86991410b53454a9f1  revision.counter
      5fecbd5c9cc86e988ca8bfc08b3258bb  volume-head-002.img
      6ea5583486fb96ee359b0e34f5804688  volume-head-002.img.meta
      914b9e65cf0d573b77b2b3b995616fc6  volume-snap-a263ad02-fddc-42c0-ab67-31a3668beaff.img
      8d44e39875a262f59417050f281e73d5  volume-snap-a263ad02-fddc-42c0-ab67-31a3668beaff.img.meta
      7a91d75d0e8f9fc5473df91fa17ab605  volume.meta
      
    • This is the checksum of the 3rd replica, pvc-2052eefc-c0a4-4cee-b637-baf243130130-r-833697c3 (this replica got rebuilt after the first replica pvc-2052eefc-c0a4-4cee-b637-baf243130130-r-8fcce01b cloned data from the source volume):
      ip-10-0-2-93:/var/lib/longhorn/replicas/pvc-2052eefc-c0a4-4cee-b637-baf243130130-66327b7f # md5sum *
      4a4f668e07a97d86991410b53454a9f1  revision.counter
      f6918078796be0bb38f00d7add1e78ed  volume-head-001.img
      6ea5583486fb96ee359b0e34f5804688  volume-head-001.img.meta
      914b9e65cf0d573b77b2b3b995616fc6  volume-snap-a263ad02-fddc-42c0-ab67-31a3668beaff.img
      8d44e39875a262f59417050f281e73d5  volume-snap-a263ad02-fddc-42c0-ab67-31a3668beaff.img.meta
      73de95cadc95f40f759c2f5a62ef6d2e  volume.meta
      
  3. Could it be that the rebuilding/pruning process is not working correctly? To check this, I detached the target volume and exposed each replica as a Docker volume to see which replica has the correct data. It turned out that the 1st and the 2nd replicas have the same data as the source volume, but the third replica has the wrong data (mounting the exposed block device from each replica and running md5sum on the test data):

    # pvc-2052eefc-c0a4-4cee-b637-baf243130130-r-8fcce01b
    ip-10-0-2-11:/ # md5sum mnt/test 
    b2a2e36a43f9e334a07f85d75c3129da  mnt/test
    ---
    # pvc-2052eefc-c0a4-4cee-b637-baf243130130-r-17536a2e
    ip-10-0-2-153:/ # md5sum mnt/test 
    b2a2e36a43f9e334a07f85d75c3129da  mnt/test
    ---
    # pvc-2052eefc-c0a4-4cee-b637-baf243130130-r-833697c3
    ip-10-0-2-93:/ # md5sum mnt2/test 
    7511cb63ed25169d1b0afca54983a360  mnt2/test
    

    One interesting point is that the checksum of the test file in the 3rd replica, pvc-2052eefc-c0a4-4cee-b637-baf243130130-r-833697c3, is different from both the checksum of the test file when reading through the volume, i.e., from all three replicas (it is 88a468c5900c9d43a1609c8dc8b51f1c, as mentioned at the top of this comment), and the checksum of the source test file. This is because Longhorn alternates reads across the replicas, so when one replica has wrong data it corrupts reads of the whole volume even though the other replicas have correct data.

  4. Next, I adjusted the storage class so that the target volume only has 1 replica. This adjustment eliminated the rebuilding/pruning process. Running the test again, it passed 100 clones:

    OS     Disk Type   Target volume’s replica count   Res   Additional Info
    Sles   HDD         3                                     Cloning 2-16 times
    Sles   HDD         1                                     Cloning 100 times

    We now know that the problem is in the rebuilding/pruning steps

  5. Try to trigger rebuilding many times (a rough kubectl sketch of this loop follows the error output below)

    • Create a 500 MB volume with 1 replica
    • Create an ext4 filesystem on the volume
    • Mount the volume and create a 300 MB file in the volume
    • Scale the number of replicas to 2
    • Repeatedly delete one of the replicas to trigger rebuilding (4-5 times), making sure there is always one healthy replica at all times
    • Wait for the volume to have 2 healthy replicas again
    • Unmount and detach the volume (to avoid page-cache effects)
    • Attach and mount the volume again. Hit a filesystem corruption error:
      ip-10-0-2-11:/ # mount /dev/longhorn/testvol mnt/
      mount: /mnt: mount(2) system call failed: Structure needs cleaning.
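
A rough sketch of the delete-and-rebuild loop in step 5, driving Longhorn through its CRDs with kubectl (the same steps can also be done from the Longhorn UI). The volume name testvol, the longhornvolume replica label, and the status.robustness field used below are assumptions for illustration:

VOL=testvol
NS=longhorn-system

# Scale the volume from 1 to 2 replicas.
kubectl -n "$NS" patch volumes.longhorn.io "$VOL" --type merge \
  -p '{"spec":{"numberOfReplicas":2}}'

for i in $(seq 1 5); do
  # Wait until the volume reports "healthy" again so at least one healthy replica
  # exists at all times before we delete another one.
  until [ "$(kubectl -n "$NS" get volumes.longhorn.io "$VOL" \
      -o jsonpath='{.status.robustness}')" = "healthy" ]; do
    sleep 5
  done

  # Delete one replica to force a rebuild.
  REPLICA=$(kubectl -n "$NS" get replicas.longhorn.io -l longhornvolume="$VOL" \
    -o jsonpath='{.items[0].metadata.name}')
  kubectl -n "$NS" delete replicas.longhorn.io "$REPLICA"
done

# After the last rebuild finishes: umount and detach the volume, then re-attach and
# mount it again to check whether the filesystem (or the file checksums) got corrupted.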
      

Ongoing thoughts:

  1. Next, try rebuilding many times
  2. Walk through the rebuilding code path
  3. Try disabling the purge feature after rebuilding

Some suggestions for further testing and investigation:

  1. Use version v1.2.4 to avoid the potential impact of the snapshot pruning feature (which has been available since v1.2.5).
  2. Does Longhorn need to do a clone for the reproduction? As we discussed before, the direct cause is rebuilding.

@chriscchien @PhanLe1010 has another PR to improve the fix, and it needs further testing later on. Reopened this to continue.

@chriscchien For a backing image volume, we need to check the root partition inside the block device, as mentioned in steps 4 and 5:

4. Find the major and minor numbers of the block device with ls -l /dev/longhorn/<volume-name>
5. Find the corresponding block device under lsblk with the same major and minor numbers, then find the root partition of that block device

For example:

root@phan-v187-pool2-d9ab6ae2-4cwtp:/# ls /dev/longhorn/testvol-bi -l
brw-rw---- 1 root root 8, 0 Sep 24 01:47 /dev/longhorn/testvol-bi
root@phan-v187-pool2-d9ab6ae2-4cwtp:/# lsblk | grep "8:0" -A 4
sda       8:0    0   10G  0 disk 
`-sda1    8:1    0    8G  0 part /mnt
sdb       8:16   0   10G  0 disk 
`-sdb1    8:17   0    8G  0 part 
vda     252:0    0   80G  0 disk 
root@phan-v187-pool2-d9ab6ae2-4cwtp:/# xfs_info /dev/sda1 
meta-data=/dev/sda1              isize=512    agcount=4, agsize=524224 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0, sparse=0, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=2096896, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Great job on the investigation and testing so far. Below are some ideas to help narrow down the root cause.

  • check if this is a regression by testing with a prior version
  • check the delta between the 2 head files:
    • off by one io/extent (either during the merge of the snapshot files or a read-modify-write)?
    • continuously wrong over time?
    • byte-level error vs extent/io error?
    • did the error happen at the beginning, shortly after the rebuild, or at the end while flushing?
  • since the revision counters are identical, one can assume that replica.Write was called an equal number of times
  • try to reproduce this with a raw block device, then write a known data pattern over it; this allows you to identify the exact point of failure via the diff (see the sketch after this list)
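
To make the head-file delta and raw-block-device ideas concrete, here is a rough sketch (the device path, sizes, and 4 KiB block size are assumptions for illustration): write a known pattern to the raw Longhorn block device, read it back after the rebuild, and let cmp report the exact offsets that differ.

DEV=/dev/longhorn/testvol   # raw Longhorn block device, no filesystem on top
SIZE_MB=300

# Generate a known data pattern and keep a local copy for comparison.
dd if=/dev/urandom of=pattern.bin bs=1M count="$SIZE_MB"

# Write the pattern straight to the block device, bypassing the page cache.
dd if=pattern.bin of="$DEV" bs=4K oflag=direct conv=fsync

# (trigger the replica delete/rebuild cycle here, then detach and re-attach the volume)

# Read the data back the same way.
dd if="$DEV" of=readback.bin bs=4K iflag=direct count=$((SIZE_MB * 256))

# cmp -l prints every differing byte offset (1-based); mapping offsets to 4 KiB blocks
# shows whether the corruption is a few stray bytes or whole extents, and where it sits.
cmp -l pattern.bin readback.bin | awk '{ print int(($1 - 1) / 4096) }' | uniq -c | head

# The same byte-level comparison can be run directly between two replicas'
# volume-head files to inspect the delta mentioned in the first idea, e.g.:
#   cmp -l <replica-A-dir>/volume-head-XXX.img <replica-B-dir>/volume-head-YYY.img | head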

@innobead @derekbit @shuo-wu this is the issue I was thinking about in regard to Derek's snapshot data mismatch issue. This might potentially be related if it's indeed true that different replicas' snapshots can contain different data at a point in time T. REF: https://github.com/longhorn/longhorn/issues/4513