longhorn: [BUG] data corruption due to COW and block size not being aligned during rebuilding replicas
Describe the bug
Data in a cloned volume sometimes has a checksum that does not match the data in the source volume
To Reproduce
Clone a volume 100 times. The problem shows up only occasionally. Below are the steps to clone a volume 100 times
- Deploy this test pod, which contains a function that repeatedly clones a volume 100 times and compares the checksum of the data
- Exec into the test pod and run
pytest -svvl test_csi_snapshotter.py::test_csi_snapshot_snap_create_volume_from_snapshot_2
- See error
test_csi_snapshotter.py::test_csi_snapshot_snap_create_volume_from_snapshot_2
run number: 0
run number: 1
run number: 2
run number: 3
run number: 4
run number: 5
run number: 6
run number: 7
run number: 8
run number: 9
run number: 10
run number: 11
run number: 12
run number: 13
run number: 14
run number: 15
FAILED
>       assert expected_md5sum == created_md5sum
E       AssertionError: assert '4f4451eb7ed3915e9e7cecabdcb84b90\n' == 'e281b2d2b6de757c34bf5460de97ebb7\n'
E         - 4f4451eb7ed3915e9e7cecabdcb84b90
E         + e281b2d2b6de757c34bf5460de97ebb7
Expected behavior
Checksum should always be the same
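The check the test performs can be sketched roughly as follows. This is a minimal illustration, not the code from test_csi_snapshotter.py: `md5sum`, `first_mismatch`, and the `make_clone` callable are hypothetical names introduced here, and the clone step is assumed to return a path to the cloned data.

```python
import hashlib


def md5sum(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def first_mismatch(source_path, make_clone, runs=100):
    """Clone up to `runs` times and return the first run whose checksum differs.

    `make_clone` is a hypothetical callable: it takes the run number and
    returns the path of the cloned data. Returns None if every run matches.
    """
    expected = md5sum(source_path)
    for run in range(runs):
        if md5sum(make_clone(run)) != expected:
            return run
    return None
```

Returning the run number instead of asserting inside the loop makes it easy to pause at the first failing clone and collect a support bundle, which is how the mismatch was captured in this issue.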
Environment
Env was created by using this terraform script with customized values:
use_hdd = true
arch = "amd64"
lh_aws_instance_type_controlplane = "t2.xlarge"
lh_aws_instance_type_worker = "t2.xlarge"
- Longhorn version: master-head Aug 02, 2022
About this issue
- State: closed
- Created 2 years ago
- Comments: 15 (14 by maintainers)
Test case status
I paused the test when I hit the checksum mismatch. Here is the support bundle longhorn-support-bundle_9b928743-a394-4beb-8260-888016cb4427_2022-09-13T21-52-37Z.zip. The test run output:
The source volume is: longhorn-testvol-p37e1k. The target volume is: pvc-2052eefc-c0a4-4cee-b637-baf243130130
Investigating steps
Checked the support bundle logs for any error related to the clone: there is none. Events and logs from the target volume look normal
Checked the checksums of the snapshot files of the replicas of the target volume. I do see a checksum mismatch of volume-head between the replicas:
- pvc-2052eefc-c0a4-4cee-b637-baf243130130-r-8fcce01b of the target volume (i.e., this replica gets its data cloned from the source volume):
- pvc-2052eefc-c0a4-4cee-b637-baf243130130-r-17536a2e (this replica gets rebuilt after the first replica pvc-2052eefc-c0a4-4cee-b637-baf243130130-r-8fcce01b cloned data from the source volume):
- pvc-2052eefc-c0a4-4cee-b637-baf243130130-r-833697c3 (this replica gets rebuilt after the first replica pvc-2052eefc-c0a4-4cee-b637-baf243130130-r-8fcce01b cloned data from the source volume):
Could it be that the rebuilding/pruning process is not working correctly? To check this, I detached the target volume and exposed each replica as a docker volume to see which replica has the correct data. It turned out that the 1st and the 2nd replicas have the same data as the source volume but the third replica has the wrong data (mounting the exposed block device from each replica and running md5sum on the testdata):
One interesting point is that the checksum of testfile in the 3rd replica, pvc-2052eefc-c0a4-4cee-b637-baf243130130-r-833697c3, is different from the checksum of the testfile when reading from all three replicas (it is 88a468c5900c9d43a1609c8dc8b51f1c, as mentioned at the top of this comment) as well as from the checksum of the source testfile. This is because Longhorn reads data alternately from different replicas, so when one replica has wrong data, it corrupts reads from the whole volume even though the other replicas have correct data.
Next, I adjusted the storage class so that the target volume has only 1 replica. This adjustment eliminated the rebuilding/pruning process. Running the test again, it passed 100 clones:
We now know that the problem is in the rebuilding/pruning steps
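The earlier observation, that reading through all three replicas gives a checksum matching neither the source nor the bad replica, can be reproduced in a toy model. The sketch below assumes strict per-block round-robin reads and a 4-byte block size purely for illustration. Longhorn's actual read scheduling and block size are not described in this issue, and `read_round_robin` is a hypothetical helper.

```python
import hashlib

BLOCK = 4  # toy block size in bytes, chosen only for illustration


def read_round_robin(replicas, block=BLOCK):
    """Assemble a volume image by taking block i from replica i % len(replicas)."""
    size = len(replicas[0])
    out = bytearray()
    for i, offset in enumerate(range(0, size, block)):
        out += replicas[i % len(replicas)][offset:offset + block]
    return bytes(out)


def md5(data):
    """Return the hex MD5 digest of a bytes object."""
    return hashlib.md5(data).hexdigest()


# Six blocks of "correct" data, standing in for the source volume content.
good = bytes(range(24))

# A replica with two corrupted blocks (blocks 2 and 3).
corrupted = bytearray(good)
corrupted[8:16] = b"\xff" * 8
bad = bytes(corrupted)

# Two healthy replicas and one bad one, as in the investigation above.
mixed = read_round_robin([good, good, bad])

# Block 2 is served by the bad replica, block 3 by a healthy one, so the
# assembled image differs from both the source and the bad replica.
print(md5(good), md5(bad), md5(mixed))
```

In this model the three digests are all different, which is consistent with a volume-level checksum that matches neither the source testfile nor the testfile read from the third replica alone.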
Try to rebuild many times
filefile in the volume
Ongoing thoughts:
Some suggestions for further testing and investigation:
cc @shuo-wu @derekbit @joshimoo
@chriscchien @PhanLe1010 has another PR to improve the fix and needs further testing later on. Reopened this to continue.
@chriscchien For a backing image volume, we need to check the root partition inside the block device, as mentioned in steps 4 and 5:
For example:
Great job on the investigation and testing so far. Below are some ideas to help narrow down the root cause.
@innobead @derekbit @shuo-wu this is the issue I was thinking about in regard to Derek's snapshot data mismatch issue. It might be related if it is indeed true that different replicas' snapshots can contain different data at a point in time T. REF: https://github.com/longhorn/longhorn/issues/4513