longhorn: [BUG] Backing Image Data Inconsistency if it's Exported from a Backing Image Backed Volume
Describe the bug (🐛 if you encounter this issue)
A volume encounters a data integrity issue if it is created from a backing image, and that backing image was exported from a volume that is itself backed by another backing image.
To Reproduce
- Preparing a Harvester v1.2.0-head cluster (for using volume as an OS root disk)
- Uploading Ubuntu jammy cloud image (latest version), img1
- Creating volume vol1 from img1
- Creating a VM with vol1 as the root disk, and writing some data
- Issuing sync on the filesystem, e.g. `$ sudo sync`
- Stopping the VM, and waiting until vol1 is completely detached
- Exporting vol1 as img2
- Creating vol2 from img2
- Comparing data content between vol1 and vol2, the data is inconsistent
- Running `e2fsck -v` on vol2; fsck will report filesystem errors and require data repair
Expected behavior
The data of vol1 and vol2 should be consistent.
Support bundle for troubleshooting
supportbundle_066aa8ed-a12b-4ccc-971c-59404ee44cd4_2023-10-16T07-46-44Z.zip
Environment
- Longhorn version: v1.4.3
- Harvester version: v1.2.0-head
- Number of Longhorn volumes in the cluster: 2
- Impacted Longhorn resources:
- Volume names: pvc-b8bd9013-e8e9-449d-9c25-6c6c0824d0b3, pvc-de64dd95-b0c1-4ce9-864a-88d38e06d9d3
Additional context
- If vol1 is not backed by a backing image, no data inconsistency is found between vol1 and vol2
- A single node Harvester cluster should be enough to reproduce this issue
I think I found the bug. When reading [2MB-4MB] from the volume:
https://github.com/longhorn/longhorn-engine/blob/deb8b18a1558ddf4f359146b289d5469bef7cc6d/pkg/replica/diff_disk.go#L202-L241
Here, when we try to read the data into `buf` in this range, I wonder: do we really need to cut it with `count`? Or is the alignment here correct? Or maybe we should still return the buf length as `count` when reading from the expanded part, as if we are reading a bunch of "0" instead of a count of 0.
cc @shuo-wu @PhanLe1010 @WebberHuang1118
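For illustration, here is a minimal, self-contained Go sketch (not the actual diff_disk.go code; `layerReadAt` is a made-up helper) of the pattern in question: the buffer for the requested interval is zero-filled up front, a region past a layer's boundary contributes a count of 0, and truncating the result with that count drops the zero data the caller still expects.

```go
package main

import "fmt"

// layerReadAt simulates reading from a layer (e.g. a backing image) that is
// smaller than the volume: offsets past the layer's boundary return a count
// of 0, while the caller's buffer stays zero-filled for that region.
func layerReadAt(buf []byte, offset, layerSize int64) int {
	if offset >= layerSize {
		return 0 // outside the layer: nothing "read", buf stays zero
	}
	n := layerSize - offset
	if n > int64(len(buf)) {
		n = int64(len(buf))
	}
	// pretend the layer holds non-zero data inside its boundary
	for i := int64(0); i < n; i++ {
		buf[i] = 0xAB
	}
	return int(n)
}

func main() {
	const sector = 4096
	layerSize := int64(3 * sector) // a 12KiB "backing image"
	offset := int64(2 * sector)    // read a 8KiB interval starting at 8KiB

	data := make([]byte, 2*sector)
	n := layerReadAt(data, offset, layerSize)

	// Returning data[:n] silently shrinks the interval to 4KiB; returning
	// data keeps the full 8KiB the caller asked for, tail correctly zero.
	fmt.Printf("count=%d, len(data[:n])=%d, len(data)=%d\n", n, len(data[:n]), len(data))
}
```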
I found a method to reproduce a similar case without Harvester-specific operations (no VM operation, only creating PVC from backing image and I/O access)
Detailed steps:
$ cmp -l os-vol copy-vol
$ e2fsck -c /dev/sda1
(assume PVC copy-vol is mapped to /dev/sda)
TEST (by directly returning `data` instead of `data[:n]`) - PASSED
The volumes are now consistent, even when testing with the original test case
Summary
I think this is the correct fix. When sending a batch with its interval, we allocate the buf with the batch size (interval size), then send the data together with the interval information to the target volume so it can write the data into that interval. We should not cut the data with the count we actually read from the volume, because the count might be 0 in the case of volume expansion or a BackingImage. And I don't think the count number matters in this case.
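A rough sketch of that idea, with made-up helper names (`sendBatch`, `readAt`, `send`) rather than longhorn-engine's real API:

```go
package main

import "fmt"

// sendBatch allocates the buffer from the interval size and sends the whole
// buffer, never buf[:count], because count can legitimately be 0 for regions
// beyond the backing image or an expanded volume, yet the interval still
// needs to be written (as zeros) on the target.
func sendBatch(readAt func(buf []byte, off int64) (int, error),
	send func(off int64, buf []byte) error,
	intervalOff, intervalSize int64) error {

	buf := make([]byte, intervalSize)
	if _, err := readAt(buf, intervalOff); err != nil {
		return err
	}
	return send(intervalOff, buf) // send the full interval, zeros included
}

func main() {
	readAt := func(buf []byte, off int64) (int, error) { return 0, nil } // nothing inside the layer
	send := func(off int64, buf []byte) error {
		fmt.Printf("writing %d bytes at offset %d\n", len(buf), off)
		return nil
	}
	_ = sendBatch(readAt, send, 2<<20, 2<<20) // the [2MB, 4MB] interval
}
```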
Hi @PhanLe1010 I think you are right
Backing Image Export Test (using empty raw image with larger volume size) - FAILED
Test Result
`test-empty-export` is empty, and the comparison result shows that it differs from the 3MB offset
cc @shuo-wu
Hi @PhanLe1010, thanks for the advice! Here are the testing results
BackingImage Export (using raw image) - FAILED
Using `fdisk` to recreate the partition and `resize2fs` to resize the filesystem. The two file sizes are slightly different.
Attaching the two volumes to the host, the cksum results are not the same.
The BackingImage exported with the volume has no filesystem corruption issue.
The data written to the device is the same.
Conclusion
cc @shuo-wu
This example assumes that the logical (virtual) size of the backing image is equal to the logical size of the snapshot:
In reality, they are not:
Revisiting 2 cases:
I am wondering if it is because of the following:
In SyncContent() and GetDataLayout(), we actually check whether a sector is Data or Hole by checking whether its DiskIndex is 0 or non-0. If it is Data, we read the data of the interval from the replica and send it to the target. If it is Hole, we simply ask the target to punch a hole.
So in Case 1, we still punch holes for sectors 2 and 3. But in Case 2, since we init the index map to all "1", every sector is Data and we don't punch any holes. Then there might be some sectors that should be Hole, because they don't have data in the BackingImage either, but we fill them with 0 instead of punching a hole?
GetDataLayout()
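For clarity, a simplified sketch of the Data/Hole classification described above (illustrative only, not the real GetDataLayout()/SyncContent() code), using the 0 = hole / non-zero = data convention from this thread:

```go
package main

import "fmt"

// classify marks a sector as Hole when its disk index is 0 (the target
// punches a hole) and as Data otherwise (the interval is read from the
// replica and sent to the target).
func classify(location []byte) []string {
	out := make([]string, len(location))
	for i, idx := range location {
		if idx == 0 {
			out[i] = "Hole"
		} else {
			out[i] = "Data"
		}
	}
	return out
}

func main() {
	// Case 1: index map built from the snapshot chain, holes stay 0.
	fmt.Println(classify([]byte{2, 0, 0, 2})) // [Data Hole Hole Data]
	// Case 2: index map initialized to all 1 (backing image), so nothing is
	// ever treated as a hole, even sectors with no data in the BackingImage.
	fmt.Println(classify([]byte{2, 1, 1, 2})) // [Data Data Data Data]
}
```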
Note: if we don't recreate the partition, the exported BackingImage is consistent... but why? Test case:
cc @shuo-wu @PhanLe1010
Some observations with the following steps:
- img1 (bi default-image-s9rfj), which is 640MB
- volume pvc-c3b1a434-86ee-453b-afc7-a52fe113c059 from img1, with claim size 10GB
- bi default-image-j8ghr, with replica number 1
- 523MB
- 2507MB
- `$ cmp -l os-vol copy-vol`

Thanks
After discussing with @ChanYiLin, we agree that Longhorn should return the length the caller wants to read if there is no error (EOF).
Sure, no problem, thanks for your help! I think we are pretty close to the answer
Awesome. It means somehow the region outside of the backing image boundary (3MB offset to 4MB offset region) in the backing image layer is overwriting the snapshot layer.
Maybe the next step would be adding some logs to the preload() function to see if we can identify the faulty logic (why it chooses the backing image layer instead of the snapshot layer for the region 3MB-4MB)?
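If it helps, a hedged sketch of that kind of logging, assuming the location map is a slice of per-sector disk indices as seen later in this thread (`r.volume.location`); the actual preload() internals may differ:

```go
package main

import "log"

// dumpLocation is an illustrative helper (not part of longhorn-engine) that
// logs which layer each sector would be served from after preload(), assuming
// the convention used in this thread: 0 = hole, 1 = backing image,
// >1 = a snapshot/live disk in the chain.
func dumpLocation(location []byte, sectorSize int64) {
	for i, idx := range location {
		off := int64(i) * sectorSize
		switch {
		case idx == 0:
			log.Printf("sector %d (offset %d): hole", i, off)
		case idx == 1:
			log.Printf("sector %d (offset %d): backing image", i, off)
		default:
			log.Printf("sector %d (offset %d): disk index %d", i, off, idx)
		}
	}
}

func main() {
	// The 3MB-4MB region wrongly resolving to the backing image would show
	// up as "backing image" entries where snapshot data is expected.
	dumpLocation([]byte{2, 1, 1, 2}, 4096)
}
```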
Yes, that is what I observed here.
And the only difference between these two is preload()
I can reproduce the issue, will look into how we export BackingImage
cc @shuo-wu
Verified pass on longhorn master (longhorn-engine `ac4748`) and longhorn v1.6.x (longhorn-engine `0b553a`) with the test steps.
I could reproduce this issue before the fix; after the fix, the data of the two volumes is identical.
From my perspective, I think we should still return `len(data)` for `diffDisk.fullReadAt`. The reason is that, from the caller's perspective, it is reading from the "Volume", so it has no knowledge about the snapshot chain underneath.
If it reads outside of the boundary of the "Volume", we do return an `io.ErrUnexpectedEOF` error here. But if it only reads outside of the boundary of a specific snapshot image, and we allow it to do so, then we should consider it as reading a bunch of "0", and that is also true: the caller does get a bunch of "0". The caller should not have to take care of this situation; it might get confused that it doesn't get an EOF error and yet the data length is not correct.
@PhanLe1010 you can search for `[DEBUG in syncDataInterval]: Sending batch: [ 512: 1024](512), dataBuffer:` in the file; that is the data sent from the source volume to the target volume. Since the sending batch size is 2MB and the sector size is 4KB, [ 512: 1024] means [2MB: 4MB].
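A quick sanity check of that conversion (assuming 4KB sectors, as stated):

```go
package main

import "fmt"

func main() {
	const sectorSize = 4096 // 4KB sectors
	start, end := int64(512), int64(1024)
	// sectors [512, 1024) * 4KB = bytes [2MB, 4MB)
	fmt.Printf("[%d, %d) bytes = [%dMB, %dMB)\n",
		start*sectorSize, end*sectorSize, start*sectorSize>>20, end*sectorSize>>20)
}
```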
The sending batch size is 2MB, so for a 16MB volume it only sends 8 batches: [0, 2MB], [2MB, 4MB], ... I printed the read data out and found that even though it was reading the [2MB, 4MB] position from the replicas, the data was all 0. However, I also printed out the disk index being read during `ReadAt()`, and it was reading from the correct disk. Need to investigate more.
Here is the log: lhim2_new2.txt
Sorry @ChanYiLin. I will have to continue tomorrow.
Hi @PhanLe1010, I intentionally printed the r.volume.location after the preload and also printed the segments sent during the export. Here are the results.
As you can see, the preload location is, I think, correct; there is a series of 2s, which means the data comes from the snapshot.
But when exporting, it only sent one segment [D 0 4096] and then finished. Misleading here: the 4096 means 4096 blocks, each block being 4096 bytes (4096 × 4096 bytes = 16MB). It is correct; it is sending 16MB in one interval because all blocks are Data.
Note: I added the log message here; this function will be executed by multiple goroutines in parallel when sending segments. lhim_log.txt
Note: `r.Volume.Location`
Hi @PhanLe1010, thanks for the idea. But we have confirmed that with a qcow2 image, if we directly mount the filesystem without recreating the partition, the exported BackingImage will be consistent. Here is the test case
I have also tested that
This is also an automation test case in Longhorn
The question is, in case 2, `location = [2, 1, 1, 2]` means [reading data in snap1, reading data in BI, reading data in BI, reading data in snap1]. And reading data from the BI should get the same result, which is a bunch of zeros. In other words, the final data set should be identical...
Do the volume clone test and the failed backing image test mean that the new partitioning info can be cloned correctly via `VolumeClone` but cannot be exported via `ExportingBackingImage`?
And we also found that
So maybe the partition information gets corrupted when exporting; that information should be in the new snapshot image and should overwrite the base BackingImage during the export.