longhorn: [BUG] Backing Image Data Inconsistency if it's Exported from a Backing Image Backed Volume

Describe the bug (🐛 if you encounter this issue)

A volume hits a data integrity issue if it is created from a backing image, and that backing image was exported from a volume that is itself backed by another backing image.

To Reproduce

  1. Preparing a Harvester v1.2.0-head cluster (for using volume as an OS root disk)
  2. Uploading Ubuntu jammy cloud image (latest version), img1
  3. Creating volume vol1 from img1
  4. Creating a VM with vol1 as the root disk, and writing some data
  5. Issuing sync on the filesystem, e.g. $ sudo sync
  6. Stopping VM, and waiting until vol1 is completely detached
  7. Exporting vol1 as img2
  8. Creating vol2 from img2
  9. Comparing the data content of vol1 and vol2: the data is inconsistent
  10. Running e2fsck -v on vol2: fsck reports filesystem errors and requires data repair

Expected behavior

The data of vol1 and vol2 should be consistent.

Support bundle for troubleshooting

supportbundle_066aa8ed-a12b-4ccc-971c-59404ee44cd4_2023-10-16T07-46-44Z.zip

Environment

  • Longhorn version: v1.4.3
  • Harvester version: v1.2.0-head
  • Number of Longhorn volumes in the cluster: 2
  • Impacted Longhorn resources:
    • Volume names: pvc-b8bd9013-e8e9-449d-9c25-6c6c0824d0b3, pvc-de64dd95-b0c1-4ce9-864a-88d38e06d9d3

Additional context

  • If vol1 is not backed by a backing image, no data inconsistency is found between vol1 and vol2
  • A single node Harvester cluster should be enough to reproduce this issue

About this issue

  • Original URL
  • State: closed
  • Created 8 months ago
  • Reactions: 2
  • Comments: 45 (41 by maintainers)

Most upvoted comments

I think I found the bug. When reading [2MB-4MB] from the Volume:

[2MB-3MB] is BackingImage
[3MB-4MB] is snap1

https://github.com/longhorn/longhorn-engine/blob/deb8b18a1558ddf4f359146b289d5469bef7cc6d/pkg/replica/diff_disk.go#L202-L241 This is where we try to read the data:

  • If the target disk is out of bounds, it can be because of Volume Expansion or the BackingImage.
    • In this case the preload location indicated that [2MB-3MB] belongs to the BackingImage, so we tried to read from it.
  • Then we directly returned a count of 0, because we cannot read any data from the BackingImage into the buf for this range:
	// May read the expanded part
	if offset >= size {
		return 0, nil
	}
  • For [3MB-4MB], the target disk was correct; we read the data from snap1 into the buf, and the count is 1MB.
  • But we then cut the returned buf to count after reading from the file here. So the returned data drops what we read from snap1 and keeps only the first half, which is all “0”:
func ReadDataInterval(rw ReaderWriterAt, dataInterval Interval) ([]byte, error) {
	data := AllocateAligned(int(dataInterval.Len()))
	n, err := rw.ReadAt(data, dataInterval.Begin)
	if err != nil {
		if err == io.EOF {
			log.Debugf("Have read at the end of file, total read: %d", n)
		} else {
			log.WithError(err).Errorf("Failed to read interval %+v", dataInterval)
			return nil, errors.Wrapf(err, "failed to read interval %+v", dataInterval)
		}
	}
	return data[:n], nil
}

I wonder whether we really need to cut it to count.
Or is the alignment here correct? Or maybe we should still return the full buf length as the count when reading from the expanded part, as if we were reading a bunch of “0”, instead of returning a count of 0.
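
To make the effect concrete, here is a tiny, self-contained Go simulation (not longhorn code; all names are made up) of a 2MB batch whose first half falls beyond the backing image and whose second half is actually read from snap1:

package main

import "fmt"

const mb = 1 << 20

// fakeVolume mimics fullReadAt for such an interval: only the second half
// (the snap1 part) is filled in, and the returned count covers just those bytes.
type fakeVolume struct{}

func (fakeVolume) ReadAt(buf []byte, off int64) (int, error) {
	for i := mb; i < len(buf); i++ {
		buf[i] = 0xab // pretend this is real data read from snap1
	}
	return mb, nil // count only reflects what was actually read from a file
}

func main() {
	data := make([]byte, 2*mb) // the 2MB batch buffer
	n, _ := fakeVolume{}.ReadAt(data, 2*mb)

	// The data[:n] truncation from ReadDataInterval keeps only the zeroed
	// first half and silently drops the snap1 bytes sitting at data[n:].
	sent := data[:n]
	fmt.Printf("count=%d, sent ends with %#x, dropped byte is %#x\n",
		n, sent[len(sent)-1], data[len(data)-1])
	// prints: count=1048576, sent ends with 0x0, dropped byte is 0xab
}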

cc @shuo-wu @PhanLe1010 @WebberHuang1118

I found a method to reproduce a similar case without Harvester-specific operations (no VM operations, only creating a PVC from a backing image and doing I/O).

Detailed steps:

  • Uploading Ubuntu jammy cloud image (latest version), img1
  • Creating PVC os-vol from img1 with a claim size of 10GB and a replica count of 1
  • Applying the following yaml to mount os-vol as block mode in the pod
    apiVersion: v1
    kind: Pod
    metadata:
      name: block-volume-test
      namespace: default
    spec:
      containers:
      - name: volume-test
        image: ubuntu
        imagePullPolicy: IfNotPresent
        securityContext:
          privileged: true
        command: ["/bin/sleep"]
        args: ["3600"]
        volumeDevices:
        - devicePath: /dev/os-vol
          name: os-vol
      volumes:
        - name: os-vol
          persistentVolumeClaim:
            claimName: os-vol
    
  • Issuing filesystem expand and I/O on os-vol
    • Enter pod block-volume-test
    • Install fdisk
      • $ apt update; apt install -y fdisk
    • Expanding filesystem on os-vol
      • Check the device (/dev/sda for example) mapped to os-vol by device number
      • $ fdisk /dev/sda
        • d 1 n 1 default default No w
      • $ resize2fs -fp /dev/sda1
    • Mounting /dev/sda1 and writing I/O into it
      • $ mkdir /data; $ mount -t ext4 /dev/sda1 /data
      • $ dd if=/dev/zero of=/data/file bs=1M count=2048
    • Flushing data to storage for os-vol
      • $ umount /data
    • Delete pod block-volume-test, and wait until os-vol’s Longhorn volume detaches
  • Exporting os-vol as image os-vol.img
  • Creating PVC copy-vol from os-vol.img with claim size 10GB
  • Comparing data content between os-vol and copy-vol with $ cmp -l os-vol copy-vol
    • The data content is different between the two PVCs
  • Finding filesystem errors on copy-vol with $ e2fsck -c /dev/sda1 (assuming PVC copy-vol is mapped to /dev/sda)
    • There is no filesystem error on os-vol

PS:

  • IMHO, it may be related to resize2fs, since it involves more complicated I/O patterns, including discards to image block ranges.

TEST (By directly returning data instead of data[:n]) - PASSED

The volumes are now consistent, even when testing with the original test case:

  • cksum are the same
  • filesystem is not corrupted

Summary

I think this is the correct fix. When sending a batch for an interval, we allocate the buf with the batch size (the interval size), then send the data along with the interval information to the target volume so it writes the data into that interval. We should not cut the data to the count we actually read from the volume, because the count might be 0 in the case of volume expansion or a BackingImage. And I don’t think the count matters in this case.
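
As a sketch, the change that was tested amounts to keeping the full interval-sized buffer instead of truncating it to the read count, reusing the names from the function quoted above (the actual upstream patch may differ in details):

func ReadDataInterval(rw ReaderWriterAt, dataInterval Interval) ([]byte, error) {
	data := AllocateAligned(int(dataInterval.Len()))
	n, err := rw.ReadAt(data, dataInterval.Begin)
	if err != nil {
		if err == io.EOF {
			log.Debugf("Have read at the end of file, total read: %d", n)
		} else {
			log.WithError(err).Errorf("Failed to read interval %+v", dataInterval)
			return nil, errors.Wrapf(err, "failed to read interval %+v", dataInterval)
		}
	}
	// Return the whole interval-sized buffer; regions that could not be read
	// (expanded part / backing image) stay zero-filled instead of being cut off.
	return data, nil // was: return data[:n], nil
}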

Hi @PhanLe1010, I think you are right.

Backing Image Export Test (using empty raw image with larger volume size) - FAILED

  • Create an empty raw image and upload as a BackingImage empty-raw
    • $ truncate -s 2M empty-raw
  • Create a 16MB Volume test-empty with the raw image
  • Attach to the host
  • dd 1MB of data starting at the 3MB offset
    • $ dd of=/dev/longhorn/test-empty if=/dev/urandom bs=1M count=1 oflag=seek_bytes seek=3145728
  • Export the Volume to another BackingImage empty-raw-export
  • Create a 16MB Volume test-empty-export with the BackingImage empty-raw-export
  • Attach to the host

Test Result

test-empty-export is empty, and the comparison shows it differs starting at the 3MB offset (byte 3145729 = 3 × 1024 × 1024 + 1, since cmp counts from 1):

brw-rw----  1 root root 8, 48 Jan 11 05:35 test-empty
brw-rw----  1 root root 8, 64 Jan 11 05:36 test-empty-export
root@kworker1:/dev/longhorn# cat test-empty-export
root@kworker1:/dev/longhorn# cmp test-empty test-empty-export
test-empty test-empty-export differ: byte 3145729, line 1
root@kworker1:/dev/longhorn# 

cc @shuo-wu

Hi @PhanLe1010, thanks for the advice! Here are the testing results.

BackingImage Export (use raw image) - FAILED

  1. Create a raw image (I converted the jammy image from qcow2 to raw) and a Volume with it
  2. Mount to the Pod
  3. Use fdisk to recreate the partition and resize2fs to resize the filesystem
  4. Mount the device and write I/O into it
    • $ mkdir /data; $ mount -t ext4 /dev/sdb1 /data
    • $ dd if=/dev/zero of=/data/file bs=1M count=2048
    • $ sync
  5. Export the Volume to another BackingImage
  6. Create a Volume with the BackingImage

The two file sizes are slightly different:

root@kworker2:/var/lib/longhorn# du -sm replicas/pvc-b334ec5f-ab8f-4285-bffe-743b542ec494-9a715c65/*
1	replicas/pvc-b334ec5f-ab8f-4285-bffe-743b542ec494-9a715c65/revision.counter
0	replicas/pvc-b334ec5f-ab8f-4285-bffe-743b542ec494-9a715c65/volume-head-001.img
1	replicas/pvc-b334ec5f-ab8f-4285-bffe-743b542ec494-9a715c65/volume-head-001.img.meta
2141	replicas/pvc-b334ec5f-ab8f-4285-bffe-743b542ec494-9a715c65/volume-snap-test-export-806bbe7b.img
1	replicas/pvc-b334ec5f-ab8f-4285-bffe-743b542ec494-9a715c65/volume-snap-test-export-806bbe7b.img.meta
1	replicas/pvc-b334ec5f-ab8f-4285-bffe-743b542ec494-9a715c65/volume.meta

root@kworker2:/var/lib/longhorn# du -sm backing-images/test-raw-61fde503/*
1517	backing-images/test-raw-61fde503/backing
1	backing-images/test-raw-61fde503/backing.cfg

root@kworker2:/var/lib/longhorn# du -sm backing-images/test-export-c504be3c/*
3868	backing-images/test-export-c504be3c/backing
1	backing-images/test-export-c504be3c/backing.cfg

Attaching the two volumes to the host, the cksums are not the same:

root@kworker2:/dev/longhorn# ll
total 0
drwxr-xr-x  2 root root    80 Jan 11 03:24 ./
drwxr-xr-x 20 root root  4260 Jan 11 03:24 ../
brw-rw----  1 root root 8, 32 Jan 11 03:24 pvc-6a0211dd-09d8-4bb7-a95e-d054988dd958
brw-rw----  1 root root 8, 16 Jan 11 03:17 pvc-b334ec5f-ab8f-4285-bffe-743b542ec494
root@kworker2:/dev/longhorn# cksum pvc-b334ec5f-ab8f-4285-bffe-743b542ec494 
969496369 5368709120 pvc-b334ec5f-ab8f-4285-bffe-743b542ec494
root@kworker2:/dev/longhorn# cksum pvc-6a0211dd-09d8-4bb7-a95e-d054988dd958
2208768380 5368709120 pvc-6a0211dd-09d8-4bb7-a95e-d054988dd958

The volume created with the exported BackingImage has no filesystem corruption issue:

root@block-volume-test:/# ls -al /dev/os-export 
brw-rw---- 1 root root 8, 32 Jan 11 03:35 /dev/os-export
root@block-volume-test:/# ls -al /dev/sdc       
brw-rw---- 1 root disk 8, 32 Jan 11 03:35 /dev/sdc
root@block-volume-test:/# e2fsck -c /dev/sdc1
e2fsck 1.46.5 (30-Dec-2021)
Checking for bad blocks (read-only test): done                                                 
cloudimg-rootfs: Updating bad block inode.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

cloudimg-rootfs: ***** FILE SYSTEM WAS MODIFIED *****
cloudimg-rootfs: 73666/645120 files (0.0% non-contiguous), 949441/1282299 blocks

The data written into the device is the same:

root@block-volume-test:/# mkdir /data
root@block-volume-test:/# mkdir /data-export
root@block-volume-test:/# mount -t ext4 /dev/sdb1 /data
root@block-volume-test:/# mount -t ext4 /dev/sdc1 /data-export
root@block-volume-test:/# cksum /data/file
2532515601 2147483648 /data/file
root@block-volume-test:/# cksum /data-export/file
2532515601 2147483648 /data-export/file

Conclusion

  • Using a raw image, the two volumes are still not consistent, and the sizes are different but nearly the same.
  • The exported BackingImage is slightly larger
  • The filesystem in this case is not corrupted

cc @shuo-wu

This example assumes that the logical (virtual) size of the backing image is equal to the logical size of the snapshot:

assuming snapshot chain as follows
0. nil
1. BackingImage [1, 1, 0, 0]
2. Snap1 [1, 0, 0, 1]


Case 1 - Volume Clone: Export without BackingImage
first init to all 0, location = [0, 0, 0, 0]
r.volumes.preload(), location:
i = 0, nil                          skip => location = [0, 0, 0, 0]
i = 1, BackingImage [1, 1, 0, 0],   skip => location = [0, 0, 0, 0]
i = 2, Snap1        [1, 0, 0, 1]         => location = [2, 0, 0, 2]

ExportVolume and GetDataLayout=> [2, 0, 0, 2] which is [Data, Hole, Hole, Data]

Case 2 - BackingImage Export: Export with BackingImage
first init to all 1, location = [1, 1, 1, 1]
r.volumes.preload(), location
i = 0, nil                          skip => location = [1, 1, 1, 1]
i = 1, BackingImage [1, 1, 0, 0],   skip => location = [1, 1, 1, 1]
i = 2, Snap1        [1, 0, 0, 1]         => location = [2, 1, 1, 2]

ExportVolume and GetDataLayout => [2, 1, 1, 2] which is [Data, Data, Data, Data]

In reality, they are not:

  • The Ubuntu image has a 2.2 GiB logical size:
    ➜  Downloads qemu-img info jammy-server-cloudimg-amd64.img 
    image: jammy-server-cloudimg-amd64.img
    file format: qcow2
    virtual size: 2.2 GiB (2361393152 bytes)
    disk size: 642 MiB
    cluster_size: 65536
    Format specific information:
        compat: 0.10
        compression type: zlib
        refcount bits: 16
    
  • While the Longhorn snapshot has a 10GB logical size

Revisiting 2 cases:

  1. mount the BackingImage to the Pod + write some data => the exported BackingImage is consistent -> In this case, all the writes fall into the first 2.2GB region of the backing image file
  2. mount the BackingImage to the Pod + recreate the partition & resize2fs + write some data => the exported BackingImage is inconsistent and the filesystem is corrupted -> In this case, some data might fall outside the 2.2GB region of the backing image file. So something like:
    assuming snapshot chain as follows
    0. nil
    1. BackingImage [1, 1, 0, 0]
    2. Snap1        [1, 0, 0, 1, 0, 1, 0]
    
    I am wondering if reading data from the qcow2 backing image outside its virtual size can cause any inconsistency. Could we test with a raw backing image instead to see if we hit the same issue, @ChanYiLin?

I am wondering if it is because of the following:

assuming snapshot chain as follows
0. nil
1. BackingImage [1, 1, 0, 0]
2. Snap1 [1, 0, 0, 1]


Case 1 - Volume Clone: Export without BackingImage
first init to all 0, location = [0, 0, 0, 0]
r.volumes.preload(), location:
i = 0, nil                          skip => location = [0, 0, 0, 0]
i = 1, BackingImage [1, 1, 0, 0],   skip => location = [0, 0, 0, 0]
i = 2, Snap1        [1, 0, 0, 1]         => location = [2, 0, 0, 2]

ExportVolume and GetDataLayout=> [2, 0, 0, 2] which is [Data, Hole, Hole, Data]

Case 2 - BackingImage Export: Export with BackingImage
first init to all 1, location = [1, 1, 1, 1]
r.volumes.preload(), location
i = 0, nil                          skip => location = [1, 1, 1, 1]
i = 1, BackingImage [1, 1, 0, 0],   skip => location = [1, 1, 1, 1]
i = 2, Snap1        [1, 0, 0, 1]         => location = [2, 1, 1, 2]

ExportVolume and GetDataLayout => [2, 1, 1, 2] which is [Data, Data, Data, Data]

In SyncContent() and GetDataLayout(), we decide whether a sector is Data or a Hole by checking whether its DiskIndex is 0 or non-0. If it is Data, we read the data for that interval from the replica and send it to the target. If it is a Hole, we simply ask the target to punch a hole.

So in Case 1, we still punch holes for sectors 2 and 3. But in Case 2, since we init the index map to all “1”, every sector is Data and we never punch holes. Then there might be sectors that should be Holes, because they have no data in the BackingImage either, but we fill those sectors with 0 instead of punching a Hole?

GetDataLayout()
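
For illustration, a minimal, self-contained sketch (not the real GetDataLayout()/SyncContent() code) of the Data/Hole classification described above, applied to the location arrays from the two cases:

package main

import "fmt"

// layout mirrors the decision described above: disk index 0 => Hole,
// any non-zero index => Data.
func layout(location []byte) []string {
	out := make([]string, len(location))
	for i, idx := range location {
		if idx == 0 {
			out[i] = "Hole"
		} else {
			out[i] = "Data"
		}
	}
	return out
}

func main() {
	// Case 1 (volume clone, no backing image): location starts all 0,
	// preload only marks the snapshot sectors, so holes are preserved.
	fmt.Println(layout([]byte{2, 0, 0, 2})) // [Data Hole Hole Data]

	// Case 2 (export with backing image): location starts all 1,
	// so no sector is ever classified as a Hole and nothing gets punched.
	fmt.Println(layout([]byte{2, 1, 1, 2})) // [Data Data Data Data]
}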

Note: If we don’t recreate the partition, the exported BackingImage would be consistent… but why? TestCase

cc @shuo-wu @PhanLe1010

Some observations with the following steps:

  1. Preparing a Harvester v1.2.0-head single node cluster (for using volume as an OS root disk)
  2. Uploading Ubuntu jammy cloud image (latest version), img1, with a replica count of 1
  3. Checking data usage of img1 (mapped to bi default-image-s9rfj), which is 640MB data_usage_origin_img
  4. Creating volume os-vol (mapped to volume pvc-c3b1a434-86ee-453b-afc7-a52fe113c059) from img1 with claim size 10GB
  5. Creating a VM with os-vol as the root disk
  6. Stopping VM, and waiting until os-vol is completely detached
  7. Exporting os-vol as os-vol.img (mapped to bi default-image-j8ghr) with a replica count of 1
  8. Checking replica data usage of os-vol, which is 523MB data_usage_os_vol
  9. Checking data usage of os-vol.img, which is 2507MB data_usage_os_vol_img
    • For the data usage, os-vol.img (2507MB) is much greater than the sum of img1 (640MB) and os-vol (523MB)
    • os-vol.img’s data usage shouldn’t be greater than the sum of img1’s and os-vol’s, should it?
  10. Creating volume copy-vol from os-vol.img with claim size 10GB
  11. Comparing data content between os-vol and copy-vol with $ cmp -l os-vol copy-vol
  12. From the comparison result cmp.log, there are several contiguous byte ranges overwritten with zeros in the synthetic volume, copy-vol. Running fsck on copy-vol reports filesystem errors and requires data repair

Thanks

After discussing with @ChanYiLin, we agree that Longhorn should return the length the caller wants to read if there is no EOF error.

Sure, no problem, thanks for your help! I think we are pretty close to the answer.

Awesome. It means somehow the region outside of the backing image boundary (3MB offset to 4MB offset region) in the backing image layer is overwriting the snapshot layer.

Maybe the next step would be adding some logs to the preload() function to see if we can identify the faulty logic (why it chooses the backing image layer instead of the snapshot layer for the 3MB-4MB region)?

Yes, that is what I observed here.

And the only difference between these two is preload().

I can reproduce the issue; I will look into how we export the BackingImage.

cc @shuo-wu

Verified to pass on longhorn master (longhorn-engine ac4748) and longhorn v1.6.x (longhorn-engine 0b553a) with the test steps.

I could reproduce this issue before the fix; after the fix, the data of the two volumes is identical.

From my perspective, I think we should still return len(data) for diffDisk.fullReadAt. The reason is that, from the caller’s perspective, it is reading from the "Volume":

func (r *Replica) ReadAt(buf []byte, offset int64) (int, error) {
	r.RLock()
	c, err := r.volume.ReadAt(buf, offset) // which will call diffDisk.fullReadAt()
	r.RUnlock()
	return c, err
}

So it has no knowledge about the snapshot chain underneath.

If it reads outside of the boundary of the "Volume", we do return an io.ErrUnexpectedEOF error here. But if it only reads outside of the boundary of a specific snapshot image, and we allow it to do so, then we should treat it as reading a bunch of “0”, and that is also true: the caller does get a bunch of “0”. The caller should not have to handle this situation; it would be confusing to get no EOF error and yet receive a data length shorter than requested.
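
A toy model of that contract (purely illustrative, with made-up names): reads inside the volume boundary always return the full requested length, even if part of the range lies beyond a snapshot or backing image file, while reads past the volume boundary fail:

package main

import (
	"fmt"
	"io"
)

// toyVolume models only the read contract, not any real replica logic.
type toyVolume struct {
	size int64 // logical volume size
}

func (v toyVolume) ReadAt(buf []byte, off int64) (int, error) {
	if off+int64(len(buf)) > v.size {
		// Reading past the volume boundary is a real error.
		return 0, io.ErrUnexpectedEOF
	}
	// Inside the volume but beyond a snapshot/backing file, the region simply
	// reads as zeros and the caller still gets the full count it asked for.
	for i := range buf {
		buf[i] = 0
	}
	return len(buf), nil
}

func main() {
	v := toyVolume{size: 16 << 20} // e.g. a 16MB volume backed by a 2MB image
	buf := make([]byte, 1<<20)

	n, err := v.ReadAt(buf, 3<<20) // beyond the backing file, inside the volume
	fmt.Println(n, err)            // 1048576 <nil>

	n, err = v.ReadAt(buf, 16<<20) // past the volume boundary
	fmt.Println(n, err)            // 0 unexpected EOF
}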

@PhanLe1010 you can search for [DEBUG in syncDataInterval]: Sending batch: [ 512: 1024](512), dataBuffer: in the file; that is the data sent from the source volume to the target volume.

Since the sending batch size is 2MB and the sector size is 4KB, [ 512: 1024] means [2MB: 4MB] (512 × 4KB = 2MB, 1024 × 4KB = 4MB).

The sending batch size is 2MB, so for a 16MB Volume it only sends 8 batches: [0, 2MB], [2MB, 4MB], … I printed the read data out and found that even when it was reading the [2MB, 4MB] range from the replicas, the data was all 0. However, I also printed out the disk index used during the ReadAt(), and it was reading from the correct disk. Need to investigate more.

Here is the log lhim2_new2.txt

Sorry @ChanYiLin. I will have to continue tomorrow.

Hi @PhanLe1010, I intentionally printed r.volume.location after the preload and also printed the segments sent during the export. Here are the results: Screenshot from 2024-01-11 14-00-46

As you can see, I think the preload location is correct; there is a series of 2s, which means the data comes from the snapshot. But when exporting, it only sent one segment [D 0 4096] and then finished.

Misleading here: the 4096 means 4096 blocks, and each block is 4096 bytes. So it is correct; it is sending 16MB in one interval because all the blocks are Data.

Note: I added the log message here; this function is executed by multiple goroutines in parallel when sending segments. lhim_log.txt

Note:

  • VolumeSectorSize = 4096 which is the size of each sector in r.Volume.Location

Another testing idea I am thinking about is:

  1. Create an empty qcow2/raw image file of 1MB

  2. Create a volume with that backing image and the volume size is 16MB

  3. Write some data at the 2MB offset to the volume

  4. Export the backing image from the volume

  5. Read and compare the data from the newly exported backing image and the old volume (a rough comparison sketch follows this list).

    1. Which regions of the data are different?
    2. Does the newly exported backing image contain all zeros? -> If yes, it means the backing image layer overwrites the snapshot layer?
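
For step 5, a rough comparison sketch (assuming both volumes are attached and visible as block devices; the device paths below are hypothetical):

package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

func main() {
	const chunk = 1 << 20 // compare 1MiB at a time

	src, err := os.Open("/dev/longhorn/test-volume") // hypothetical path
	if err != nil {
		panic(err)
	}
	defer src.Close()

	exp, err := os.Open("/dev/longhorn/test-export") // hypothetical path
	if err != nil {
		panic(err)
	}
	defer exp.Close()

	bufA := make([]byte, chunk)
	bufB := make([]byte, chunk)
	for off := int64(0); ; off += chunk {
		nA, errA := io.ReadFull(src, bufA)
		nB, errB := io.ReadFull(exp, bufB)
		if nA == 0 && nB == 0 {
			break
		}
		if !bytes.Equal(bufA[:nA], bufB[:nB]) {
			fmt.Printf("region [%d, %d) MiB differs\n", off/chunk, off/chunk+1)
		}
		if errA != nil || errB != nil {
			break // EOF or a short read on either device ends the scan
		}
	}
}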

Hi @PhanLe1010, thanks for the idea. But we have confirmed that, with a qcow2 image, if we directly mount the filesystem without recreating the partition, the exported BackingImage is consistent. Here is the test case

I have also tested the following:

  • Create a VolumeA with BackingImageA and attach it to the host
  • dd the data into the device directly
  • Export VolumeA to another BackingImage, BackingImageB
  • Create another VolumeB with BackingImageB
  • VolumeA and VolumeB are consistent

This is also an automated test case in Longhorn.

The question is: in case 2, location = [2, 1, 1, 2] means [read data from snap1, read data from the BI, read data from the BI, read data from snap1]. And reading data from the BI should give the same result, which is a bunch of zeros. In other words, the final data set should be identical…

Do the volume clone test and the failed backing image test mean that the new partitioning info can be cloned correctly via VolumeClone but cannot be exported via ExportingBackingImage?

And we also found that:

  1. mount the BackingImage to the Pod + write some data => the exported BackingImage is consistent
  2. mount the BackingImage to the Pod + recreate the partition & resize2fs + write some data => the exported BackingImage is inconsistent and the filesystem is corrupted

So maybe the partition information gets corrupted when exporting: that information should be in the new snapshot image, and it overwrites the base BackingImage when exporting.