longhorn: [BUG] Backing Image Data Inconsistency if it's Exported from a Backing Image Backed Volume

Describe the bug (🐛 if you encounter this issue)

A volume hits a data integrity issue if it is created from a backing image, and that backing image was exported from a volume that is itself backed by another backing image.

To Reproduce

  1. Preparing a Harvester v1.2.0-head cluster (for using volume as an OS root disk)
  2. Uploading Ubuntu jammy cloud image (latest version), img1
  3. Creating volume vol1 from img1
  4. Creating a VM with vol1 as the root disk, and writing some data
  5. Issuing sync on the filesystem, e.g. $ sudo sync
  6. Stopping VM, and waiting until vol1 is completely detached
  7. Exporting vol1 as img2
  8. Creating vol2 from img2
  9. Comparing the data content of vol1 and vol2: the data is inconsistent
  10. Running e2fsck -v on vol2: fsck reports filesystem errors and requires data repair

Expected behavior

The data of vol1 and vol2 should be consistent.

Support bundle for troubleshooting

supportbundle_066aa8ed-a12b-4ccc-971c-59404ee44cd4_2023-10-16T07-46-44Z.zip

Environment

  • Longhorn version: v1.4.3
  • Harvester version: v1.2.0-head
  • Number of Longhorn volumes in the cluster: 2
  • Impacted Longhorn resources:
    • Volume names: pvc-b8bd9013-e8e9-449d-9c25-6c6c0824d0b3, pvc-de64dd95-b0c1-4ce9-864a-88d38e06d9d3

Additional context

  • If vol1 is not backed by a backing image, no data inconsistency is found between vol1 and vol2
  • A single node Harvester cluster should be enough to reproduce this issue

About this issue

  • Original URL
  • State: closed
  • Created 8 months ago
  • Reactions: 2
  • Comments: 45 (41 by maintainers)

Most upvoted comments

I think I found the bug. When reading [2MB-4MB] from the Volume:

[2MB-3MB] is BackingImage
[3MB-4MB] is snap1

https://github.com/longhorn/longhorn-engine/blob/deb8b18a1558ddf4f359146b289d5469bef7cc6d/pkg/replica/diff_disk.go#L202-L241 This is where we try to read the data:

  • If the target disk is out of bounds, it can be because of Volume Expansion or the BackingImage.
    • In this case the preload location indicated that [2MB-3MB] belongs to the BackingImage, so we tried to read from it.
  • Then we directly returned a count of 0, because we cannot read any data from the BackingImage into the buf for this range:
	// May read the expanded part
	if offset >= size {
		return 0, nil
	}
  • For [3MB-4MB], the target disk was correct; we read the data from snap1 into the buf, and the count is 1MB.
  • But we then cut the returned buf to count after reading from the file here. So the returned data drops what we read from snap1 and keeps only the first half, which is all “0”:
func ReadDataInterval(rw ReaderWriterAt, dataInterval Interval) ([]byte, error) {
	data := AllocateAligned(int(dataInterval.Len()))
	n, err := rw.ReadAt(data, dataInterval.Begin)
	if err != nil {
		if err == io.EOF {
			log.Debugf("Have read at the end of file, total read: %d", n)
		} else {
			log.WithError(err).Errorf("Failed to read interval %+v", dataInterval)
			return nil, errors.Wrapf(err, "failed to read interval %+v", dataInterval)
		}
	}
	return data[:n], nil
}

I wonder whether we really need to cut it to count.
Or is the alignment here correct? Or maybe we should still return the full buf length as the count when reading from the expanded part, as if we were reading a bunch of “0”, instead of returning a count of 0.
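
To make the effect concrete, here is a tiny, self-contained Go simulation (not longhorn code; all names are made up) of a 2MB batch whose first half falls beyond the backing image and whose second half is actually read from snap1:

package main

import "fmt"

const mb = 1 << 20

// fakeVolume mimics fullReadAt for such an interval: only the second half
// (the snap1 part) is filled in, and the returned count covers just those bytes.
type fakeVolume struct{}

func (fakeVolume) ReadAt(buf []byte, off int64) (int, error) {
	for i := mb; i < len(buf); i++ {
		buf[i] = 0xab // pretend this is real data read from snap1
	}
	return mb, nil // count only reflects what was actually read from a file
}

func main() {
	data := make([]byte, 2*mb) // the 2MB batch buffer
	n, _ := fakeVolume{}.ReadAt(data, 2*mb)

	// The data[:n] truncation from ReadDataInterval keeps only the zeroed
	// first half and silently drops the snap1 bytes sitting at data[n:].
	sent := data[:n]
	fmt.Printf("count=%d, sent ends with %#x, dropped byte is %#x\n",
		n, sent[len(sent)-1], data[len(data)-1])
	// prints: count=1048576, sent ends with 0x0, dropped byte is 0xab
}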

cc @shuo-wu @PhanLe1010 @WebberHuang1118

I found a method to reproduce a similar case without Harvester-specific operations (no VM operations, only creating a PVC from a backing image and doing I/O).

Detailed steps:

  • Uploading Ubuntu jammy cloud image (latest version), img1
  • Creating PVC os-vol from img1 with a claim size of 10GB and a replica count of 1
  • Applying the following yaml to mount os-vol as block mode in the pod
    apiVersion: v1
    kind: Pod
    metadata:
      name: block-volume-test
      namespace: default
    spec:
      containers:
      - name: volume-test
        image: ubuntu
        imagePullPolicy: IfNotPresent
        securityContext:
          privileged: true
        command: ["/bin/sleep"]
        args: ["3600"]
        volumeDevices:
        - devicePath: /dev/os-vol
          name: os-vol
      volumes:
        - name: os-vol
          persistentVolumeClaim:
            claimName: os-vol
    
  • Issuing filesystem expand and I/O on os-vol
    • Enter pod block-volume-test
    • Install fdisk
      • $ apt update; apt install -y fdisk
    • Expanding filesystem on os-vol
      • Check the device (/dev/sda for example) mapped to os-vol by device number
      • $ fdisk /dev/sda
        • d 1 n 1 default default No w
      • $ resize2fs -fp /dev/sda1
    • Mounting /dev/sda1 and writing I/O into it
      • $ mkdir /data; $ mount -t ext4 /dev/sda1 /data
      • $ dd if=/dev/zero of=/data/file bs=1M count=2048
    • Flushing data to storage for os-vol
      • $ umount /data
    • Delete pod block-volume-test, and wait until os-vol’s Longhorn volume detaches
  • Exporting os-vol as image os-vol.img
  • Creating PVC copy-vol from os-vol.img with claim size 10GB
  • Comparing data content between os-vol and copy-vol with $ cmp -l os-vol copy-vol
    • The data content is different between the two PVCs
  • Finding filesystem errors on copy-vol with $ e2fsck -c /dev/sda1 (assuming PVC copy-vol is mapped to /dev/sda)
    • There is no filesystem error on os-vol

PS:

  • IMHO, it may be related to resize2fs, since it involves more complicated I/O patterns, including discards to image block ranges.

TEST (By directly returning data instead of data[:n]) - PASSED

The volumes are now consistent, even when testing with the original test case:

  • cksum are the same
  • filesystem is not corrupted

Summary

I think this is the correct fix. When sending a batch for an interval, we allocate the buf with the batch size (the interval size), then send the data along with the interval information to the target volume so it writes the data into that interval. We should not cut the data to the count we actually read from the volume, because the count might be 0 in the case of volume expansion or a BackingImage. And I don’t think the count matters in this case.
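
As a sketch, the change that was tested amounts to keeping the full interval-sized buffer instead of truncating it to the read count, reusing the names from the function quoted above (the actual upstream patch may differ in details):

func ReadDataInterval(rw ReaderWriterAt, dataInterval Interval) ([]byte, error) {
	data := AllocateAligned(int(dataInterval.Len()))
	n, err := rw.ReadAt(data, dataInterval.Begin)
	if err != nil {
		if err == io.EOF {
			log.Debugf("Have read at the end of file, total read: %d", n)
		} else {
			log.WithError(err).Errorf("Failed to read interval %+v", dataInterval)
			return nil, errors.Wrapf(err, "failed to read interval %+v", dataInterval)
		}
	}
	// Return the whole interval-sized buffer; regions that could not be read
	// (expanded part / backing image) stay zero-filled instead of being cut off.
	return data, nil // was: return data[:n], nil
}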

Hi @PhanLe1010, I think you are right.

Backing Image Export Test (using empty raw image with larger volume size) - FAILED

  • Create an empty raw image and upload as a BackingImage empty-raw
    • $ truncate -s 2M empty-raw
  • Create a 16MB Volume test-empty with the raw image
  • Attach to the host
  • dd 1MB of data starting at the 3MB offset
    • $ dd of=/dev/longhorn/test-empty if=/dev/urandom bs=1M count=1 oflag=seek_bytes seek=3145728
  • Export the Volume to another BackingImage empty-raw-export
  • Create a 16MB Volume test-empty-export with the BackingImage empty-raw-export
  • Attach to the host

Test Result

test-empty-export is empty, and the comparison shows it differs starting at the 3MB offset (byte 3145729 = 3 × 1024 × 1024 + 1, since cmp counts from 1):

brw-rw----  1 root root 8, 48 Jan 11 05:35 test-empty
brw-rw----  1 root root 8, 64 Jan 11 05:36 test-empty-export
root@kworker1:/dev/longhorn# cat test-empty-export
root@kworker1:/dev/longhorn# cmp test-empty test-empty-export
test-empty test-empty-export differ: byte 3145729, line 1
root@kworker1:/dev/longhorn# 

cc @shuo-wu

Hi @PhanLe1010, thanks for the advice! Here are the testing results.

BackingImage Export (use raw image) - FAILED

  1. Create a raw image (I converted the jammy image from qcow2 to raw) and a Volume with it
  2. Mount to the Pod
  3. Use fdisk to recreate the partition and resize2fs to resize the filesystem
  4. Mount the device and write I/O into it
    • $ mkdir /data; $ mount -t ext4 /dev/sdb1 /data
    • $ dd if=/dev/zero of=/data/file bs=1M count=2048
    • $ sync
  5. Export the Volume to another BackingImage
  6. Create a Volume with the BackingImage

The two file sizes are slightly different:

root@kworker2:/var/lib/longhorn# du -sm replicas/pvc-b334ec5f-ab8f-4285-bffe-743b542ec494-9a715c65/*
1	replicas/pvc-b334ec5f-ab8f-4285-bffe-743b542ec494-9a715c65/revision.counter
0	replicas/pvc-b334ec5f-ab8f-4285-bffe-743b542ec494-9a715c65/volume-head-001.img
1	replicas/pvc-b334ec5f-ab8f-4285-bffe-743b542ec494-9a715c65/volume-head-001.img.meta
2141	replicas/pvc-b334ec5f-ab8f-4285-bffe-743b542ec494-9a715c65/volume-snap-test-export-806bbe7b.img
1	replicas/pvc-b334ec5f-ab8f-4285-bffe-743b542ec494-9a715c65/volume-snap-test-export-806bbe7b.img.meta
1	replicas/pvc-b334ec5f-ab8f-4285-bffe-743b542ec494-9a715c65/volume.meta

root@kworker2:/var/lib/longhorn# du -sm backing-images/test-raw-61fde503/*
1517	backing-images/test-raw-61fde503/backing
1	backing-images/test-raw-61fde503/backing.cfg

root@kworker2:/var/lib/longhorn# du -sm backing-images/test-export-c504be3c/*
3868	backing-images/test-export-c504be3c/backing
1	backing-images/test-export-c504be3c/backing.cfg

Attaching the two volumes to the host, the cksums are not the same:

root@kworker2:/dev/longhorn# ll
total 0
drwxr-xr-x  2 root root    80 Jan 11 03:24 ./
drwxr-xr-x 20 root root  4260 Jan 11 03:24 ../
brw-rw----  1 root root 8, 32 Jan 11 03:24 pvc-6a0211dd-09d8-4bb7-a95e-d054988dd958
brw-rw----  1 root root 8, 16 Jan 11 03:17 pvc-b334ec5f-ab8f-4285-bffe-743b542ec494
root@kworker2:/dev/longhorn# cksum pvc-b334ec5f-ab8f-4285-bffe-743b542ec494 
969496369 5368709120 pvc-b334ec5f-ab8f-4285-bffe-743b542ec494
root@kworker2:/dev/longhorn# cksum pvc-6a0211dd-09d8-4bb7-a95e-d054988dd958
2208768380 5368709120 pvc-6a0211dd-09d8-4bb7-a95e-d054988dd958

The volume created with the exported BackingImage has no filesystem corruption issue:

root@block-volume-test:/# ls -al /dev/os-export 
brw-rw---- 1 root root 8, 32 Jan 11 03:35 /dev/os-export
root@block-volume-test:/# ls -al /dev/sdc       
brw-rw---- 1 root disk 8, 32 Jan 11 03:35 /dev/sdc
root@block-volume-test:/# e2fsck -c /dev/sdc1
e2fsck 1.46.5 (30-Dec-2021)
Checking for bad blocks (read-only test): done                                                 
cloudimg-rootfs: Updating bad block inode.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

cloudimg-rootfs: ***** FILE SYSTEM WAS MODIFIED *****
cloudimg-rootfs: 73666/645120 files (0.0% non-contiguous), 949441/1282299 blocks

The data written into the device is the same:

root@block-volume-test:/# mkdir /data
root@block-volume-test:/# mkdir /data-export
root@block-volume-test:/# mount -t ext4 /dev/sdb1 /data
root@block-volume-test:/# mount -t ext4 /dev/sdc1 /data-export
root@block-volume-test:/# cksum /data/file
2532515601 2147483648 /data/file
root@block-volume-test:/# cksum /data-export/file
2532515601 2147483648 /data-export/file

Conclusion

  • Using a raw image, the two volumes are still not consistent, and the sizes are different but nearly the same.
  • The exported BackingImage is slightly larger
  • The filesystem in this case is not corrupted

cc @shuo-wu

This example assumes that the logical (virtual) size of the backing image is equal to the logical size of the snapshot:

assuming snapshot chain as follows
0. nil
1. BackingImage [1, 1, 0, 0]
2. Snap1 [1, 0, 0, 1]


Case 1 - Volume Clone: Export without BackingImage
first init to all 0, location = [0, 0, 0, 0]
r.volumes.preload(), location:
i = 0, nil                          skip => location = [0, 0, 0, 0]
i = 1, BackingImage [1, 1, 0, 0],   skip => location = [0, 0, 0, 0]
i = 2, Snap1        [1, 0, 0, 1]         => location = [2, 0, 0, 2]

ExportVolume and GetDataLayout=> [2, 0, 0, 2] which is [Data, Hole, Hole, Data]

Case 2 - BackingImage Export: Export with BackingImage
first init to all 1, location = [1, 1, 1, 1]
r.volumes.preload(), location
i = 0, nil                          skip => location = [1, 1, 1, 1]
i = 1, BackingImage [1, 1, 0, 0],   skip => location = [1, 1, 1, 1]
i = 2, Snap1        [1, 0, 0, 1]         => location = [2, 1, 1, 2]

ExportVolume and GetDataLayout => [2, 1, 1, 2] which is [Data, Data, Data, Data]

In reality, they are not:

  • The Ubuntu image has a 2.2 GiB logical size:
    ➜  Downloads qemu-img info jammy-server-cloudimg-amd64.img 
    image: jammy-server-cloudimg-amd64.img
    file format: qcow2
    virtual size: 2.2 GiB (2361393152 bytes)
    disk size: 642 MiB
    cluster_size: 65536
    Format specific information:
        compat: 0.10
        compression type: zlib
        refcount bits: 16
    
  • While the Longhorn snapshot has a 10GB logical size

Revisiting 2 cases:

  1. mount the BackingImage to the Pod + write some data => the exported BackingImage is consistent -> In this case, all the writes fall into the first 2.2GB region of the backing image file
  2. mount the BackingImage to the Pod + recreate the partition & resize2fs + write some data => the exported BackingImage is inconsistent and the filesystem is corrupted -> In this case, some data might fall outside the 2.2GB region of the backing image file. So something like:
    assuming snapshot chain as follows
    0. nil
    1. BackingImage [1, 1, 0, 0]
    2. Snap1        [1, 0, 0, 1, 0, 1, 0]
    
    I am wondering if reading data from the qcow2 backing image outside its virtual size can cause any inconsistency. Could we test with a raw backing image instead to see if we hit the same issue, @ChanYiLin?

I am wondering if it is because of the following:

assuming snapshot chain as follows
0. nil
1. BackingImage [1, 1, 0, 0]
2. Snap1 [1, 0, 0, 1]


Case 1 - Volume Clone: Export without BackingImage
first init to all 0, location = [0, 0, 0, 0]
r.volumes.preload(), location:
i = 0, nil                          skip => location = [0, 0, 0, 0]
i = 1, BackingImage [1, 1, 0, 0],   skip => location = [0, 0, 0, 0]
i = 2, Snap1        [1, 0, 0, 1]         => location = [2, 0, 0, 2]

ExportVolume and GetDataLayout=> [2, 0, 0, 2] which is [Data, Hole, Hole, Data]

Case 2 - BackingImage Export: Export with BackingImage
first init to all 1, location = [1, 1, 1, 1]
r.volumes.preload(), location
i = 0, nil                          skip => location = [1, 1, 1, 1]
i = 1, BackingImage [1, 1, 0, 0],   skip => location = [1, 1, 1, 1]
i = 2, Snap1        [1, 0, 0, 1]         => location = [2, 1, 1, 2]

ExportVolume and GetDataLayout => [2, 1, 1, 2] which is [Data, Data, Data, Data]

In SyncContent() and GetDataLayout(), we decide whether a sector is Data or a Hole by checking whether its DiskIndex is 0 or non-0. If it is Data, we read the data for that interval from the replica and send it to the target. If it is a Hole, we simply ask the target to punch a hole.

So in Case 1, we still punch holes for sectors 2 and 3. But in Case 2, since we init the index map to all “1”, every sector is Data and we never punch holes. Then there might be sectors that should be Holes, because they have no data in the BackingImage either, but we fill those sectors with 0 instead of punching a Hole?

GetDataLayout()
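
For illustration, a minimal, self-contained sketch (not the real GetDataLayout()/SyncContent() code) of the Data/Hole classification described above, applied to the location arrays from the two cases:

package main

import "fmt"

// layout mirrors the decision described above: disk index 0 => Hole,
// any non-zero index => Data.
func layout(location []byte) []string {
	out := make([]string, len(location))
	for i, idx := range location {
		if idx == 0 {
			out[i] = "Hole"
		} else {
			out[i] = "Data"
		}
	}
	return out
}

func main() {
	// Case 1 (volume clone, no backing image): location starts all 0,
	// preload only marks the snapshot sectors, so holes are preserved.
	fmt.Println(layout([]byte{2, 0, 0, 2})) // [Data Hole Hole Data]

	// Case 2 (export with backing image): location starts all 1,
	// so no sector is ever classified as a Hole and nothing gets punched.
	fmt.Println(layout([]byte{2, 1, 1, 2})) // [Data Data Data Data]
}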

Note: If we don’t recreate the partition, the exported BackingImage would be consistent… but why? TestCase

cc @shuo-wu @PhanLe1010

Some observations with the following steps:

  1. Preparing a Harvester v1.2.0-head single node cluster (for using volume as an OS root disk)
  2. Uploading Ubuntu jammy cloud image (latest version), img1, with a replica count of 1
  3. Checking data usage of img1 (mapped to bi default-image-s9rfj), which is 640MB data_usage_origin_img
  4. Creating volume os-vol (mapped to volume pvc-c3b1a434-86ee-453b-afc7-a52fe113c059) from img1 with claim size 10GB
  5. Creating a VM with os-vol as the root disk
  6. Stopping VM, and waiting until os-vol is completely detached
  7. Exporting os-vol as os-vol.img (mapped to bi default-image-j8ghr) with a replica count of 1
  8. Checking replica data usage of os-vol, which is 523MB data_usage_os_vol
  9. Checking data usage of os-vol.img, which is 2507MB data_usage_os_vol_img
    • For the data usage, os-vol.img (2507MB) is much greater than the sum of img1 (640MB) and os-vol (523MB)
    • os-vol.img’s data usage shouldn’t be greater than the sum of img1’s and os-vol’s, should it?
  10. Creating volume copy-vol from os-vol.img with claim size 10GB
  11. Comparing data content between os-vol and copy-vol with $ cmp -l os-vol copy-vol
  12. From the comparison result cmp.log, there are several contiguous byte ranges overwritten with zeros in the synthetic volume, copy-vol. Running fsck on copy-vol reports filesystem errors and requires data repair

Thanks

After discussing with @ChanYiLin, we agree that Longhorn should return the length the caller wants to read if there is no EOF error.

Sure, no problem, thanks for your help! I think we are pretty close to the answer.

Awesome. It means somehow the region outside of the backing image boundary (3MB offset to 4MB offset region) in the backing image layer is overwriting the snapshot layer.

Maybe the next step would be adding some logs to the preload() function to see if we can identify the faulty logic (why it chooses the backing image layer instead of the snapshot layer for the 3MB-4MB region)?

Yes, that is what I observed here.

And the only difference between these two is preload().

I can reproduce the issue; I will look into how we export the BackingImage.

cc @shuo-wu

Verified to pass on longhorn master (longhorn-engine ac4748) and longhorn v1.6.x (longhorn-engine 0b553a) with the test steps.

I could reproduce this issue before the fix; after the fix, the data of the two volumes is identical.

From my perspective, I think we should still return len(data) for diffDisk.fullReadAt. The reason is that, from the caller’s perspective, it is reading from the "Volume":

func (r *Replica) ReadAt(buf []byte, offset int64) (int, error) {
	r.RLock()
	c, err := r.volume.ReadAt(buf, offset) // which will call diffDisk.fullReadAt()
	r.RUnlock()
	return c, err
}

So it has no knowledge about the snapshot chain underneath.

If it reads outside of the boundary of the "Volume", we do return an io.ErrUnexpectedEOF error here. But if it only reads outside of the boundary of a specific snapshot image, and we allow it to do so, then we should treat it as reading a bunch of “0”, and that is also true: the caller does get a bunch of “0”. The caller should not have to handle this situation; it would be confusing to get no EOF error and yet receive a data length shorter than requested.
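
A toy model of that contract (purely illustrative, with made-up names): reads inside the volume boundary always return the full requested length, even if part of the range lies beyond a snapshot or backing image file, while reads past the volume boundary fail:

package main

import (
	"fmt"
	"io"
)

// toyVolume models only the read contract, not any real replica logic.
type toyVolume struct {
	size int64 // logical volume size
}

func (v toyVolume) ReadAt(buf []byte, off int64) (int, error) {
	if off+int64(len(buf)) > v.size {
		// Reading past the volume boundary is a real error.
		return 0, io.ErrUnexpectedEOF
	}
	// Inside the volume but beyond a snapshot/backing file, the region simply
	// reads as zeros and the caller still gets the full count it asked for.
	for i := range buf {
		buf[i] = 0
	}
	return len(buf), nil
}

func main() {
	v := toyVolume{size: 16 << 20} // e.g. a 16MB volume backed by a 2MB image
	buf := make([]byte, 1<<20)

	n, err := v.ReadAt(buf, 3<<20) // beyond the backing file, inside the volume
	fmt.Println(n, err)            // 1048576 <nil>

	n, err = v.ReadAt(buf, 16<<20) // past the volume boundary
	fmt.Println(n, err)            // 0 unexpected EOF
}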

@PhanLe1010 you can search for [DEBUG in syncDataInterval]: Sending batch: [ 512: 1024](512), dataBuffer: in the file; that is the data sent from the source volume to the target volume.

Since the sending batch size is 2MB and the sector size is 4KB, [ 512: 1024] means [2MB: 4MB] (512 × 4KB = 2MB, 1024 × 4KB = 4MB).

The sending batch size is 2MB, so for a 16MB Volume it only sends 8 batches: [0, 2MB], [2MB, 4MB], … I printed the read data out and found that even when it was reading the [2MB, 4MB] range from the replicas, the data was all 0. However, I also printed out the disk index used during the ReadAt(), and it was reading from the correct disk. Need to investigate more.

Here is the log lhim2_new2.txt

Sorry @ChanYiLin. I will have to continue tomorrow.

Hi @PhanLe1010, I intentionally printed r.volume.location after the preload and also printed the segments sent during the export. Here are the results: Screenshot from 2024-01-11 14-00-46

As you can see, I think the preload location is correct; there is a series of 2s, which means the data comes from the snapshot. But when exporting, it only sent one segment [D 0 4096] and then finished.

Misleading here: the 4096 means 4096 blocks, and each block is 4096 bytes. So it is correct; it is sending 16MB in one interval because all the blocks are Data.

Note: I added the log message here; this function is executed by multiple goroutines in parallel when sending segments. lhim_log.txt

Note:

  • VolumeSectorSize = 4096 which is the size of each sector in r.Volume.Location

Another testing idea I am thinking about is:

  1. Create an empty qcow2/raw image file of 1MB

  2. Create a volume with that backing image and the volume size is 16MB

  3. Write some data at the 2MB offset to the volume

  4. Export the backing image from the volume

  5. Read and compare the data from the newly exported backing image and the old volume (a rough comparison sketch follows this list).

    1. Which regions of the data are different?
    2. Does the newly exported backing image contain all zeros? -> If yes, it means the backing image layer overwrites the snapshot layer?
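
For step 5, a rough comparison sketch (assuming both volumes are attached and visible as block devices; the device paths below are hypothetical):

package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

func main() {
	const chunk = 1 << 20 // compare 1MiB at a time

	src, err := os.Open("/dev/longhorn/test-volume") // hypothetical path
	if err != nil {
		panic(err)
	}
	defer src.Close()

	exp, err := os.Open("/dev/longhorn/test-export") // hypothetical path
	if err != nil {
		panic(err)
	}
	defer exp.Close()

	bufA := make([]byte, chunk)
	bufB := make([]byte, chunk)
	for off := int64(0); ; off += chunk {
		nA, errA := io.ReadFull(src, bufA)
		nB, errB := io.ReadFull(exp, bufB)
		if nA == 0 && nB == 0 {
			break
		}
		if !bytes.Equal(bufA[:nA], bufB[:nB]) {
			fmt.Printf("region [%d, %d) MiB differs\n", off/chunk, off/chunk+1)
		}
		if errA != nil || errB != nil {
			break // EOF or a short read on either device ends the scan
		}
	}
}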

Hi @PhanLe1010, thanks for the idea. But we have confirmed that, with a qcow2 image, if we directly mount the filesystem without recreating the partition, the exported BackingImage is consistent. Here is the test case

I have also tested the following:

  • Create a VolumeA with BackingImageA and attach it to the host
  • dd the data into the device directly
  • Export VolumeA to another BackingImage, BackingImageB
  • Create another VolumeB with BackingImageB
  • VolumeA and VolumeB are consistent

This is also an automated test case in Longhorn.

The question is: in case 2, location = [2, 1, 1, 2] means [read data from snap1, read data from the BI, read data from the BI, read data from snap1]. And reading data from the BI should give the same result, which is a bunch of zeros. In other words, the final data set should be identical…

Do the volume clone test and the failed backing image test mean that the new partitioning info can be cloned correctly via VolumeClone but cannot be exported via ExportingBackingImage?

And we also found that:

  1. mount the BackingImage to the Pod + write some data => the exported BackingImage is consistent
  2. mount the BackingImage to the Pod + recreate the partition & resize2fs + write some data => the exported BackingImage is inconsistent and the filesystem is corrupted

So maybe the partition information gets corrupted when exporting: that information should be in the new snapshot image, and it overwrites the base BackingImage when exporting.