harvester: [BUG] Migrating (or simply stopping) VMs with LVM volumes causes Buffer I/O errors in kernel log of source host
When LVM is used inside Longhorn-backed VM volumes, migrating the VM causes Buffer I/O errors in the kernel log on the source host.
Stopping the VM - which causes its volume to detach from the host - triggers the same issue if the VM is then started on a different host.
Here we’re migrating a VM from host 2 to host 1, watching the kernel log:
Relevant kernel log on target host 1 as the volume is attached:
[Fri Apr 28 09:11:23 2023] scsi host19: iSCSI Initiator over TCP/IP
[Fri Apr 28 09:11:23 2023] scsi 19:0:0:0: RAID IET Controller 0001 PQ: 0 ANSI: 5
[Fri Apr 28 09:11:23 2023] scsi 19:0:0:0: Attached scsi generic sg11 type 12
[Fri Apr 28 09:11:23 2023] scsi 19:0:0:1: Direct-Access IET VIRTUAL-DISK 0001 PQ: 0 ANSI: 5
[Fri Apr 28 09:11:23 2023] sd 19:0:0:1: Attached scsi generic sg12 type 0
[Fri Apr 28 09:11:23 2023] sd 19:0:0:1: Power-on or device reset occurred
[Fri Apr 28 09:11:23 2023] sd 19:0:0:1: [sdg] 20971520 512-byte logical blocks: (10.7 GB/10.0 GiB)
[Fri Apr 28 09:11:23 2023] sd 19:0:0:1: [sdg] Write Protect is off
[Fri Apr 28 09:11:23 2023] sd 19:0:0:1: [sdg] Mode Sense: 69 00 10 08
[Fri Apr 28 09:11:23 2023] sd 19:0:0:1: [sdg] Write cache: enabled, read cache: enabled, supports DPO and FUA
[Fri Apr 28 09:11:24 2023] sdg: sdg1 sdg2 < sdg5 >
[Fri Apr 28 09:11:24 2023] sd 19:0:0:1: [sdg] Attached SCSI disk
[Fri Apr 28 09:11:34 2023] loop1: detected capacity change from 0 to 20971520
lvdisplay on the target host now shows the relevant logical volumes:
--- Logical volume ---
LV Path /dev/debian-lvm-test-vg/root
LV Name root
VG Name debian-lvm-test-vg
LV UUID 6eefp1-29P7-zrCw-MkjU-Ag4Z-wfSi-jghaUM
LV Write Access read/write
LV Creation host, time debian-lvm-test, 2023-04-28 08:56:35 +0000
LV Status available
# open 0
LV Size 8.54 GiB
Current LE 2187
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 1024
Block device 254:0
--- Logical volume ---
LV Path /dev/debian-lvm-test-vg/swap_1
LV Name swap_1
VG Name debian-lvm-test-vg
LV UUID 7FG2OY-rA73-feHT-ZbO3-cepy-c9br-TUcyEB
LV Write Access read/write
LV Creation host, time debian-lvm-test, 2023-04-28 08:56:35 +0000
LV Status available
# open 0
LV Size 976.00 MiB
Current LE 244
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 1024
Block device 254:1
dmsetup ls on host 1 also shows the volumes:
debian--lvm--test--vg-root (254:0)
debian--lvm--test--vg-swap_1 (254:1)
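As a side note (my suggestion, not part of the original report), whether the host actually auto-activated the guest VG can be checked with the standard LVM reporting tools:
# The fifth character of lv_attr shows the state; 'a' means the LV is active on this host
lvs -o lv_name,vg_name,lv_attr
vgs -o vg_name,lv_count,vg_attr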
Relevant kernel log on source host 2 as the volume is disconnected. These errors continue at ~30 s intervals after the volume is disconnected.
[Fri Apr 28 09:11:48 2023] sd 16:0:0:1: [sdd] Synchronizing SCSI cache
[Fri Apr 28 09:11:48 2023] Buffer I/O error on dev dm-0, logical block 2239472, async page read
[Fri Apr 28 09:11:48 2023] Buffer I/O error on dev dm-0, logical block 2239472, async page read
[Fri Apr 28 09:11:48 2023] Buffer I/O error on dev dm-1, logical block 249840, async page read
[Fri Apr 28 09:11:48 2023] Buffer I/O error on dev dm-1, logical block 249840, async page read
[Fri Apr 28 09:11:48 2023] Buffer I/O error on dev dm-0, logical block 2239472, async page read
[Fri Apr 28 09:11:48 2023] Buffer I/O error on dev dm-0, logical block 2239472, async page read
[Fri Apr 28 09:11:48 2023] Buffer I/O error on dev dm-1, logical block 249840, async page read
[Fri Apr 28 09:11:48 2023] Buffer I/O error on dev dm-1, logical block 249840, async page read
[Fri Apr 28 09:11:48 2023] Buffer I/O error on dev dm-0, logical block 2239472, async page read
[Fri Apr 28 09:11:48 2023] Buffer I/O error on dev dm-0, logical block 2239472, async page read
lvdisplay on the source host shows the logical volumes are no longer present, but dmsetup ls still shows references to them:
2:/home/rancher # lvdisplay
2:/home/rancher # dmsetup ls
debian--lvm--test--vg-root (254:0)
debian--lvm--test--vg-swap_1 (254:1)
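As an aside (not from the original report), the leftover device-mapper entries shown above can typically be removed by hand on the source host once the backing device is gone; a minimal cleanup sketch using those names:
# Remove the stale device-mapper entries left behind by the detached volume
dmsetup remove debian--lvm--test--vg-root
dmsetup remove debian--lvm--test--vg-swap_1
# Verify nothing remains
dmsetup ls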
To Reproduce
- Install Harvester 1.1.1
- Install a Debian 11 VM from ISO, configured with LVM volumes
- After installation, migrate the VM to a different host and observe the errors in the kernel log on the source host
Debian is not a requirement - we encounter this with every VM that uses LVM, including vendor appliances where we cannot avoid LVM because of the vendor's requirements.
Expected behavior
The volume should be detached cleanly from the source host, without leaving dangling LVM references behind when the volume contains LVM volumes.
Support bundle
supportbundle_fe9a860e-fe6b-441b-a237-6ed80bf6f605_2023-04-28T09-47-14Z.zip
Environment
- Harvester ISO version: 1.1.1
- Lenovo servers with SATA SSDs. Applies to all tested hardware
Hi @oivindoh, yes, @WebberHuang1118 is correct; let me give a fuller explanation. First, let's reproduce it.
It looks like the VG is not active on the node yet (because nothing has triggered activation). Then try to migrate:
4. Migrate to node-2 and check the disk on node-2.
Now the VG is active on node-2, so we migrate back:
5. Migrate to node-1 and check the node-2 kernel log during the migration.
You can see the same I/O errors as in your situation.
Why do the I/O errors occur? The Longhorn CSI driver detaches the volume from node-2 because it has to be attached to node-1 for the migration, as you know. Once the base disk (the Longhorn volume) is detached - which means the PV is gone - any I/O against the upper-level devices produces these errors.
The same pattern can be found in your support bundle: the kernel log shows the I/O errors right after the iSCSI logout, i.e. after Longhorn detaches the volume.
The correct fix would be for something to deactivate the LV/VG on the host before Longhorn detaches the volume. But because the LVM configuration lives inside the VM, finding something on the host to do that deactivation is another story. A rough sketch of that deactivation is shown below.
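For illustration only (not part of the original comment), the deactivation on the host would look roughly like this, using the VG name from the lvdisplay output above; it would have to run before Longhorn logs out of the iSCSI target:
# Deactivate every LV in the guest VG so the host drops its dm mappings
vgchange -an debian-lvm-test-vg
# The mappings should no longer appear here
dmsetup ls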
We still have a workaround for it: do not allow the VG to be activated automatically on the host. That makes sense, because you want the LVM for the VM, not for the host. You can use the filter or global_filter settings in lvm.conf on the host to prevent the VG/LV from being activated automatically.
As a simple example, I configured it as below:
global_filter = [ "r|.*/|" ]
Then we can migrate again to check. Now the VG is not activated, and migrating back to node-1 no longer produces any I/O errors.
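For context, a minimal sketch of where that setting lives in lvm.conf; the reject-everything pattern assumes the host itself does not use LVM for its own storage (adjust the pattern if it does):
# /etc/lvm/lvm.conf on the host (sketch; assumption: the host has no LVM of its own)
devices {
    # Reject all block devices so guest VGs arriving via Longhorn/iSCSI
    # are never auto-activated on the host
    global_filter = [ "r|.*/|" ]
}
After changing lvm.conf, running vgs/lvs on the host (which should now report nothing) is an easy way to confirm the filter is in effect.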
Hi @oivindoh,
IMHO, since the LVM devices are configured inside the VM, their behavior on the nodes is effectively undefined.
The reason the dm devices are still shown on the nodes may be related to lvmetad, which automatically activates logical volumes or entire volume groups as they become available to the system. Could you provide the LVM configuration and VG configuration from the nodes? A sketch of commands for collecting that is below.
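A possible way to collect that information (my suggestion, not from the original comment) is with the standard LVM2 reporting tools; exact fields and options can vary slightly between versions:
# Effective LVM configuration, including filters and activation/lvmetad settings
lvmconfig --type current
# Volume groups and logical volumes as the node currently sees them
vgs -o vg_name,vg_attr,pv_count,lv_count
lvs -o lv_name,vg_name,lv_attr,devices
# Device-mapper entries, including any stale ones
dmsetup ls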