harvester: [BUG] Migrating (or simply stopping) VMs with LVM volumes causes Buffer I/O errors in kernel log of source host

When a VM uses LVM inside a Longhorn-backed volume, migrating the VM causes Buffer I/O errors in the kernel log on the source host.

Stopping the VM - which detaches its volume from the host - causes the same errors if the VM is subsequently started on a different host.

Here we’re migrating a VM from host 2 to host 1, watching the kernel log:

Relevant kernel log on target host 1 as the volume is attached:

[Fri Apr 28 09:11:23 2023] scsi host19: iSCSI Initiator over TCP/IP
[Fri Apr 28 09:11:23 2023] scsi 19:0:0:0: RAID              IET      Controller       0001 PQ: 0 ANSI: 5
[Fri Apr 28 09:11:23 2023] scsi 19:0:0:0: Attached scsi generic sg11 type 12
[Fri Apr 28 09:11:23 2023] scsi 19:0:0:1: Direct-Access     IET      VIRTUAL-DISK     0001 PQ: 0 ANSI: 5
[Fri Apr 28 09:11:23 2023] sd 19:0:0:1: Attached scsi generic sg12 type 0
[Fri Apr 28 09:11:23 2023] sd 19:0:0:1: Power-on or device reset occurred
[Fri Apr 28 09:11:23 2023] sd 19:0:0:1: [sdg] 20971520 512-byte logical blocks: (10.7 GB/10.0 GiB)
[Fri Apr 28 09:11:23 2023] sd 19:0:0:1: [sdg] Write Protect is off
[Fri Apr 28 09:11:23 2023] sd 19:0:0:1: [sdg] Mode Sense: 69 00 10 08
[Fri Apr 28 09:11:23 2023] sd 19:0:0:1: [sdg] Write cache: enabled, read cache: enabled, supports DPO and FUA
[Fri Apr 28 09:11:24 2023]  sdg: sdg1 sdg2 < sdg5 >
[Fri Apr 28 09:11:24 2023] sd 19:0:0:1: [sdg] Attached SCSI disk
[Fri Apr 28 09:11:34 2023] loop1: detected capacity change from 0 to 20971520

lvdisplay on the target host now shows the VM's logical volumes:

  --- Logical volume ---
  LV Path                /dev/debian-lvm-test-vg/root
  LV Name                root
  VG Name                debian-lvm-test-vg
  LV UUID                6eefp1-29P7-zrCw-MkjU-Ag4Z-wfSi-jghaUM
  LV Write Access        read/write
  LV Creation host, time debian-lvm-test, 2023-04-28 08:56:35 +0000
  LV Status              available
  # open                 0
  LV Size                8.54 GiB
  Current LE             2187
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     1024
  Block device           254:0

  --- Logical volume ---
  LV Path                /dev/debian-lvm-test-vg/swap_1
  LV Name                swap_1
  VG Name                debian-lvm-test-vg
  LV UUID                7FG2OY-rA73-feHT-ZbO3-cepy-c9br-TUcyEB
  LV Write Access        read/write
  LV Creation host, time debian-lvm-test, 2023-04-28 08:56:35 +0000
  LV Status              available
  # open                 0
  LV Size                976.00 MiB
  Current LE             244
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     1024
  Block device           254:1

dmsetup ls on host 1 also shows the volumes:

debian--lvm--test--vg-root	(254:0)
debian--lvm--test--vg-swap_1	(254:1)

Relevant kernel log on source host 2 as the volume is disconnected. These errors continue at roughly 30-second intervals after the volume is disconnected.

[Fri Apr 28 09:11:48 2023] sd 16:0:0:1: [sdd] Synchronizing SCSI cache
[Fri Apr 28 09:11:48 2023] Buffer I/O error on dev dm-0, logical block 2239472, async page read
[Fri Apr 28 09:11:48 2023] Buffer I/O error on dev dm-0, logical block 2239472, async page read
[Fri Apr 28 09:11:48 2023] Buffer I/O error on dev dm-1, logical block 249840, async page read
[Fri Apr 28 09:11:48 2023] Buffer I/O error on dev dm-1, logical block 249840, async page read
[Fri Apr 28 09:11:48 2023] Buffer I/O error on dev dm-0, logical block 2239472, async page read
[Fri Apr 28 09:11:48 2023] Buffer I/O error on dev dm-0, logical block 2239472, async page read
[Fri Apr 28 09:11:48 2023] Buffer I/O error on dev dm-1, logical block 249840, async page read
[Fri Apr 28 09:11:48 2023] Buffer I/O error on dev dm-1, logical block 249840, async page read
[Fri Apr 28 09:11:48 2023] Buffer I/O error on dev dm-0, logical block 2239472, async page read
[Fri Apr 28 09:11:48 2023] Buffer I/O error on dev dm-0, logical block 2239472, async page read

lvdisplay on the source host shows the logical volumes are no longer present, but dmsetup ls still lists device-mapper entries referencing them:

2:/home/rancher # lvdisplay
2:/home/rancher # dmsetup ls
debian--lvm--test--vg-root	(254:0)
debian--lvm--test--vg-swap_1	(254:1)
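
As a sketch of a possible manual cleanup (assuming nothing still holds the mappings open), the stale device-mapper entries can be removed by hand:

2:/home/rancher # dmsetup remove debian--lvm--test--vg-root
2:/home/rancher # dmsetup remove debian--lvm--test--vg-swap_1
2:/home/rancher # dmsetup ls

After this, dmsetup ls should no longer list the leftover entries.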

To Reproduce

  1. Install Harvester 1.1.1
  2. Install Debian 11 VM from iso, configure with LVM volumes
  3. After installation, migrate VM to different host, observe errors in kernel log on source host

Debian is not a necessity - we encounter this on every VM that uses LVM, including vendor appliances where we cannot avoid using LVM due to that vendor’s requirements.

Expected behavior

The volume should be cleanly detached from the source host, without leaving dangling LVM/device-mapper references when the volume contains LVM volumes.

Support bundle

supportbundle_fe9a860e-fe6b-441b-a237-6ed80bf6f605_2023-04-28T09-47-14Z.zip

Environment

  • Harvester ISO version: 1.1.1
  • Lenovo servers with SATA SSDs; applies to all tested hardware

About this issue

  • State: open
  • Created a year ago
  • Reactions: 1
  • Comments: 17 (9 by maintainers)

Most upvoted comments

Hi @oivindoh, yes, @WebberHuang1118 is correct, and I will try to give a more detailed explanation. Let's reproduce it first.

  1. Create a VM (I use Ubuntu 20.04). Assume it is scheduled on node-1.
  2. Create LVM inside it and check the status (a sketch of the commands used is shown after the lvdisplay output below):
root@vm001:~# lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0         7:0    0 91.9M  1 loop /snap/lxd/24061
loop1         7:1    0 63.3M  1 loop /snap/core20/1852
loop2         7:2    0 53.2M  1 loop /snap/snapd/18933
vda         254:0    0   10G  0 disk
├─vda1      254:1    0  9.9G  0 part /
├─vda14     254:14   0    4M  0 part
└─vda15     254:15   0  106M  0 part /boot/efi
vdb         254:16   0    5G  0 disk
└─vg01-lv01 253:0    0    2G  0 lvm
vdc         254:32   0    1M  0 disk
root@vm001:~# lvdisplay
  --- Logical volume ---
  LV Path                /dev/vg01/lv01
  LV Name                lv01
  VG Name                vg01
  LV UUID                lYUhhe-WWhh-3OWq-f8eD-foLa-qyw5-xcaXM9
  LV Write Access        read/write
  LV Creation host, time vm001, 2023-05-05 03:21:58 +0000
  LV Status              available
  # open                 0
  LV Size                2.00 GiB
  Current LE             512
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:0
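
For reference, the in-guest commands used to create this layout would look roughly like the following sketch (the /dev/vdb device and the 2G size are taken from the lsblk output above):

root@vm001:~# pvcreate /dev/vdb
root@vm001:~# vgcreate vg01 /dev/vdb
root@vm001:~# lvcreate -n lv01 -L 2G vg01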

  3. Check on node-1:
harvester-node-1:~ # lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
loop0     7:0    0     3G  1 loop /
loop1     7:1    0     5G  0 loop
loop2     7:2    0    10G  0 loop
sda       8:0    0     5G  0 disk
.
.
.

It looks like the vg is not activated on node-1 (because nothing has triggered activation). Then we try migrating.

  4. Migrate to node-2 and check the disks on node-2:

harvester-node-2:~ # lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
loop0         7:0    0     3G  1 loop /
sda           8:0    0    10G  0 disk
├─sda1        8:1    0   9.9G  0 part
├─sda14       8:14   0     4M  0 part
└─sda15       8:15   0   106M  0 part
sdb           8:16   0     5G  0 disk
└─vg01-lv01 254:0    0     2G  0 lvm

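To double-check from LVM's point of view, lvs with the lv_active reporting field should show the LV as active on node-2 (a sketch, not part of the original output):

harvester-node-2:~ # lvs -o lv_name,vg_name,lv_active
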
Now the vg is active on node-2, so we migrate back to node-1.

  5. Migrate to node-1 and check the node-2 kernel log during the migration:

[59319.203763] sd 3:0:0:1: [sdb] Synchronizing SCSI cache
[59319.284350] sd 2:0:0:1: [sda] Synchronizing SCSI cache
[59319.293832] Buffer I/O error on dev dm-0, logical block 524272, async page read
[59319.295263] Buffer I/O error on dev dm-0, logical block 524272, async page read
[59319.334743] Buffer I/O error on dev dm-0, logical block 524272, async page read
[59319.336380] Buffer I/O error on dev dm-0, logical block 524272, async page read
[59319.380665] Buffer I/O error on dev dm-0, logical block 524272, async page read
[59319.382107] Buffer I/O error on dev dm-0, logical block 524272, async page read
[59323.635766] Buffer I/O error on dev dm-0, logical block 524272, async page read
[59323.648009] Buffer I/O error on dev dm-0, logical block 524272, async page read
[59323.660245] Buffer I/O error on dev dm-0, logical block 524272, async page read

You can see the I/O errors are the same as in your situation.

Why does the I/O error occur? The Longhorn CSI driver detaches its volume on node-2 because the volume is being attached to node-1 (due to the migration, as you know). Once the base disk (the Longhorn volume) is detached - which means your PV is gone - any I/O from the upper layers causes these I/O errors.
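
One way to see this on the node is that the device-mapper tables still reference the SCSI disk that has gone away; a sketch of how to inspect it (not taken from the support bundle):

harvester-node-2:~ # dmsetup table
harvester-node-2:~ # dmsetup info -c

dmsetup table shows the mappings still pointing at the major:minor numbers of the detached disk, and dmsetup info -c shows their open counts.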

In your support bundle, we can find the following log.

May 03 17:59:34 2 systemd[1]: Starting Open-iSCSI...
May 03 17:59:34 2 systemd[1]: Started Open-iSCSI.
May 03 17:59:34 2 iscsid[10156]: iscsid: Connection1:0 to [target: iqn.2019-10.io.longhorn:pvc-af728fb2-7c3d-4ac2-8f0c-8db96f441450, portal: 10.52.2.77,3260] through [iface: default] is operational now
May 03 18:00:56 2 iscsid[10156]: iscsid: Connection1:0 to [target: iqn.2019-10.io.longhorn:pvc-af728fb2-7c3d-4ac2-8f0c-8db96f441450, portal: 10.52.2.77,3260] through [iface: default] is shutdown.

Then the kernel log shows the I/O errors after the iSCSI logout (meaning Longhorn has detached this volume).

May 03 18:00:56 2 kernel: sd 16:0:0:1: [sdd] Synchronizing SCSI cache
May 03 18:00:56 2 kernel: Buffer I/O error on dev dm-0, logical block 2239472, async page read
May 03 18:00:56 2 kernel: Buffer I/O error on dev dm-0, logical block 2239472, async page read
May 03 18:00:56 2 kernel: Buffer I/O error on dev dm-1, logical block 249840, async page read
May 03 18:00:56 2 kernel: Buffer I/O error on dev dm-1, logical block 249840, async page read
May 03 18:00:57 2 kernel: Buffer I/O error on dev dm-0, logical block 2239472, async page read
May 03 18:00:57 2 kernel: Buffer I/O error on dev dm-0, logical block 2239472, async page read
May 03 18:00:57 2 kernel: Buffer I/O error on dev dm-1, logical block 249840, async page read
May 03 18:00:57 2 kernel: Buffer I/O error on dev dm-1, logical block 249840, async page read
May 03 18:00:57 2 kernel: Buffer I/O error on dev dm-0, logical block 2239472, async page read
May 03 18:00:57 2 kernel: Buffer I/O error on dev dm-0, logical block 2239472, async page read

The correct way is for something to deactivate this lv/vg on the host before Longhorn detaches the volume. But because you configure LVM inside the VM, finding something on the host to deactivate it is another story.
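
In other words, if something on the host deactivated the vg before the Longhorn volume was detached, the mappings would be torn down cleanly; a sketch, using the vg name from the reproduction above:

harvester-node-2:~ # vgchange -an vg01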

We still have a solution for it: do not allow the vg to be activated automatically on the host. That makes sense because you want the LVM inside the VM, not on the host. You can use configuration options like filter or global_filter in lvm.conf on the host to prevent the vg/lv from being activated automatically.

As a simple example, I configure it as below:

global_filter = [ "r|.*/|" ]

Then we could migrate again to see.

harvester-node-2:~ # lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
loop0     7:0    0     3G  1 loop /
loop1     7:1    0    10G  0 loop
loop2     7:2    0     5G  0 loop
sda       8:0    0    10G  0 disk
├─sda1    8:1    0   9.9G  0 part
├─sda14   8:14   0     4M  0 part
└─sda15   8:15   0   106M  0 part
sdc       8:32   0     5G  0 disk
.
.
.

Now the vg is not activated. We can migrate it back to node-1 without seeing any I/O errors.
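
For context, global_filter lives in the devices section of /etc/lvm/lvm.conf on the host, so the change above would look roughly like this (a sketch showing only the relevant section):

devices {
        global_filter = [ "r|.*/|" ]
}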

Hi @oivindoh,

IMHO, since the LVM devices are configured inside the VM, the LVM behavior on the nodes is not well defined.

The reason the dm devices are still shown on the nodes may be related to lvmetad, which automatically activates logical volumes or entire volume groups as they become available to the system. Could you provide the LVM configuration and VG configuration on the nodes?
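
For anyone collecting that information, something along these lines should cover it (a sketch using standard LVM reporting tools):

harvester-node-2:~ # lvmconfig --type current
harvester-node-2:~ # vgs
harvester-node-2:~ # pvs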