kata-containers: Use of qemu PCIe to PCI bridges on x86 causes multiple problems

Description of problem

On x86 systems, the Kata qemu backend always adds a PCI bridge. This is qemu’s pci-bridge device type, which will be a PCI-to-PCI bridge on the pc machine type and a PCIe-to-PCI bridge on q35. Hotplugged block, net and vhost-user devices always go on this bridge. Hotplugged VFIO devices also go on the bridge by default (unless hotplug_vfio_on_root_bus=true is set).
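
The knob mentioned above lives in the runtime’s configuration.toml. The following is a hedged sketch of how a deployment would opt out of bridge-based VFIO hotplug; the section name matches current Kata configs, but exact option names and defaults may differ between versions:

    [hypervisor.qemu]
    # Attach hotplugged VFIO devices to the PCIe root bus instead of the
    # pci-bridge, avoiding the bridge-related problems described below for
    # passed-through devices.
    hotplug_vfio_on_root_bus = true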

Using a bridge like this conveniently provides 32 pluggable slots and makes pc and q35 behave more similarly. However, at least with the current qemu and guest kernel, it forces use of the SHPC hotplug protocol, which has some severe drawbacks.

SHPC is designed with physical devices and human operators in mind, so it has a 5s delay built into the protocol to allow accidental plugs to be reversed. Since a 5s delay at startup isn’t acceptable, we work around this in the agent by forcing a PCI rescan (sketched after the list below), which bypasses the proper operation of SHPC and locates the device early. Unfortunately, that workaround causes other problems:

  • An SHPC hotplug can sometimes race with the rescan in a way that causes the guest kernel to misinterpret the SHPC interrupt as a request to unplug the device; the device then appears only briefly before being removed again.
  • For passed-through VFIO devices, an even more severe error can occur, which can put the device into an unusable state until the host is rebooted.
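
The rescan the agent forces is the standard Linux sysfs trigger. Below is a minimal Go sketch of the idea, not the agent’s actual code; forceRescan is a hypothetical name, and the sysfs path is the kernel’s standard rescan interface:

    package main

    import (
            "fmt"
            "os"
    )

    // forceRescan asks the guest kernel to re-enumerate the PCI bus by
    // writing "1" to the sysfs rescan trigger. This locates a hotplugged
    // device immediately, without waiting out SHPC's 5s delay, but it also
    // sidesteps SHPC's state machine, which is where the races above
    // come from.
    func forceRescan() error {
            return os.WriteFile("/sys/bus/pci/rescan", []byte("1"), 0200)
    }

    func main() {
            if err := forceRescan(); err != nil {
                    fmt.Println("rescan failed:", err)
            }
    }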

For these reasons, #683 exists to remove the rescan from the agent; however, doing so means SHPC hotplugs will incur an unacceptable delay.

For VFIO devices, the use of a PCI bridge has additional problems beyond SHPC.

  • If the passed-through device is PCI Express (essentially all likely use cases), then putting it under the bridge means the guest will see it as a plain PCI device instead. Depending on the device, this may mean that the guest can’t drive it properly.
  • Because pre-Express PCI bridges don’t preserve Requester IDs, all devices behind a PCI bridge will always be in the same IOMMU group in the guest, even if they are in separate groups on the host (see the sketch after this list). That means:
    • If the container intends to use separate VFIO devices from separate userspace drivers (e.g. DPDK), it won’t be able to.
    • Worse, if the container has both attached block and net devices (managed by the guest kernel) and VFIO devices intended for use with userspace drivers, the VFIO devices can’t be used with those drivers at all (a whole IOMMU group must belong either to the kernel or to userspace).
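
To make the grouping problem concrete, here is a minimal Go sketch that resolves a PCI device’s IOMMU group through standard sysfs in the guest; iommuGroupOf is a hypothetical helper and the BDFs are purely illustrative. Every device behind the conventional bridge will resolve to the same group:

    package main

    import (
            "fmt"
            "os"
            "path/filepath"
    )

    // iommuGroupOf returns the IOMMU group number of a PCI device, as seen
    // by the guest kernel, by resolving the sysfs iommu_group symlink.
    func iommuGroupOf(bdf string) (string, error) {
            link, err := os.Readlink(filepath.Join("/sys/bus/pci/devices", bdf, "iommu_group"))
            if err != nil {
                    return "", err
            }
            return filepath.Base(link), nil
    }

    func main() {
            // Two illustrative devices sitting behind the pci-bridge; both
            // report the same group, so they cannot be split between the
            // kernel and a userspace driver.
            for _, bdf := range []string{"0000:02:01.0", "0000:02:02.0"} {
                    group, err := iommuGroupOf(bdf)
                    if err != nil {
                            fmt.Println(bdf, "error:", err)
                            continue
                    }
                    fmt.Println(bdf, "-> IOMMU group", group)
            }
    }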

About this issue

  • State: open
  • Created 3 years ago
  • Comments: 18 (14 by maintainers)

Most upvoted comments

@dgibson @fidencio @devimc I am sorry, but I cannot make it to today’s architecture meeting. In any case, everything being discussed here still needs research before it can be considered.

@marcel-apf @fidencio for your attention.