kvm-guest-drivers-windows: netkvm with lots of parallel network activity hanging Win10 VM
HangCallStacks-Full.txt Scenario summary:
Ubuntu 22.04 running Oracle Virtual Box with a Win10 VM running underneath. 4 cores & 16 GB RAM allocated to the VM, 32 GB total in the machine. We believe via crash dump analysis that the NetKVM / VirtIO drivers are hanging up spinlocked & waiting for virtqueue_get_buf_split.
Details (sorry, this is long - I’m summarizing as best I can):
The VM is running our own code that previously ran in Win7 and runs in Win10 stand alone fine, now we are migrating portions (eventually all of it) over to Linux but for now are in a hybrid environment. The hardware emulated NICs weren’t giving the performance we wanted as this main system has 2 x 10G fiber interfaces and when we tried NetKVM / VirtIO drivers things performed sooooo much better so that is where we went.
With ~30 devices on the network every now and then we would occasionally see some kind of VM hang up. VirtualBox would say:
VMMDev: vmmDevHeartbeatFlatlinedTimer: Guest seems to be unresponsive. Last heartbeat received 4 seconds ago. We thought it was a virtualbox or file system problem as we saw the hang when 30-50 devices were all trying to pull files from the main server.
On a system with ~200+ devices on the network it is hanging up more often and now not when they are pulling files but when the system is “idle”. I quote “Idle” because even when idle the server is constantly querying every device every few seconds with a ping and or a ‘How are you doing today?’ message, devices are constantly saying what their status is via multicast messages that are picked up by any interested listening device and most are listening to the multicast messages.
Thus even an idle system while not a high GB/s amount of network traffic is a lot of messages per unit time, a quick sniff and I counted > 350 packets per second.
With the hang happening more often I’ve instrumented the Win10 VM to be able to generate a crash dump on NMI.
Hang: VM seems running but is completely unresponsive to any keyboard or mouse input, the clock has stopped updating on the bottom right of the Win10 screen, network stack does not respond to any pings, we see no data egressing from the VM when checked with wireshark. The VM does respond to the NMI, so I’ve got that going for me.
A lot of hoops and spelunking later I managed to find a symbol file for the version of netkvm we were running, get it recognized by WinDBG, and can finally show call stacks for the 4 VCPUs with actual function names and line numbers VS what I was seeing.
Here is the call stack that received the NMI - I filtered out all the not interesting hex stuff - so the first 8 stack entries are all related to the NMI and generating the crash dump. Right under that it is in netkvm:
0: kd> !thread
THREAD fffff80369727a00 Cid 0000.0000 Teb: 0000000000000000 Win32Thread: 0000000000000000 RUNNING on processor 0
Not impersonating
DeviceMap ffffb20b8ee36540
Owning Process fffff80369724a00 Image: Idle
Attached Process ffffde034267d040 Image: System
Wait Start TickCount 128224 Ticks: 334059 (0:01:26:59.671)
Context Switch Count 528260 IdealProcessor: 0
UserTime 00:00:00.000
KernelTime 00:29:47.765
Win32 Start Address nt!KiIdleLoop (0xfffff80368e01f10)
Stack Init fffff8036c874c90 Current fffff8036c874c20
Base fffff8036c875000 Limit fffff8036c86f000 Call 0000000000000000
Priority 0 BasePriority 0 PriorityDecrement 0 IoPriority 0 PagePriority 5
: Call Site
: nt!KeBugCheckEx
: nt!HalBugCheckSystem+0x7a
: PSHED!PshedBugCheckSystem+0x10
: nt!WheaReportHwError+0x46e
: nt!HalHandleNMI+0x142
: nt!KiProcessNMI+0x134
: nt!KxNmiInterrupt+0x82
: nt!KiNmiInterrupt+0x212 (TrapFrame @ fffff803`6c897e50)
: netkvm!virtqueue_get_buf_split+0x9d [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\VirtIO\VirtIORing.c @ 325]
: netkvm!virtqueue_get_buf+0x5f (Inline Function @ fffff803`6ea01c26) [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\Virtio\virtio.h @ 87]
: netkvm!CVirtQueue::GetBuf+0x63 (Inline Function @ fffff803`6ea01c26) [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis-VirtQueue.h @ 208]
: netkvm!CTXVirtQueue::ReleaseTransmitBuffers+0xae [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_VirtQueue.cpp @ 294]
: netkvm!CTXVirtQueue::ProcessTXCompletions+0x38 [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_VirtQueue.cpp @ 327]
: netkvm!CParaNdisTX::DoPendingTasks::__l2::<lambda_2bb8197fb2982050d71283ed06b8d199>::operator()+0x10 (Inline Function @ fffff803`6e9fee61) [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_TX.cpp @ 725]
: netkvm!CParaNdisTX::DoWithTXLock+0x1a (Inline Function @ fffff803`6e9fee61) [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis-TX.h @ 237]
: netkvm!CParaNdisTX::DoPendingTasks+0x75 [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_TX.cpp @ 723]
: netkvm!CParaNdisTX::NBLMappingDone+0x3f [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_TX.cpp @ 432]
: netkvm!CNBL::RegisterMappedNB+0x27 [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_TX.cpp @ 93]
: nt!HalpAllocateAdapterCallbackV3+0xa9
: nt!HalpBuildScatterGatherList+0x1af
: nt!HalBuildScatterGatherListV3+0x73
: ndis!NdisMAllocateNetBufferSGList+0x299
: netkvm!CNB::ScheduleBuildSGListForTx+0x2f [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_TX.cpp @ 784]
: netkvm!CNBL::StartMapping::__l2::<lambda_3029174f739c49f49e8fda0852d91fd3>::operator()+0x8 (Inline Function @ fffff803`6ea004d8) [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_TX.cpp @ 285]
: netkvm!CNdisList<CNB,CRawAccess,CNonCountingObject>::ForEachPrepareIf+0x1e (Inline Function @ fffff803`6ea004d8) [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis-Util.h @ 465]
: netkvm!CNdisList<CNB,CRawAccess,CNonCountingObject>::ForEach+0x1e (Inline Function @ fffff803`6ea004d8) [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis-Util.h @ 446]
: netkvm!CNBL::StartMapping+0x48 [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_TX.cpp @ 283]
: netkvm!CParaNdisTX::Send+0x13c [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_TX.cpp @ 411]
: netkvm!ParaNdis6_SendNetBufferLists+0x83 [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\wlh\ParaNdis6_Driver.cpp @ 460]
: ndis!ndisMSendNBLToMiniportInternal+0xff
: ndis!ndisMSendNBLToMiniport+0xe
: ndis!ndisCallSendHandler+0x59
: ndis!ndisInvokeNextSendHandler+0x191
: ndis!NdisSendNetBufferLists+0x17d
: tcpip!FlpSendPacketsHelper+0x2bd
: tcpip!Fl48pSendArpPacket+0xcb
: tcpip!Fl48SendNeighborSolicitation+0x22
: tcpip!Ipv4pSendNeighborSolicitation+0x60
: tcpip!IppSendNeighborSolicitation+0xf3
: tcpip!IppNeighborSetTimeout+0x3c4
The other 3 call stacks all look similar:
1: kd> !thread
THREAD ffffc901089cb140 Cid 0000.0000 Teb: 0000000000000000 Win32Thread: 0000000000000000 RUNNING on processor 1
Not impersonating
DeviceMap ffffb20b8ee36540
Owning Process fffff80369724a00 Image: Idle
Attached Process ffffde034267d040 Image: System
Wait Start TickCount 125774 Ticks: 336509 (0:01:27:37.953)
Context Switch Count 497827 IdealProcessor: 1
UserTime 00:00:00.000
KernelTime 00:29:53.109
Win32 Start Address nt!KiIdleLoop (0xfffff80368e01f10)
Stack Init ffffb00fd4e29c90 Current ffffb00fd4e29c20
Base ffffb00fd4e2a000 Limit ffffb00fd4e24000 Call 0000000000000000
Priority 0 BasePriority 0 PriorityDecrement 0 IoPriority 0 PagePriority 0
Child-SP RetAddr : Args to Child : Call Site
: nt!KxWaitForSpinLockAndAcquire+0x2f
: nt!KeAcquireSpinLockAtDpcLevel+0x5c
: netkvm!CNdisSpinLock::LockDPR+0xa (Inline Function @ fffff803`6e9fee51) [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis-Util.h @ 137]
: netkvm!CDPCSpinLockedContext::{ctor}+0xa (Inline Function @ fffff803`6e9fee51) [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis-Util.h @ 226]
: netkvm!CParaNdisTX::DoWithTXLock+0xa (Inline Function @ fffff803`6e9fee51) [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis-TX.h @ 236]
: netkvm!CParaNdisTX::DoPendingTasks+0x65 [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_TX.cpp @ 723]
: netkvm!ParaNdis_RXTXDPCWorkBody+0xd1 [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_Common.cpp @ 1811]
: netkvm!MiniportInterruptDPC+0x5f [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\wlh\ParaNdis6_Impl.cpp @ 380]
: ndis!ndisInterruptDpc+0x3c58a
: nt!KiExecuteAllDpcs+0x30e
: nt!KiRetireDpcList+0x1f4
: nt!KiIdleLoop+0x9e
I’m attaching the full kernel mode call stack, included the above as a teaser. I don’t have a complete memory dump of a hang as that necessitates increasing the pagefile size by a non-trivial amount and that could make the bug disappear, so I’m being as minimally invasive as possible right now.
This all happened with the Red Hat NDIS Miniport Driver version 100.92.104.22900 (as reported by the netkvm.sys file properties in Windows 10) which I believe corresponds to version 229 of the driver and associated source / executable files I found here:
https://fedorapeople.org/groups/virt/virtio-win/direct-downloads/archive-virtio/
I saw that 229 dates back to when I was looking to get better performance and see there is a v248 driver out now. I took both code bases and did some side by side comparisons and while I didn’t see any differences in VirtIORing.c I did see some in ParaNdis_TX.cpp nearby to where we hang but didn’t drill into the code in detail. There were enough differences I’m forking a thread attempting to reproduce the problem with the v248 driver from the same site and while it hasn’t hung up yet it hasn’t been cooking for very long and this is quite an evasive intermittent issue.
I thought another thread might be to detail what I’ve seen here on the off chance some hang condition has already been detected and fixed and I just couldn’t find such a scenario in the various bugs I’ve perused through.
I’ll post back if I get another hang.
Thank you! David Soussan
About this issue
- Original URL
- State: open
- Created 4 months ago
- Comments: 22
Switching the Win10 VM to use the ICH9 chipset instead of the PIIX3 and putting the VirtIO / netkvm NIC back in place, checking the properties of the NIC - I don’t even see a resources tab anymore.
I’m guessing that means it is using MSI interrupts?
In this configuration I started up a stress test to see what happens. Stay tuned, more to come…