freebsd-wifibox: wlan0 interface inside VM no longer comes up after a few times doing suspend/resume

Hello,

I started to experiment with wifibox yesterday and it works pretty well upon a fresh boot. Once I start suspend/resuming (around 2-3 times), the internal “wlan0” interface in the VM no longer comes back up. I’ve done a few experiments to see what’s going on. I don’t think this is a bug with wifibox, but rather either bhyve’s PCI passthrough code (and FLR). Restarting ‘vmm’ has no effect once my machine enters this state. Only a full reboot fixes it. I also did make sure that the devd suspend/resume workaround was being triggered by adding a “logger” statement and checking /var/log/messages afterwards. Reporting here for more exposure in this project. We could escalate it up to FreeBSD as needed.

Running On (but probably will also happen on 14-CURRENT):

jon@leslie:~ $ uname -a
FreeBSD leslie 13.1-STABLE FreeBSD 13.1-STABLE #0 stable/13-n251926-488f9d85278: Mon Jul 25 21:01:52 EDT 2022     root@leslie:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64

root@leslie:~ # pkg info | grep wifibox
wifibox-1.1.1                  Wireless card driver via virtualized Linux
wifibox-alpine-20220712        Wifibox guest based on Alpine Linux
wifibox-core-0.10.0            Wifibox core functionality

This is the card information if I were to use the ‘if_iwlwifi’ module:

Jul 26 10:28:16 leslie kernel: Intel(R) Wireless WiFi based driver for FreeBSD
Jul 26 10:28:16 leslie kernel: iwlwifi0: <iwlwifi> mem 0xea238000-0xea23bfff at device 20.3 on pci0
Jul 26 10:28:16 leslie kernel: iwlwifi0: could not load firmware image 'iwlwifi-QuZ-a0-jf-b0-72.ucode'
Jul 26 10:28:16 leslie kernel: iwlwifi0: File size way too small!
Jul 26 10:28:16 leslie kernel: iwlwifi0: successfully loaded firmware image 'iwlwifi-QuZ-a0-jf-b0-71.ucode'
Jul 26 10:28:16 leslie kernel: iwlwifi0: api flags index 2 larger than supported by driver
Jul 26 10:28:16 leslie kernel: iwlwifi0: TLV_FW_FSEQ_VERSION: FSEQ Version: 89.3.35.37
Jul 26 10:28:16 leslie kernel: iwlwifi0: loaded firmware version 71.058653f6.0 QuZ-a0-jf-b0-71.ucode op_mode iwlmvm
Jul 26 10:28:16 leslie kernel: iwlwifi0: Detected Intel(R) Wireless-AC 9560 160MHz, REV=0x351
Jul 26 10:28:42 leslie kernel: iwlwifi0: Detected RF JF, rfid=0x105110
Jul 26 10:28:42 leslie kernel: iwlwifi0: base HW address: f8:e4:e3:eb:35:02

I’ve noticed that there is a higher chance for this situation to happen once I start getting FLR transaction messages:

Jul 26 11:16:08 leslie kernel: bridge0: Ethernet address: 58:9c:fc:10:ff:99
Jul 26 11:16:08 leslie kernel: bridge0: changing name to 'wifibox0'
Jul 26 11:16:08 leslie kernel: tap0: Ethernet address: 58:9c:fc:10:c0:3f
Jul 26 11:16:08 leslie kernel: tap0: promiscuous mode enabled
Jul 26 11:16:08 leslie kernel: wifibox0: link state changed to DOWN
Jul 26 11:16:08 leslie kernel: ppt0 mem 0xea238000-0xea23bfff at device 20.3 on pci0
Jul 26 11:16:08 leslie kernel: tap0: link state changed to UP
Jul 26 11:16:08 leslie kernel: wifibox0: link state changed to UP
Jul 26 11:16:08 leslie kernel: pci0:0:20:3: Resetting with transactions pending after 50 ms
Jul 26 11:16:08 leslie kernel: pci0:0:20:3: Transactions pending after FLR!

Host results (At this point this layer is correct but doesn’t matter because the wlan0 inside the VM is gone.)

root@leslie:~ # ifconfig
em0: flags=8862<BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=481049b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LRO,VLAN_HWFILTER,NOMAP>
	ether f8:75:a4:ef:9d:81
	media: Ethernet autoselect
	status: no carrier
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
	options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
	inet6 ::1 prefixlen 128
	inet6 fe80::1%lo0 prefixlen 64 scopeid 0x2
	inet 127.0.0.1 netmask 0xff000000
	groups: lo
	nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
wifibox0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
	ether 58:9c:fc:10:ff:99
	inet 10.1.0.2 netmask 0xffffff00 broadcast 10.1.0.255
	id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
	maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
	root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
	member: tap0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
	        ifmaxaddr 0 port 4 priority 128 path cost 2000000
	groups: bridge
	nd6 options=9<PERFORMNUD,IFDISABLED>
tap0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=80000<LINKSTATE>
	ether 58:9c:fc:10:c0:3f
	groups: tap
	media: Ethernet autoselect
	status: active
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
	Opened by PID 14824

VM networking results upon a bad resume

wifibox:~# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:A0:98:8A:05:71  
          inet addr:10.1.0.1  Bcast:0.0.0.0  Mask:255.255.255.0
          inet6 addr: fe80::2a0:98ff:fe8a:571/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:30 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:2428 (2.3 KiB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

Attempting to bring up the wlan0 interface in the VM.

wifibox:~# service networking restart
 * WARNING: you are stopping a boot service
 * WARNING: you are stopping a boot service
 * Stopping Unix Domain Socket pass-through ...
 [ ok ]
 * Stopping busybox udhcpd ...
 [ ok ]
 * Stopping networking ...
 *   lo ...
 [ ok ]
 *   eth0 ...
 [ ok ]
 * Starting networking ...
 *   lo ...
 [ ok ]
 *   eth0 ...
 [ ok ]
 *   wlan0 ...
ip: SIOCGIFFLAGS: No such device
udhcpc: SIOCGIFINDEX: No such device
ifup: failed to change interface wlan0 state to 'up'
 [ !! ]
wifibox:~#  * Starting Unix Domain Socket pass-through ...
 * Starting busybox udhcpd ...
 [ ok ]
 [ ok ]

Check if ppt0 is attached to the pci device after we started wifibox (As we can see, it is and that’s correct).

root@leslie:~ # pciconf -lv
ppt0@pci0:0:20:3:       class=0x028000 rev=0x00 hdr=0x00 vendor=0x8086 device=0x02f0 subvendor=0x8086 subdevice=0x0030
    vendor     = 'Intel Corporation'
    device     = 'Comet Lake PCH-LP CNVi WiFi'
    class      = network

Shutting down the VM should clear the ppt driver from the wifi card. “The PPT device could not be destroyed” message always pops up even if the resume didn’t break the connection. It usually takes 2-3 times before it breaks.

root@leslie:~ # service wifibox stop
Stopping wifibox....WARNING: /var/log/wifibox.log is not writeable, messages could not be logged.
WARNING: The PPT device could not be destroyed.
...OK

root@leslie:~ # pciconf -lv
none3@pci0:0:20:3:      class=0x028000 rev=0x00 hdr=0x00 vendor=0x8086 device=0x02f0 subvendor=0x8086 subdevice=0x0030
    vendor     = 'Intel Corporation'
    device     = 'Comet Lake PCH-LP CNVi WiFi'
    class      = network

There was also an instance where I was able to load the ‘if_iwlwifi’ module after the wlan0 broke in the VM, and was able to use wireless on the host directly. However, this isn’t guaranteed and eventually I’ll start getting “ifconfig: SIOCIFCREATE2: Device not configured” messages, preventing the wlan0 interface from even starting up.

Attached is part of my /var/log/messages file that may help:

report.log

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 50 (18 by maintainers)

Most upvoted comments

Hi, I just bought an X1 Carbon Gen 7 and managed to configure wifibox. Only thing left was the suspend / resume issue. The only setting that works for me is the following one, based on cthe preceding omment h(ttps://github.com//issues/31#issuecomment-1386305573):

(/usr/local/etc/devd/wifibox.conf)

notify 11 {
		match "system" "ACPI";
		match "subsystem" "Suspend";
		action "logger 'Stopping wifibox before suspend' && /usr/local/sbin/wifibox stop guest && /etc/rc.suspend acpi $notify";
};

notify 11 {
		match "system" "ACPI";
		match "subsystem" "Resume";
		action "/etc/rc.resume acpi $notify && logger 'Starting wifibox after resume and getting IP via DHCP' && /sbin/kldunload vmm && /sbin/kldload vmm && /usr/local/sbin/wifibox start guest && /sbin/dhclient wifibox0";
};

Thanks to your devd edits, my ghostbsd thinkpad is reconnecting at waking from sleep.

Hi, I just bought an X1 Carbon Gen 7 and managed to configure wifibox. Only thing left was the suspend / resume issue. The only setting that works for me is the following one, based on cthe preceding omment h(ttps://github.com/pgj/freebsd-wifibox/issues/31#issuecomment-1386305573):

(/usr/local/etc/devd/wifibox.conf)

notify 11 {
		match "system" "ACPI";
		match "subsystem" "Suspend";
		action "logger 'Stopping wifibox before suspend' && /usr/local/sbin/wifibox stop guest && /etc/rc.suspend acpi $notify";
};

notify 11 {
		match "system" "ACPI";
		match "subsystem" "Resume";
		action "/etc/rc.resume acpi $notify && logger 'Starting wifibox after resume and getting IP via DHCP' && /sbin/kldunload vmm && /sbin/kldload vmm && /usr/local/sbin/wifibox start guest && /sbin/dhclient wifibox0";
};

I ran a few more experiments making just the wifibox suspend go before the suspend notify. If I do this, the system is able to survive a few suspend/resumes but it is non deterministic. Usually within 2-5 resumes it will not be able to bring up the interface anymore. Same applies if I do the full stop and start of wifibox. I started looking a little bit at the latest code to see if I can narrow it down. At the moment I’m looking at the differences between the G and GN flags and also what was happening in the 0.10.0 release.

@pgj That’s correct, HEAD here refers to main. Primarily since there isn’t an official released port containing the new changes, I’ll want to clone it and then do my testing from that side.

I’ve switched from my Thinkpad X1C7 back to my Framework Laptop (Batch 6). I was able to reproduce the same issue on this machine, and my workaround also worked as well.

Small update on the wireguard routing situation. My initial configuration was correct (regarding slightly shifting the wifibox addresses from their default of 10.0.0.0/24 -> 10.1.0.0/24), however, I had forgotten that on my server, when I re-architected it to use jails, I locked down the host ListenAddress to 192.168.1.100, so that meant the previous 10.0.0.1 address that is the IP for that machine on my wireguard network, was no longer listened to. I added ListenAddress 10.0.0.1 in addition to the 192.168.1.100 and it’s all working. Wifibox has no issues communicating with any of my 192.168.1.X services and doesn’t have issues communicating with 10.0.0.X wireguard VPN services as well 😄.

I’ve opened up this ticket and emailed the freebsd-virtualization mailing list.

I’ll be experimenting with your suggestions regarding the restarting only the guest and not needing to run dhclient.

I’ve automated the suspend/resume for stopping/starting/assigning ip of wifibox and it seems to be working (I’ll need more time to test of course). I added the suspend section (not sure if the notify number is correct but it seems to be functioning correctly when doing multiple suspend/resumes):

root@leslie:~ # cat /usr/local/etc/devd/wifibox.conf

# This is a `devd(8)` configuration file to run the resume action of
# wifibox on the ACPI resume event.  Review the contents and create a
# copy of it without the `.sample` extension to use it.  Restart the
# `devd` service once the file has been created.

notify 11 {
        match "system"          "ACPI";
        match "subsystem"       "Suspend";
        action "logger 'Stopping wifibox before suspend' && /usr/local/sbin/wifibox stop && /etc/rc.suspend acpi $notify";
};

notify 11 {
        match "system"          "ACPI";
        match "subsystem"       "Resume";
        action "/etc/rc.resume acpi $notify && logger 'Starting wifibox after resume and getting IP via DHCP' && /usr/local/sbin/wifibox start && /sbin/dhclient wifibox0";
};

and this is how the /var/log/messages | grep wifibox looks like after a suspend/resume cycle:

Jul 28 14:14:08 leslie root[3910]: Stopping wifibox before suspend
Jul 28 14:14:11 leslie kernel: wifibox0: link state changed to DOWN
Jul 28 14:14:12 leslie dhclient[3177]: receive_packet failed on wifibox0: Device not configured
Jul 28 14:14:12 leslie dhclient[3177]: ioctl(SIOCGIFFLAGS) on wifibox0: Operation not permitted
Jul 28 14:14:12 leslie dhclient[3177]: Interface wifibox0 no longer appears valid.
Jul 28 14:14:38 leslie root[4098]: Starting wifibox after resume and getting IP via DHCP
Jul 28 14:14:38 leslie kernel: bridge0: changing name to 'wifibox0'
Jul 28 14:14:38 leslie kernel: wifibox0: link state changed to DOWN
Jul 28 14:14:38 leslie kernel: wifibox0: link state changed to UP
Jul 28 14:14:55 leslie dhclient[4412]: New IP Address (wifibox0): 10.1.0.2
Jul 28 14:14:55 leslie dhclient[4416]: New Subnet Mask (wifibox0): 255.255.255.0
Jul 28 14:14:55 leslie dhclient[4420]: New Broadcast Address (wifibox0): 10.1.0.255
Jul 28 14:14:55 leslie dhclient[4424]: New Routers (wifibox0): 10.1.0.1

Notice that the /etc/rc.suspend acpi $notify call happens at the end of our action event. I originally tried it at the beginning but when I resumed the PCI passthrough issue immediately popped up. I’m guessing is because since I called suspend before, the system didn’t have enough time to finish stopping the wifibox VM, so we got the call, but not a clean shutdown of the VM, thus triggering my theory of the PCI devices not being successfully cleaned up when the VM is running. When I flipped it and put the /etc/rc.suspend call at the end, I noticed that my sleep took a little bit longer but that would mean that it was just cleanly shutting down the VM before sleeping. It also didn’t matter if I ran zzz as root or as the user.

Lastly, there was one time I resumed where one of the machine lights lit up but it actually never resumed (The power button still was flashing as if it was sleeping, but the F1 (Mute Key Light) was on - which only happens when the machine is powered back on). I don’t necessarily think this is related to the above changes since I’ve experienced these sort of things on different machines before. It may just be because I’m sleeping and resuming multiple times in relatively quick succession (10 seconds ~ maybe more maybe less)… but I wanted to mention this for the record just in case.

Let me know what you think 😃.

@pgj I’ve noticed (still need more testing though) a possible workaround to this issue. It seems that if I stop wifibox before I put the computer to sleep, it will allow the machine to resume and still keep the PCI interface in the VM. This makes me believe that there is some resources not cleanly being handled when a zzz operation is triggered with an actively running bhyve VM using a PCI passthrough device.

I’ve also updated to FreeBSD 13.1-STABLE #0 stable/13-n251972-fb8e858c69f, no major changes but just wanted to keep you up to date.

This is the latest dmesg output (with me suspend/resume but manually stopping the wifibox before I run zzz):

messages-2.txt

These may be of interest (in a correct flow):

bridge0: Ethernet address: 58:9c:fc:10:ff:99
bridge0: changing name to 'wifibox0'
tap0: Ethernet address: 58:9c:fc:10:c0:3f
tap0: promiscuous mode enabled
wifibox0: link state changed to DOWN
ppt0 mem 0xea238000-0xea23bfff at device 20.3 on pci0
tap0: link state changed to UP
wifibox0: link state changed to UP
lo0: link state changed to UP


...

tap0: link state changed to DOWN
wifibox0: link state changed to DOWN
ppt0: detached
pci0: <network> at device 20.3 (no driver attached)
ue0: link state changed to UP
tap0: promiscuous mode disabled

...

ppt0: detached
pci0: <network> at device 20.3 (no driver attached)
ppt0 mem 0xea238000-0xea23bfff at device 20.3 on pci0
bridge0: Ethernet address: 58:9c:fc:10:ff:99
bridge0: changing name to 'wifibox0'
tap0: Ethernet address: 58:9c:fc:10:c0:3f
tap0: promiscuous mode enabled
wifibox0: link state changed to DOWN
tap0: link state changed to UP
wifibox0: link state changed to UP
tap0: link state changed to DOWN
wifibox0: link state changed to DOWN
ppt0: detached
pci0: <network> at device 20.3 (no driver attached)
ppt0 mem 0xea238000-0xea23bfff at device 20.3 on pci0
tap0: link state changed to UP
wifibox0: link state changed to UP
tap0: link state changed to DOWN
wifibox0: link state changed to DOWN
ppt0: detached
pci0: <network> at device 20.3 (no driver attached)
tap0: promiscuous mode disabled

I’ll be testing this for a few days to see what happens and report back. I’ll probably add a little bit of automation (example, when I suspend my screen via shortcuts on i3, I’ll do a wifibox stop && zzz command, and then on resume I’ll have the devfs resume event do a wifibox start && dhclient wifibox0). I’ll let you know what happens.

@pgj I compiled 13.1-RELEASE from source (releng/13.1) and re-tested my steps. I was still able to reproduce it on:

FreeBSD 13.1-RELEASE releng/13.1-n250148-fc952ac2212 GENERIC amd64

For a second I thought that there may have been some weird behavior happening between my Anker Dongle (HDMI/Ethernet/USB Ports/SD/Micro SD/Power all over Type C) and my Thinkpad (maybe something about going to sleep and resuming with all of that attached). However, after about 5-6 consecutive working sleep/resume, it finally dropped the 00:06.0 Class 0280: 8086:02f0 slot from the VM. So it seems the same issue has been there and I don’t think there is a regression, just a bug somewhere.

I’ve attached a longer /var/log/messages below (contains -STABLE and then -RELEASE logs). Things related to ppt0 mem 0xea238000-0xea23bfff at device 20.3 on pci0 seem interesting.

messages-1.txt

@pgj Sounds good. I guess I should hope it fails for you too on STABLE and it isn’t just me 😂.

And yea, you could checkout 488f9d85278 (n251926) on the stable/13 branch.