freebsd-wifibox: wlan0 interface inside VM no longer comes up after a few times doing suspend/resume
Hello,
I started to experiment with wifibox yesterday and it works pretty well upon a fresh boot. Once I start suspend/resuming (around 2-3 times), the internal “wlan0” interface in the VM no longer comes back up. I’ve done a few experiments to see what’s going on. I don’t think this is a bug with wifibox, but rather either bhyve’s PCI passthrough code (and FLR). Restarting ‘vmm’ has no effect once my machine enters this state. Only a full reboot fixes it. I also did make sure that the devd suspend/resume workaround was being triggered by adding a “logger” statement and checking /var/log/messages afterwards. Reporting here for more exposure in this project. We could escalate it up to FreeBSD as needed.
Running On (but probably will also happen on 14-CURRENT):
jon@leslie:~ $ uname -a
FreeBSD leslie 13.1-STABLE FreeBSD 13.1-STABLE #0 stable/13-n251926-488f9d85278: Mon Jul 25 21:01:52 EDT 2022 root@leslie:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64
root@leslie:~ # pkg info | grep wifibox
wifibox-1.1.1 Wireless card driver via virtualized Linux
wifibox-alpine-20220712 Wifibox guest based on Alpine Linux
wifibox-core-0.10.0 Wifibox core functionality
This is the card information if I were to use the ‘if_iwlwifi’ module:
Jul 26 10:28:16 leslie kernel: Intel(R) Wireless WiFi based driver for FreeBSD
Jul 26 10:28:16 leslie kernel: iwlwifi0: <iwlwifi> mem 0xea238000-0xea23bfff at device 20.3 on pci0
Jul 26 10:28:16 leslie kernel: iwlwifi0: could not load firmware image 'iwlwifi-QuZ-a0-jf-b0-72.ucode'
Jul 26 10:28:16 leslie kernel: iwlwifi0: File size way too small!
Jul 26 10:28:16 leslie kernel: iwlwifi0: successfully loaded firmware image 'iwlwifi-QuZ-a0-jf-b0-71.ucode'
Jul 26 10:28:16 leslie kernel: iwlwifi0: api flags index 2 larger than supported by driver
Jul 26 10:28:16 leslie kernel: iwlwifi0: TLV_FW_FSEQ_VERSION: FSEQ Version: 89.3.35.37
Jul 26 10:28:16 leslie kernel: iwlwifi0: loaded firmware version 71.058653f6.0 QuZ-a0-jf-b0-71.ucode op_mode iwlmvm
Jul 26 10:28:16 leslie kernel: iwlwifi0: Detected Intel(R) Wireless-AC 9560 160MHz, REV=0x351
Jul 26 10:28:42 leslie kernel: iwlwifi0: Detected RF JF, rfid=0x105110
Jul 26 10:28:42 leslie kernel: iwlwifi0: base HW address: f8:e4:e3:eb:35:02
I’ve noticed that there is a higher chance for this situation to happen once I start getting FLR transaction messages:
Jul 26 11:16:08 leslie kernel: bridge0: Ethernet address: 58:9c:fc:10:ff:99
Jul 26 11:16:08 leslie kernel: bridge0: changing name to 'wifibox0'
Jul 26 11:16:08 leslie kernel: tap0: Ethernet address: 58:9c:fc:10:c0:3f
Jul 26 11:16:08 leslie kernel: tap0: promiscuous mode enabled
Jul 26 11:16:08 leslie kernel: wifibox0: link state changed to DOWN
Jul 26 11:16:08 leslie kernel: ppt0 mem 0xea238000-0xea23bfff at device 20.3 on pci0
Jul 26 11:16:08 leslie kernel: tap0: link state changed to UP
Jul 26 11:16:08 leslie kernel: wifibox0: link state changed to UP
Jul 26 11:16:08 leslie kernel: pci0:0:20:3: Resetting with transactions pending after 50 ms
Jul 26 11:16:08 leslie kernel: pci0:0:20:3: Transactions pending after FLR!
Host results (At this point this layer is correct but doesn’t matter because the wlan0 inside the VM is gone.)
root@leslie:~ # ifconfig
em0: flags=8862<BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=481049b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LRO,VLAN_HWFILTER,NOMAP>
ether f8:75:a4:ef:9d:81
media: Ethernet autoselect
status: no carrier
nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
inet6 ::1 prefixlen 128
inet6 fe80::1%lo0 prefixlen 64 scopeid 0x2
inet 127.0.0.1 netmask 0xff000000
groups: lo
nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
wifibox0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
ether 58:9c:fc:10:ff:99
inet 10.1.0.2 netmask 0xffffff00 broadcast 10.1.0.255
id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
member: tap0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
ifmaxaddr 0 port 4 priority 128 path cost 2000000
groups: bridge
nd6 options=9<PERFORMNUD,IFDISABLED>
tap0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=80000<LINKSTATE>
ether 58:9c:fc:10:c0:3f
groups: tap
media: Ethernet autoselect
status: active
nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
Opened by PID 14824
VM networking results upon a bad resume
wifibox:~# ifconfig
eth0 Link encap:Ethernet HWaddr 00:A0:98:8A:05:71
inet addr:10.1.0.1 Bcast:0.0.0.0 Mask:255.255.255.0
inet6 addr: fe80::2a0:98ff:fe8a:571/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:30 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 B) TX bytes:2428 (2.3 KiB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
Attempting to bring up the wlan0 interface in the VM.
wifibox:~# service networking restart
* WARNING: you are stopping a boot service
* WARNING: you are stopping a boot service
* Stopping Unix Domain Socket pass-through ...
[ ok ]
* Stopping busybox udhcpd ...
[ ok ]
* Stopping networking ...
* lo ...
[ ok ]
* eth0 ...
[ ok ]
* Starting networking ...
* lo ...
[ ok ]
* eth0 ...
[ ok ]
* wlan0 ...
ip: SIOCGIFFLAGS: No such device
udhcpc: SIOCGIFINDEX: No such device
ifup: failed to change interface wlan0 state to 'up'
[ !! ]
wifibox:~# * Starting Unix Domain Socket pass-through ...
* Starting busybox udhcpd ...
[ ok ]
[ ok ]
Check if ppt0 is attached to the pci device after we started wifibox (As we can see, it is and that’s correct).
root@leslie:~ # pciconf -lv
ppt0@pci0:0:20:3: class=0x028000 rev=0x00 hdr=0x00 vendor=0x8086 device=0x02f0 subvendor=0x8086 subdevice=0x0030
vendor = 'Intel Corporation'
device = 'Comet Lake PCH-LP CNVi WiFi'
class = network
Shutting down the VM should clear the ppt driver from the wifi card. “The PPT device could not be destroyed” message always pops up even if the resume didn’t break the connection. It usually takes 2-3 times before it breaks.
root@leslie:~ # service wifibox stop
Stopping wifibox....WARNING: /var/log/wifibox.log is not writeable, messages could not be logged.
WARNING: The PPT device could not be destroyed.
...OK
root@leslie:~ # pciconf -lv
none3@pci0:0:20:3: class=0x028000 rev=0x00 hdr=0x00 vendor=0x8086 device=0x02f0 subvendor=0x8086 subdevice=0x0030
vendor = 'Intel Corporation'
device = 'Comet Lake PCH-LP CNVi WiFi'
class = network
There was also an instance where I was able to load the ‘if_iwlwifi’ module after the wlan0 broke in the VM, and was able to use wireless on the host directly. However, this isn’t guaranteed and eventually I’ll start getting “ifconfig: SIOCIFCREATE2: Device not configured” messages, preventing the wlan0 interface from even starting up.
Attached is part of my /var/log/messages file that may help:
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 50 (18 by maintainers)
Thanks to your devd edits, my ghostbsd thinkpad is reconnecting at waking from sleep.
Hi, I just bought an X1 Carbon Gen 7 and managed to configure wifibox. Only thing left was the suspend / resume issue. The only setting that works for me is the following one, based on cthe preceding omment h(ttps://github.com/pgj/freebsd-wifibox/issues/31#issuecomment-1386305573):
(/usr/local/etc/devd/wifibox.conf)
I ran a few more experiments making just the
wifibox suspendgo before the suspend notify. If I do this, the system is able to survive a few suspend/resumes but it is non deterministic. Usually within 2-5 resumes it will not be able to bring up the interface anymore. Same applies if I do the full stop and start of wifibox. I started looking a little bit at the latest code to see if I can narrow it down. At the moment I’m looking at the differences between theGandGNflags and also what was happening in the 0.10.0 release.@pgj That’s correct, HEAD here refers to
main. Primarily since there isn’t an official released port containing the new changes, I’ll want to clone it and then do my testing from that side.I’ve switched from my Thinkpad X1C7 back to my Framework Laptop (Batch 6). I was able to reproduce the same issue on this machine, and my workaround also worked as well.
Small update on the wireguard routing situation. My initial configuration was correct (regarding slightly shifting the wifibox addresses from their default of
10.0.0.0/24->10.1.0.0/24), however, I had forgotten that on my server, when I re-architected it to use jails, I locked down the hostListenAddressto192.168.1.100, so that meant the previous10.0.0.1address that is the IP for that machine on my wireguard network, was no longer listened to. I addedListenAddress 10.0.0.1in addition to the192.168.1.100and it’s all working. Wifibox has no issues communicating with any of my192.168.1.Xservices and doesn’t have issues communicating with10.0.0.Xwireguard VPN services as well 😄.I’ve opened up this ticket and emailed the freebsd-virtualization mailing list.
I’ll be experimenting with your suggestions regarding the restarting only the guest and not needing to run dhclient.
I’ve automated the suspend/resume for stopping/starting/assigning ip of wifibox and it seems to be working (I’ll need more time to test of course). I added the suspend section (not sure if the notify number is correct but it seems to be functioning correctly when doing multiple suspend/resumes):
and this is how the
/var/log/messages | grep wifiboxlooks like after a suspend/resume cycle:Notice that the
/etc/rc.suspend acpi $notifycall happens at the end of our action event. I originally tried it at the beginning but when I resumed the PCI passthrough issue immediately popped up. I’m guessing is because since I called suspend before, the system didn’t have enough time to finish stopping the wifibox VM, so we got the call, but not a clean shutdown of the VM, thus triggering my theory of the PCI devices not being successfully cleaned up when the VM is running. When I flipped it and put the/etc/rc.suspendcall at the end, I noticed that my sleep took a little bit longer but that would mean that it was just cleanly shutting down the VM before sleeping. It also didn’t matter if I ranzzzas root or as the user.Lastly, there was one time I resumed where one of the machine lights lit up but it actually never resumed (The power button still was flashing as if it was sleeping, but the F1 (Mute Key Light) was on - which only happens when the machine is powered back on). I don’t necessarily think this is related to the above changes since I’ve experienced these sort of things on different machines before. It may just be because I’m sleeping and resuming multiple times in relatively quick succession (10 seconds ~ maybe more maybe less)… but I wanted to mention this for the record just in case.
Let me know what you think 😃.
@pgj I’ve noticed (still need more testing though) a possible workaround to this issue. It seems that if I stop wifibox before I put the computer to sleep, it will allow the machine to resume and still keep the PCI interface in the VM. This makes me believe that there is some resources not cleanly being handled when a
zzzoperation is triggered with an actively running bhyve VM using a PCI passthrough device.I’ve also updated to
FreeBSD 13.1-STABLE #0 stable/13-n251972-fb8e858c69f, no major changes but just wanted to keep you up to date.This is the latest
dmesgoutput (with me suspend/resume but manually stopping the wifibox before I run zzz):messages-2.txt
These may be of interest (in a correct flow):
I’ll be testing this for a few days to see what happens and report back. I’ll probably add a little bit of automation (example, when I suspend my screen via shortcuts on i3, I’ll do a
wifibox stop && zzzcommand, and then on resume I’ll have the devfs resume event do awifibox start && dhclient wifibox0). I’ll let you know what happens.@pgj I compiled 13.1-RELEASE from source (
releng/13.1) and re-tested my steps. I was still able to reproduce it on:For a second I thought that there may have been some weird behavior happening between my Anker Dongle (HDMI/Ethernet/USB Ports/SD/Micro SD/Power all over Type C) and my Thinkpad (maybe something about going to sleep and resuming with all of that attached). However, after about 5-6 consecutive working sleep/resume, it finally dropped the
00:06.0 Class 0280: 8086:02f0slot from the VM. So it seems the same issue has been there and I don’t think there is a regression, just a bug somewhere.I’ve attached a longer
/var/log/messagesbelow (contains -STABLE and then -RELEASE logs). Things related toppt0 mem 0xea238000-0xea23bfff at device 20.3 on pci0seem interesting.messages-1.txt
@pgj Sounds good. I guess I should hope it fails for you too on STABLE and it isn’t just me 😂.
And yea, you could checkout
488f9d85278 (n251926)on thestable/13branch.