zephyr: Crash when using SPI bluetooth controller with IRQ pin attached to port expander
Describe the bug When using an SPI-attached bluetooth controller (e.g. BlueNRG-MS) whose IRQ pin is attached to a port expander, deadlocks and subsequent crashes occur. This is because both the blutetooth stack and the port expander driver rely on the system work queue.
The sequence of events is as follows:
- Bluetooth stack submits a work item to the system work queue (
grep -Rw k_work_submit subsys/bluetooth/) - This work item calls
bt_hci_cmd_send_sync, which blocks onsync_sem: https://github.com/zephyrproject-rtos/zephyr/blob/9f46db90c832e56363c5d7bb42892651b39b271d/subsys/bluetooth/host/hci_core.c#L329-L330 sync_semis given throughbt_spi_rx_thread->bt_recv-> … ->hci_cmd_complete->hci_cmd_donewhen an interrupt is received on the SPI interface- In order to dispatch pin interrupt callbacks, the port expander driver schedules a work item in the system work queue (
grep k_work_submit drivers/gpio/*.c). - Now, the system work queue is blocked (in
bt_hci_cmd_send_sync), waiting for the next item in the system work queue (the interrupt callback) to be run.
Example backtrace of the blocking work item (which eventually errors):
...
#5 0x0803031e in _oops () at $ZEPHYR/arch/arm/core/aarch32/swap_helper.S:570
#6 <signal handler called>
#7 0x08034c60 in bt_hci_cmd_send_sync (opcode=opcode@entry=8202, buf=buf@entry=0x2004ec08 <_net_buf_hci_cmd_pool+32>, rsp=rsp@entry=0x0) at $ZEPHYR/subsys/bluetooth/host/hci_core.c:330
#8 0x08037fd2 in bt_le_adv_set_enable_legacy (adv=adv@entry=0x20020a48 <bt_dev+24>, enable=<optimized out>) at $ZEPHYR/subsys/bluetooth/host/adv.c:336
#9 0x0803800e in bt_le_adv_set_enable (adv=adv@entry=0x20020a48 <bt_dev+24>, enable=enable@entry=true) at $ZEPHYR/subsys/bluetooth/host/adv.c:386
#10 0x080386ec in bt_le_adv_resume () at $ZEPHYR/subsys/bluetooth/host/adv.c:1513
#11 0x08039b6a in bt_conn_unref (conn=conn@entry=0x20027b28 <acl_conns>) at $ZEPHYR/subsys/bluetooth/host/conn.c:1267
#12 0x0803aaa2 in deferred_work (work=0x20027b88 <acl_conns+96>) at $ZEPHYR/subsys/bluetooth/host/conn.c:1734
#13 0x080aaec4 in work_queue_main (workq_ptr=0x2002c6e0 <k_sys_work_q>, p2=<optimized out>, p3=<optimized out>) at $ZEPHYR/kernel/work.c:668
#14 0x0801c38a in z_thread_entry (entry=0x80aadf5 <work_queue_main>, p1=0x2002c6e0 <k_sys_work_q>, p2=0x0, p3=0x0) at $ZEPHYR/lib/os/thread_entry.c:36
#15 0xaaaaaaaa in ?? ()
Example backtrace of a successful run of the interrupt callback, triggering bt_spi_rx_thread (in the deadlock case, this work item is stuck in the work queue):
#0 bt_spi_isr (unused1=0x80b5f44 <__device_dts_ord_35>, unused2=0x20038970 <gpio_cb>, unused3=1) at $ZEPHYR/drivers/bluetooth/hci/spi.c:165
#1 0x0806cff4 in gpio_fire_callbacks (list=list@entry=0x2002204c <mcp230xx_0_drvdata+28>, port=port@entry=0x80b5f44 <__device_dts_ord_35>, pins=1) at $ZEPHYR/include/zephyr/drivers/gpio/gpio_utils.h:78
#2 0x0806db6a in mcp23xxx_work_handler (work=0x20022064 <mcp230xx_0_drvdata+52>) at $ZEPHYR/drivers/gpio/gpio_mcp23xxx.c:446
#3 0x080aaec4 in work_queue_main (workq_ptr=0x2002c6e0 <k_sys_work_q>, p2=<optimized out>, p3=<optimized out>) at $ZEPHYR/kernel/work.c:668
#4 0x0801c38a in z_thread_entry (entry=0x80aadf5 <work_queue_main>, p1=0x2002c6e0 <k_sys_work_q>, p2=0x0, p3=0x0) at $ZEPHYR/lib/os/thread_entry.c:36
#5 0xaaaaaaaa in ?? ()
Example backtrace of the bt_spi_rx_thread unblocking the first work item:
#0 hci_cmd_complete (buf=0x2004ebe8 <_net_buf_hci_cmd_pool>) at $ZEPHYR/subsys/bluetooth/host/hci_core.c:2101
#1 0x0803335c in handle_event_common (event=event@entry=14 '\016', buf=buf@entry=0x2004ebe8 <_net_buf_hci_cmd_pool>, handlers=handlers@entry=0x80c42b8 <prio_events>, num_handlers=num_handlers@entry=5) at $ZEPHYR/subsys/bluetooth/host/hci_core.c:164
#2 0x08034148 in handle_event (event=<optimized out>, buf=buf@entry=0x2004ebe8 <_net_buf_hci_cmd_pool>, handlers=handlers@entry=0x80c42b8 <prio_events>, num_handlers=num_handlers@entry=5) at $ZEPHYR/subsys/bluetooth/host/hci_core.c:176
#3 0x08036284 in hci_event_prio (buf=buf@entry=0x2004ebe8 <_net_buf_hci_cmd_pool>) at $ZEPHYR/subsys/bluetooth/host/hci_core.c:3475
#4 0x0803635e in bt_recv (buf=buf@entry=0x2004ebe8 <_net_buf_hci_cmd_pool>) at $ZEPHYR/subsys/bluetooth/host/hci_core.c:3525
#5 0x08076760 in bt_spi_rx_thread () at $ZEPHYR/drivers/bluetooth/hci/spi.c:381
#6 0x0801c38a in z_thread_entry (entry=0x80764fd <bt_spi_rx_thread>, p1=0x0, p2=0x0, p3=0x0) at $ZEPHYR/lib/os/thread_entry.c:36
#7 0xaaaaaaaa in ?? ()
To Reproduce Steps to reproduce the behavior: None so far, but I can try to create a reproducer if necessary (though it would still require specific hardware)
Expected behavior No deadlocks occur, bluetooth works as expected.
Impact Bluetooth is unusable on boards where a BlueNRG-MS’s IRQ pin is connected through a port expander.
Logs and console output Eventually crashes with an oops:
ASSERTION FAIL [err == 0] @ WEST_TOPDIR/zephyr/subsys/bluetooth/host/hci_core.c:330
k_sem_take failed with err -11
[00:01:10.319,000] <err> os: r0/a1: 0x00000003 r1/a2: 0x00000002 r2/a3: 0x00000000
[00:01:10.319,000] <err> os: r3/a4: 0x00000000 r12/ip: 0x000abadb r14/lr: 0x08034c55
[00:01:10.319,000] <err> os: xpsr: 0x41000000
[00:01:10.319,000] <err> os: s[ 0]: 0x00000000 s[ 1]: 0x00000000 s[ 2]: 0x00000000 s[ 3]: 0x00000000
[00:01:10.319,000] <err> os: s[ 4]: 0x00000000 s[ 5]: 0x00000000 s[ 6]: 0x00000000 s[ 7]: 0x00000000
[00:01:10.320,000] <err> os: s[ 8]: 0x00000000 s[ 9]: 0x00000000 s[10]: 0x00000000 s[11]: 0x00000000
[00:01:10.320,000] <err> os: s[12]: 0x00000000 s[13]: 0x00000000 s[14]: 0x00000000 s[15]: 0x00000000
[00:01:10.320,000] <err> os: fpscr: 0x00000000
[00:01:10.320,000] <err> os: Faulting instruction address (r15/pc): 0x08034c60
[00:01:10.320,000] <err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
[00:01:10.320,000] <err> os: Current thread: 0x2002c6e0 (sysworkq)
[00:01:10.320,000] <err> os: Halting system
Environment (please complete the following information):
- OS: Linux
- Toolchain: Zephyr SDK 0.15.1
- Commit SHA or Version used: v3.3.0
Additional context
I have found some past issues related to the use of the system work queue by the bluetooth stack:
About this issue
- Original URL
- State: open
- Created a year ago
- Comments: 19 (15 by maintainers)
It does mean that the issue will effectively be closed as wontfix.
Can’t really do that if the hardware already exists.