zephyr: Bluetooth: Deadlock with TX of ACL data and HCI commands (command blocked by data)
Describe the bug Performing GATT write commands from a peripheral and a central as fast as possible from the main loop while the peer central is performing security procedure to bond, causes the following assertion fail on both the peripheral and central device
In peripheral:
*** Booting Zephyr OS build v2.3.0-rc1-368-g8fecd4b32d9d ***
Bluetooth initialized
[00:00:00.007,507] <inf> fs_nvs: 6 Sectors of 4096 bytes
[00:00:00.007,507] <inf> fs_nvs: alloc wra: 0, ff0
[00:00:00.007,507] <inf> fs_nvs: data wra: 0, 0
[00:00:00.008,941] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
[00:00:00.008,941] <inf> bt_hci_core: HW Variant: nRF52x (0x0002)
[00:00:00.008,941] <inf> bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 2.3 Build 0
[00:00:00.009,216] <inf> bt_hci_core: No ID address. App must call settings_load()
[00:00:00.031,829] <inf> bt_hci_core: Identity: c1:ab:66:13:11:2e (random)
[00:00:00.031,860] <inf> bt_hci_core: HCI: version 5.2 (0x0b) revision 0x0000, manufacturer 0x05f1
[00:00:00.031,860] <inf> bt_hci_core: LMP: version 5.2 (0x0b) subver 0xffff
Advertising successfully started
Connected: ea:15:23:d0:ac:03 (random)
MTU: 23
Exchange pending...Write: count= 0, len= 0, rate= 0 bps.
MTU exchange successful (247)
ASSERTION FAIL [err == 0] @ WEST_TOPDIR/zephyr/subsys/bluetooth/host/hci_core.c:370
k_sem_take failed with err -11
[00:00:27.460,693] <err> os: r0/a1: 0x00000003 r1/a2: 0x00000001 r2/a3: 0x00000001
[00:00:27.460,693] <err> os: r3/a4: 0x00024c28 r12/ip: 0x00000000 r14/lr: 0x00007a01
[00:00:27.460,693] <err> os: xpsr: 0x61000000
[00:00:27.460,723] <err> os: Faulting instruction address (r15/pc): 0x00007a0c
[00:00:27.460,723] <err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
[00:00:27.460,723] <err> os: Current thread: 0x200014f0 (unknown)
[00:00:27.807,037] <err> os: Halting system
In central_hr:
*** Booting Zephyr OS build v2.3.0-rc1-368-g8fecd4b32d9d ***
Bluetooth initialized
Scanning successfully started
[DEVICE]: 22:52:48:f8:a5:91 (random), AD evt type 3, AD data len 31, RSSI -45
[DEVICE]: 63:47:03:75:bc:8f (random), AD evt type 0, AD data len 29, RSSI -41
[AD]: 1 data_len 1
[AD]: 3 data_len 6
Connected: 63:47:03:75:bc:8f (random)
MTU: 23
Exchange pending...[00:00:00.006,042] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)MTU exchange successful (247)
[00:00:00.006,042] <inf> bt_hci_core: HW Variant: nRF52x (0x0002)
[00:00:00.006,042] <inf> bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 2.3 Build 0
[00:00:00.006,683] <inf> bt_hci_core: Identity: ea:15:23:d0:ac:03 (random)
[00:00:00.006,683] <inf> bt_hci_core: HCI: version 5.2 (0x0b) revision 0x0000, manufacturer 0x05f1
[00:00:00.006,683] <inf> bt_hci_core: LMP: version 5.2 (0x0b) subver 0xffff
[00:00:00.688,507] <err> bt_conn: Unable to allocate TX context
ASSERTION FAIL [err == 0] @ WEST_TOPDIR/zephyr/subsys/bluetooth/host/hci_core.c:370
k_sem_take failed with err -11
[00:00:11.095,764] <err> os: r0/a1: 0x00000003 r1/a2: 0x00000001 r2/a3: 0x00000001
[00:00:11.095,764] <err> os: r3/a4: 0x0002142a r12/ip: 0x00000000 r14/lr: 0x0000666d
[00:00:11.095,764] <err> os: xpsr: 0x61000000
[00:00:11.095,764] <err> os: Faulting instruction address (r15/pc): 0x00006678
[00:00:11.095,794] <err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
[00:00:11.095,794] <err> os: Current thread: 0x2000117c (unknown)
[00:00:11.702,239] <err> os: Halting system
To Reproduce Use the branch: https://github.com/cvinayak/zephyr/commit/a2289565363a210e33b8956eac46df55e19a790a
Steps to reproduce the behavior: Build and flash peripheral
- mkdir -p build/peripheral; cd build/peripheral
- cmake -GNinja -DBOARD=nrf52dk_nrf52832 …/…/samples/bluetooth/peripheral
- ninja
- ninja flash
Build and flash central_hr 5. mkdir -p build/central_hr; cd build/central_hr 6. cmake -GNinja -DBOARD=nrf52dk_nrf52832 …/…/samples/bluetooth/central_hr 7. ninja 8. ninja flash
- Open two terminals,
minicom -D /dev/ttyACMx - Observe both peripheral and central_hr fail after connection
Expected behavior peripheral and central_hr should connect, transfer write commands, perform SMP pairing, be encrypted and continue to transder write commands.
Impact showstopper
Environment (please complete the following information):
- OS: Linux
- Toolchain: Zephyr SDK 0.11.2
- Commit SHA or Version used: https://github.com/cvinayak/zephyr/commit/a2289565363a210e33b8956eac46df55e19a790a
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 18 (11 by maintainers)
Anylsis: Symmetrical deadlock in both devices in the connection between TX and RX thread. TX thread has filled up the all the buffers in the controller and has blocked waiting for num complete events. Controller has filled up all it’s receive buffers and the RX thread is delivering ACL packets as fast as it can. The RX thread receives an HCI event (phy update complete) and blocks on sending an HCI command waiting for the TX thread to send the command. The TX thread will not be able to proceed until a num complete arrives on both sides. The RX thread will not be able to proceed until TX thread is unblocked on both sides. Receive buffer is full on both sides so no new packets can be transmitted, meaning no new completes will happen.
@xudongzheng That behavior is documented here: https://github.com/zephyrproject-rtos/zephyr/blob/main/include/bluetooth/gatt.h#L927L930
That is however not the same as is happening in this case, you haven’t encountered this issue.