zephyr: Bluetooth: BLE stack notification blocks the host side for too long (`drivers/bluetooth/hci/spi.c` and `hci_spi`)

First, I do not claim with certainty that this is a bug; it may simply be how the BLE stack architecture works at the moment. In that case, it could be reclassified as an enhancement request if relevant. I may be expecting too much.

Preliminary context

I am using a classic dual-chip configuration:

  • Host-side: custom board (STM32H7, similar to a nucleo-h743zi), running a host-only build
    • 3 services (HID, BAS, DIS); a sketch of the notifying characteristic is given after this list
    • HID reporting rate @ 133 Hz (7.5 ms)
  • Controller-side: standard hci_spi sample project flashed on an nRF52840 add-on board, connected through SPI @ 1 MHz
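
For concreteness, the notifying characteristic has roughly the following shape on the host (a minimal sketch with assumed names such as hid_svc and report_ccc_changed, not my actual service definition):

// Minimal sketch of the notifying service (assumed names, not my actual code)
#include <bluetooth/gatt.h>
#include <bluetooth/uuid.h>

static void report_ccc_changed(const struct bt_gatt_attr *attr, uint16_t value)
{
	// Peer enabled/disabled notifications on the report characteristic
}

BT_GATT_SERVICE_DEFINE(hid_svc,
	BT_GATT_PRIMARY_SERVICE(BT_UUID_HIDS),
	BT_GATT_CHARACTERISTIC(BT_UUID_HIDS_REPORT, BT_GATT_CHRC_NOTIFY,
			       BT_GATT_PERM_NONE, NULL, NULL, NULL),
	BT_GATT_CCC(report_ccc_changed,
		    BT_GATT_PERM_READ | BT_GATT_PERM_WRITE),
);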

Describe the bug

From the host, I use a simple bt_gatt_notify_cb call to update a BLE service characteristic (say, a 17-byte HID report update). I have noticed this call blocks (OK) for an unexpectedly long period of time (NOK) of approximately 1.5 ms, degrading the performance of other threads in a non-negligible way.

It’s basically a 1.5 ms penalty every 7.5 ms, spent in a busy-wait-like fashion, starving other activity in the system. More details on my investigation below.
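
The call in question looks roughly like this (a minimal sketch building on the service sketch above; the attribute index and the names send_report / notify_done are assumptions):

static void notify_done(struct bt_conn *conn, void *user_data)
{
	// Invoked once the stack has consumed the buffer (handed it to HCI)
}

static int send_report(struct bt_conn *conn, const uint8_t report[17])
{
	struct bt_gatt_notify_params params = {
		.attr = &hid_svc.attrs[2], // report value attribute (assumed index)
		.data = report,
		.len  = 17,
		.func = notify_done,
	};

	// Expected to return almost immediately; in practice it blocks ~1.5 ms
	return bt_gatt_notify_cb(conn, &params);
}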

On the other hand, a similar system reporting through USB HID at the same rate does not suffer from the same deficiency; a reasonable 0.1 to 0.7 ms is spent.

  • What have you tried to diagnose or work around this issue?
  1. Independently of my full system, I have been able to reproduce this blocking behavior by setting up a sample test application on the host side, reporting a classic BAS notification (battery level %) periodically (every 50 ms, down to every 7.5 ms (target)) and measuring in software the time spent in a bt_gatt_notify_cb operation.
// Sample test - pseudo code (bas_attr / level are placeholders)
int32_t period_ms = 50; // stepped down to 7.5ms (target)
while (1) {
  uint32_t start = k_cycle_get_32();
  bt_gatt_notify(NULL, bas_attr, &level, sizeof(level)); // reporting BAS notification
  uint32_t elapsed_us = k_cyc_to_us_floor32(k_cycle_get_32() - start);
  printk("%u us\n", elapsed_us); // connected: ~750us <--> ~980us, not connected: ~5us
  k_msleep(period_ms - (elapsed_us / 1000));
}

The host-side prj.conf I use for the sample test is attached below (prj.conf.txt).
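
For context, a host-only build of this kind relies roughly on the following options; this is only an illustrative sketch with assumed values, not the content of the attached file:

# Illustrative sketch, not the attached prj.conf.txt
CONFIG_BT=y
CONFIG_BT_PERIPHERAL=y
CONFIG_BT_BAS=y
CONFIG_BT_DEVICE_NAME="notify-test"
# HCI over SPI towards the nRF52840 controller
CONFIG_SPI=y
CONFIG_BT_SPI=y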

  2. I have also tried to tune my BLE stack configuration using the throughput-maximization sample project, but with no effect.

  3. When debugging, I noticed that the vast majority of the time was spent “waiting” for a rescheduling operation to complete inside the BLE stack implementation, for some reason.

In a nutshell, my call stack (host):

bt_gatt_notify_cb // app waiting here
  // GATT
    // ATT
     // ACL
       // KERNEL QUEUE - k_queue_append_list
        z_reschedule(&queue->lock, key);    <<<< Waiting here for ~80% of the time
     // HCI_SPI transfer happening at some point
   
   // ... returns and continues normally up to the next notification

  // OTA transfer happens on the controller -> at the next connection interval -> OK
  // "notification sent" callback received -> OK
  4. I have not tried it yet, but I have envisioned moving to a single-chip configuration (Host + Controller using HCI RAM) running on the nRF52840 and connecting my current “Host” MCU with a custom protocol over SPI to receive raw notifications.

This way I could have a fully externalized BLE stack that (hopefully) does not impact the rest of my system, with the big downside of maintaining two firmwares instead of a single one…

Do you think it’s a valid way of working around the issue?

  • Is this a regression? If yes, have you been able to “git bisect” it to a specific commit? I don’t think it’s a regression; it is probably the way it has always worked.

Expected behavior

From a system architecture point of view, I was expecting the dual-chip configuration to work in tandem:

  • Host-side -> serializing and publishing the notification over HCI very quickly -> minimal time spent waiting
  • Controller-side -> offloading the full BLE over-the-air transfer
    • Tuned with sufficiently large HCI event queues according to the report rate/size/type (see the sketch after this list)
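
By “tuned” I mean host-side buffer/queue options of the following kind (an illustrative sketch; the values are assumptions to be sized from the actual report rate and size):

# Illustrative sketch of host-side queue sizing (values are assumptions)
CONFIG_BT_L2CAP_TX_BUF_COUNT=8
CONFIG_BT_BUF_ACL_TX_COUNT=8
CONFIG_BT_BUF_ACL_TX_SIZE=27
CONFIG_BT_CONN_TX_MAX=8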

It’s a pity to have two powerful MCUs (STM32H7 on the host + nRF52840 on the controller) and still get such poor behavior.

Impact

As stated, this poor notification performance is very annoying, as it considerably degrades full-system performance and sometimes delays time-critical operations (acquisition threads…).

Environment (please complete the following information):

  • Building on Windows 10, connecting to my device (over BLE) from the same laptop.
  • Toolchain
    • Zephyr SDK 0.14.1
    • Zephyr v3.0 (similar behavior on Zephyr v3.1)

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 29 (14 by maintainers)

Most upvoted comments

Hi, yeah we’ll try with UART. Yeah, we were using two nRF52840s, 1 host and 1 controller. We’ve successfully reproduced the issue on the stm32f303re Nucleo board (the only fast STM32 I had). Unfortunately it doesn’t have trace pins, so we’ll have to make do without a trace probe; the investigation will take a bit longer.

edit: tried with UART; we get ~170 us processing time (which is still high) at 1 Mbit/s. We will try to investigate what’s happening.
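
For anyone wanting to reproduce, switching the host HCI transport from SPI to UART (H4) is roughly a matter of the following (an illustrative sketch only; it assumes the board devicetree provides a zephyr,bt-uart chosen node for the UART wired to the controller):

# Illustrative sketch: host HCI transport over UART (H4) instead of SPI
CONFIG_BT_SPI=n
CONFIG_BT_H4=y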

Hi again,

As you had a concern about the HCI_SPI driver implementation, I investigated a bit further and realized that adding a delay (somewhat arbitrarily placed and tuned to 200 us) in the bt_spi_send function completely solves my issue.

This brings the overall bt_gatt_notify call duration down to ~15 us (instead of the 1.5 ms stated in the issue), which is in line with what I expected.

// In zephyr/drivers/bluetooth/hci/spi.c
// static int bt_spi_send(struct net_buf *buf)
// line 418
	/* Allow time for the read thread to handle interrupt */
	while (true) {
		pending = gpio_pin_get(irq_dev, GPIO_IRQ_PIN);
		if (pending <= 0) {
			break;
		}
		k_sleep(K_MSEC(1));
	}
	k_sleep(K_USEC(200)); // <<<<< Here
	k_sem_take(&sem_busy, K_FOREVER);
// line 430

It seems there is some kind of data race in there; interestingly, a simple k_yield() is not sufficient to solve the issue.

@flo-carrere

“I was referring to the STM32M4 inside”

There is no STM32M4 (STMicro SoC) in Nordic chips. I suppose you are rather referring to the Cortex-M4 CPU core.

In any case, >1 ms of processing is excessive, I totally agree. We have tests with real HW downstream (in nRF Connect SDK) that measure all kinds of things, including performance, so this comes as a surprise to me. The performance tests measure throughput, though, so it’s possible we are missing a test case; I’ll double-check.

I’m definitely suspecting either a misconfiguration or the HCI driver itself. I will try to reproduce it, but right now #25917 is my priority.