esp-idf: BT Controller - Stops Scanning or responding after random amount of time (IDFGH-1781)
Brief
I have been having a problem with the Bluedroid BT Controller Scanning function for a few weeks now, and after trying many different things, I am stuck and am not sure what else I can try. The crux of the issue is that the BLE Scanning feature will work for a large amount of time - up to 3 days, then just fail silently, with the whole BT controller seemingly shutting down.
Problem Description
I have a BLE Scanning app that is working well for the most part. It spends most of its time performing an active scan for other BLE devices that are advertising a service UUID and some custom manufacturer data. It receives an advertisement from a sensor around every 1 second, but I can have anywhere from 1-10 sensors within range at any one time.
After a completely random period of time, sometimes 20 minutes, sometimes 3 days. The App will stop receiving ESP_GAP_SEARCH_INQ_RES_EVT events from the bt layer, even though it should still be receiving advertisements form multiple devices, with no indication from any underlying BT Controller debugging that anything has happened. This happens no matter how many sensors I have within range of the ESP, advertising the device, it even happens when I have no sensors advertising, and the general BVLE background advertisements are relatively low.
The free heap memory of the app stays the same (~140kB free memory at any one time), so I can rule out a memory leak on the app side, and the rest of the application keeps running normally, albeit with more computation time from the RTOS (indicated by a loop counter that increases when this error happens), So clearly some of the BT Tasks have stopped running.
When the error happens, I can also see that the ESP itself DOES STOP performing Active scanning, as The sensors I use flash an LED whenever they receive a SCAN_REQUEST from the ESP32 Hardware MAC Address, and this stops happening as soon as the error starts.
If I try and recover from the error, by issuing a command such as esp_ble_gap_start_scanning() - which responds ESP_OK, I get a HCI timeout error printed: BT_HCI: command_timed_out hci layer timeout waiting for response to a command. opcode: 0x200c.
At the moment, trying to perform a bt command after the error, and getting this response, is the only indication from the application that something has gone wrong.
I am not using any WiFi functions, so to reduce memory footprint and file size, I have changed the linker script to only include the following libraries in the component.mk of esp32:
core rtc phy instead of the usual: core rtc net80211 pp wpa smartconfig coexist wps wpa2 espnow phy mesh
Coredump
coredump This is a coredump taken about 20 minutes after the error occured. I forced this core dump to log by deliberately throwing an IntegerDivideByZero Exception in another task. My hope here is that it saved the task state of the BT tasks, which your team can use internally to see the task state. If you require My APP ELF I can provide this by email.
Debug Log
This is a log showing the lack of errors I receive when the error happens. As you can see, the application was running for 2.5 days before the error occured. The ‘BMS’ TAG is my application, and the ‘Scanning started’ and ‘scanning stopped’ logs are when my app receives the ESP_GAP_SEARCH_INQ_CMPL_EVT and ESP_GAP_BLE_SCAN_START_COMPLETE_EVT events respectively. In this application, I start a esp_ble_gap_start_scanning operation of 30 seconds, and when I receive a ESP_GAP_SEARCH_INQ_CMPL_EVT event, i set a flag to restart the esp_ble_gap_start_scanning of 30 seconds, in a cycle. Although as discussed later, I have tried changing this interval to anywhere from 30 seconds to 5 minutes, and I have also tried setting the interval to 0 for unlimited, so I only call the start_scan once.
In this instance, my pplication received the ESP_GAP_SEARCH_INQ_CMPL_EVT event, so set a flag internally to call esp_ble_gap_start_scanning(30) again, which responded with ESP_OK, but I never received the ESP_GAP_BLE_SCAN_START_COMPLETE_EVT, and about 8 seconds later, I see an error log of command timeout.
I (210356607) BMS[scanning]: Scanning Stopped
I (210356607) BMS[scanning]: Scanning started
I (210363327) main: HEAP - free: 140560, largest_block: 98108 | PSRAM - free: 3451316, used: 202360, attempted: 981230 | pps: 21, lps: 248
I (210373357) main: HEAP - free: 140560, largest_block: 98108 | PSRAM - free: 3451316, used: 202360, attempted: 981230 | pps: 20, lps: 249
I (210383367) main: HEAP - free: 140560, largest_block: 98108 | PSRAM - free: 3451316, used: 202360, attempted: 981230 | pps: 19, lps: 247
I (210386607) BMS[scanning]: Scanning Stopped
I (210386617) BMS[scanning]: Scanning started
I (210393387) main: HEAP - free: 140560, largest_block: 98108 | PSRAM - free: 3451316, used: 202360, attempted: 981230 | pps: 21, lps: 248
I (210403407) main: HEAP - free: 140560, largest_block: 98108 | PSRAM - free: 3451316, used: 202360, attempted: 981230 | pps: 19, lps: 248
I (210413457) main: HEAP - free: 140560, largest_block: 98108 | PSRAM - free: 3451316, used: 202360, attempted: 981230 | pps: 19, lps: 249
I (210416617) BMS[scanning]: Scanning Stopped
I (210423487) main: HEAP - free: 140560, largest_block: 98108 | PSRAM - free: 3451220, used: 202452, attempted: 981230 | pps: 17, lps: 250
E (210424617) BT_HCI: command_timed_out hci layer timeout waiting for response to a command. opcode: 0x200c
I (210433497) main: HEAP - free: 140560, largest_block: 98108 | PSRAM - free: 3451220, used: 202452, attempted: 981230 | pps: 25, lps: 250
sdkconfig
Scanning Configuration Used
These are the configurations currently in use, but as you’ll see below I have tried many different
static esp_ble_scan_params_t logging_ble_scan_params = {
.scan_type = BLE_SCAN_TYPE_ACTIVE,
.own_addr_type = BLE_ADDR_TYPE_PUBLIC,
.scan_filter_policy = BLE_SCAN_FILTER_ALLOW_ALL,
.scan_interval = 0x100,
.scan_window = 0x100,
.scan_duplicate = BLE_SCAN_DUPLICATE_DISABLE
};
Changes Attempted
Below is a list of sdkconfig changes of application setup/operation changes that I have tried, with no success , the same thing occurs.
- Unlimited scanning timeout - when done this way, I never even get an indication of hci_layer_timeout, because i never try to run another esp_ble_gap_start_scanning call, so the bt controller will fail silently
- Turning CONFIG_BLUEDROID_MEM_DEBUG ON - There is no debugging information around the time the error happens
- CONFIG_BT_BLE_DYNAMIC_ENV_MEMORY ON/OFF
- CONFIG_BT_ALLOCATION_FROM_SPIRAM_FIRST ON/OFF
- CONFIG_BLE_HOST_QUEUE_CONGESTION_CHECK ON/OFF
- Scanning interval/window change to many different values: 0x100/0x50, 0x200/0x50, 0x1000/0x100, and a few more.
- 180 second scanning timeout - This seems to make the error happen quicker, although that may be placebo as I only tested a few times.
- Increase CONFIG_BTC_TASK_STACK_SIZE and CONFIG_BTU_TASK_STACK_SIZE
- Not calling esp_bt_controller_mem_release(ESP_BT_MODE_CLASSIC_BT) (which I usually call before enabling anything.
- Enabling/disbling WiFi coexist (I am not using WiFi function at all
If there is anything else I should try, please let me know.
Apologies for the large Github issue, this error has been troubling me for some time and I would like to know what I can try next. Thank you.
Environment
| Key | Value |
|---|---|
| Development Kit | Custom Board |
| Module or chip used | ESP32-WROVER-32 |
| IDF version | 91f29bef172a082cbd8f0208ed1757ede0e1d635 - tracking release/v3.3 |
| Build System | Make |
| Compiler version | crosstool-ng-1.22.0-80-g6c4433a |
| Operating System | macOS |
| Power Supply | external 3.3V |
About this issue
- Original URL
- State: open
- Created 5 years ago
- Comments: 61 (19 by maintainers)
We are using ble scan.
On Fri, Jan 7, 2022, 12:48 Juan Ávila @.***> wrote:
Same here, even switched to entirely different SoC for a large-volume product because of this.
We’ve noticed this same issue on multiple versions of the IDF (v3.3.3, v4.1, v4.2, v4.2.1). BLE scanning and BLE server advertising will stop at intermittent times and require a power cycle to fix. Sometimes this happens after an hour, sometimes after a few days. We’ve implemented a daily reboot and have attempted to implement a system where a reboot will happen if the scan event hasn’t fired for awhile. We’ve tried many different configurations, but nothing has worked to resolve the issue. It seems to happen more frequently when a lot of BLE devices are around with scanning. This is a major concern for us.
Hi @chhajedji, Do you have any update on this?
I have included BT controller & VHCI and Bluedroid get status API call in my code where I am observing following error logs when BLE scan is stopped or not responding before I executes esp_restart() for reboot :
I have included sdkconfig parameters suggestion mentioned by @GianlucaLoi, but didn’t helped in resolving this issue.
Hi @chhajedji I used patch from repo esptool which resolved my issue of firmware build and flashing it. Patch link for fixing sign_data error argument : 5b8c2f1b02d0e4b9bac0756b3ead66e8beea428a.
I flashed my firmware with only update of IDF version v3.3.5 and I still observe that this issue of stopping of BLE scan at random interval. I flashed this update on multiple ESP32 based hubs for testing and all are showing this BLE scan stopping issue at random interval.