esp-idf: BT Controller - Stops Scanning or responding after random amount of time (IDFGH-1781)

Brief

I have been having a problem with the Bluedroid BT Controller Scanning function for a few weeks now, and after trying many different things, I am stuck and am not sure what else I can try. The crux of the issue is that the BLE Scanning feature will work for a large amount of time - up to 3 days, then just fail silently, with the whole BT controller seemingly shutting down.

Problem Description

I have a BLE Scanning app that is working well for the most part. It spends most of its time performing an active scan for other BLE devices that are advertising a service UUID and some custom manufacturer data. It receives an advertisement from a sensor around every 1 second, but I can have anywhere from 1-10 sensors within range at any one time.

After a completely random period of time, sometimes 20 minutes, sometimes 3 days. The App will stop receiving ESP_GAP_SEARCH_INQ_RES_EVT events from the bt layer, even though it should still be receiving advertisements form multiple devices, with no indication from any underlying BT Controller debugging that anything has happened. This happens no matter how many sensors I have within range of the ESP, advertising the device, it even happens when I have no sensors advertising, and the general BVLE background advertisements are relatively low.

The free heap memory of the app stays the same (~140kB free memory at any one time), so I can rule out a memory leak on the app side, and the rest of the application keeps running normally, albeit with more computation time from the RTOS (indicated by a loop counter that increases when this error happens), So clearly some of the BT Tasks have stopped running.

When the error happens, I can also see that the ESP itself DOES STOP performing Active scanning, as The sensors I use flash an LED whenever they receive a SCAN_REQUEST from the ESP32 Hardware MAC Address, and this stops happening as soon as the error starts.

If I try and recover from the error, by issuing a command such as esp_ble_gap_start_scanning() - which responds ESP_OK, I get a HCI timeout error printed: BT_HCI: command_timed_out hci layer timeout waiting for response to a command. opcode: 0x200c. At the moment, trying to perform a bt command after the error, and getting this response, is the only indication from the application that something has gone wrong.

I am not using any WiFi functions, so to reduce memory footprint and file size, I have changed the linker script to only include the following libraries in the component.mk of esp32: core rtc phy instead of the usual: core rtc net80211 pp wpa smartconfig coexist wps wpa2 espnow phy mesh

Coredump

coredump This is a coredump taken about 20 minutes after the error occured. I forced this core dump to log by deliberately throwing an IntegerDivideByZero Exception in another task. My hope here is that it saved the task state of the BT tasks, which your team can use internally to see the task state. If you require My APP ELF I can provide this by email.

Debug Log

This is a log showing the lack of errors I receive when the error happens. As you can see, the application was running for 2.5 days before the error occured. The ‘BMS’ TAG is my application, and the ‘Scanning started’ and ‘scanning stopped’ logs are when my app receives the ESP_GAP_SEARCH_INQ_CMPL_EVT and ESP_GAP_BLE_SCAN_START_COMPLETE_EVT events respectively. In this application, I start a esp_ble_gap_start_scanning operation of 30 seconds, and when I receive a ESP_GAP_SEARCH_INQ_CMPL_EVT event, i set a flag to restart the esp_ble_gap_start_scanning of 30 seconds, in a cycle. Although as discussed later, I have tried changing this interval to anywhere from 30 seconds to 5 minutes, and I have also tried setting the interval to 0 for unlimited, so I only call the start_scan once. In this instance, my pplication received the ESP_GAP_SEARCH_INQ_CMPL_EVT event, so set a flag internally to call esp_ble_gap_start_scanning(30) again, which responded with ESP_OK, but I never received the ESP_GAP_BLE_SCAN_START_COMPLETE_EVT, and about 8 seconds later, I see an error log of command timeout.

I (210356607) BMS[scanning]: Scanning Stopped
I (210356607) BMS[scanning]: Scanning started
I (210363327) main: HEAP - free: 140560, largest_block: 98108 | PSRAM - free: 3451316, used: 202360, attempted: 981230 | pps: 21, lps: 248
I (210373357) main: HEAP - free: 140560, largest_block: 98108 | PSRAM - free: 3451316, used: 202360, attempted: 981230 | pps: 20, lps: 249
I (210383367) main: HEAP - free: 140560, largest_block: 98108 | PSRAM - free: 3451316, used: 202360, attempted: 981230 | pps: 19, lps: 247
I (210386607) BMS[scanning]: Scanning Stopped
I (210386617) BMS[scanning]: Scanning started
I (210393387) main: HEAP - free: 140560, largest_block: 98108 | PSRAM - free: 3451316, used: 202360, attempted: 981230 | pps: 21, lps: 248
I (210403407) main: HEAP - free: 140560, largest_block: 98108 | PSRAM - free: 3451316, used: 202360, attempted: 981230 | pps: 19, lps: 248
I (210413457) main: HEAP - free: 140560, largest_block: 98108 | PSRAM - free: 3451316, used: 202360, attempted: 981230 | pps: 19, lps: 249
I (210416617) BMS[scanning]: Scanning Stopped
I (210423487) main: HEAP - free: 140560, largest_block: 98108 | PSRAM - free: 3451220, used: 202452, attempted: 981230 | pps: 17, lps: 250
E (210424617) BT_HCI: command_timed_out hci layer timeout waiting for response to a command. opcode: 0x200c
I (210433497) main: HEAP - free: 140560, largest_block: 98108 | PSRAM - free: 3451220, used: 202452, attempted: 981230 | pps: 25, lps: 250

sdkconfig

Scanning Configuration Used

These are the configurations currently in use, but as you’ll see below I have tried many different

static esp_ble_scan_params_t logging_ble_scan_params = {
    .scan_type              = BLE_SCAN_TYPE_ACTIVE,
    .own_addr_type          = BLE_ADDR_TYPE_PUBLIC,
    .scan_filter_policy     = BLE_SCAN_FILTER_ALLOW_ALL, 
    .scan_interval          = 0x100,
    .scan_window            = 0x100,
    .scan_duplicate         = BLE_SCAN_DUPLICATE_DISABLE
};

Changes Attempted

Below is a list of sdkconfig changes of application setup/operation changes that I have tried, with no success , the same thing occurs.

Unlimited scanning timeout - when done this way, I never even get an indication of hci_layer_timeout, because i never try to run another esp_ble_gap_start_scanning call, so the bt controller will fail silently
Turning CONFIG_BLUEDROID_MEM_DEBUG ON - There is no debugging information around the time the error happens
CONFIG_BT_BLE_DYNAMIC_ENV_MEMORY ON/OFF
CONFIG_BT_ALLOCATION_FROM_SPIRAM_FIRST ON/OFF
CONFIG_BLE_HOST_QUEUE_CONGESTION_CHECK ON/OFF
Scanning interval/window change to many different values: 0x100/0x50, 0x200/0x50, 0x1000/0x100, and a few more.
180 second scanning timeout - This seems to make the error happen quicker, although that may be placebo as I only tested a few times.
Increase CONFIG_BTC_TASK_STACK_SIZE and CONFIG_BTU_TASK_STACK_SIZE
Not calling esp_bt_controller_mem_release(ESP_BT_MODE_CLASSIC_BT) (which I usually call before enabling anything.
Enabling/disbling WiFi coexist (I am not using WiFi function at all

If there is anything else I should try, please let me know.

Apologies for the large Github issue, this error has been troubling me for some time and I would like to know what I can try next. Thank you.

Environment

Key	Value
Development Kit	Custom Board
Module or chip used	ESP32-WROVER-32
IDF version	91f29bef172a082cbd8f0208ed1757ede0e1d635 - tracking release/v3.3
Build System	Make
Compiler version	crosstool-ng-1.22.0-80-g6c4433a
Operating System	macOS
Power Supply	external 3.3V

About this issue

Original URL
State: open
Created 5 years ago
Comments: 61 (19 by maintainers)

Most upvoted comments

We are using ble scan.

On Fri, Jan 7, 2022, 12:48 Juan Ávila @.***> wrote:

We are using 4.3.1 and we are not seeing the issue … <#m_6376715294096464826_> On Thu, Dec 30, 2021, 18:53 Juan Ávila @.> wrote: Hi! Any update about this? I noticed this same issue when using BLE + Classic (v4.4-beta1). Thanks! — Reply to this email directly, view it on GitHub <#4001 (comment) https://github.com/espressif/esp-idf/issues/4001#issuecomment-1003106488>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AS7K4PPKJ26AMQ5UEROFOM3UTSE7LANCNFSM4ITBESYQ https://github.com/notifications/unsubscribe-auth/AS7K4PPKJ26AMQ5UEROFOM3UTSE7LANCNFSM4ITBESYQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub. You are receiving this because you were mentioned.Message ID: @.>

Are you using bluedroid (with BTDM mode)? With only ble scan I can’t reproduce it

— Reply to this email directly, view it on GitHub https://github.com/espressif/esp-idf/issues/4001#issuecomment-1007312832, or unsubscribe https://github.com/notifications/unsubscribe-auth/AS7K4PI44ASQV3SNT4M75BTUU3AI7ANCNFSM4ITBESYQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>

Rokachy on Jan 7, 2022

Same here, even switched to entirely different SoC for a large-volume product because of this.

pschlan on May 18, 2021

We’ve noticed this same issue on multiple versions of the IDF (v3.3.3, v4.1, v4.2, v4.2.1). BLE scanning and BLE server advertising will stop at intermittent times and require a power cycle to fix. Sometimes this happens after an hour, sometimes after a few days. We’ve implemented a daily reboot and have attempted to implement a system where a reboot will happen if the scan event hasn’t fired for awhile. We’ve tried many different configurations, but nothing has worked to resolve the issue. It seems to happen more frequently when a lot of BLE devices are around with scanning. This is a major concern for us.

chrisomatic on May 18, 2021

Hi @chhajedji, Do you have any update on this?

I have included BT controller & VHCI and Bluedroid get status API call in my code where I am observing following error logs when BLE scan is stopped or not responding before I executes esp_restart() for reboot :

yantrr@yantrr-ws2:rel2-test$ egrep "scan_evt timeout|ble_timer_scan_fail_count|Get BLE TX power for PWR_TYPE_SCAN|BT controller status|BT controller status|vhci host check for sending packet to controller|BT_HCI|Version:|IDF" usb2_2021-05-14-22.22.02.log
I (1057) cpu_start: ESP-IDF:          v3.3.5-dirty
I (6090) [APP-MAIN]: Version: v2.3.2-beta8-ohirel2_3   MacID: 24:62:ab:ef:b2:88
scan_evt timeout
E (3005880) BT_HCI: command_timed_out hci layer timeout waiting for response to a command. opcode: 0x200c
I (3033940) [BLE-MASTER]: ble_timer_scan_fail_count : 1
I (3033940) [BLE-MASTER]: ble_timer_scan_fail_count failed, esp_restart()
I (3033940) [BLE-MASTER]: Get BLE TX power for PWR_TYPE_SCAN : 5
I (3033950) [BLE-MASTER]: BT controller status : 2
I (3033950) [BLE-MASTER]: vhci host check for sending packet to controller, status : 1
I (3033950) [BLE-MASTER]: Bluedroid Status : 2
I (1055) cpu_start: ESP-IDF:          v3.3.5-dirty
I (6104) [APP-MAIN]: Version: v2.3.2-beta8-ohirel2_3   MacID: 24:62:ab:ef:b2:88
scan_evt timeout
scan_evt timeout
E (6569124) BT_HCI: command_timed_out hci layer timeout waiting for response to a command. opcode: 0x200c
I (6616944) [BLE-MASTER]: ble_timer_scan_fail_count : 1
I (6616944) [BLE-MASTER]: ble_timer_scan_fail_count failed, esp_restart()
I (6616944) [BLE-MASTER]: Get BLE TX power for PWR_TYPE_SCAN : 5
I (6616954) [BLE-MASTER]: BT controller status : 2
I (6616954) [BLE-MASTER]: vhci host check for sending packet to controller, status : 1
I (6616954) [BLE-MASTER]: Bluedroid Status : 2
I (1055) cpu_start: ESP-IDF:          v3.3.5-dirty
I (6104) [APP-MAIN]: Version: v2.3.2-beta8-ohirel2_3   MacID: 24:62:ab:ef:b2:88
scan_evt timeout
E (4419764) BT_HCI: command_timed_out hci layer timeout waiting for response to a command. opcode: 0x200c
I (4538624) [BLE-MASTER]: ble_timer_scan_fail_count : 1
I (4538624) [BLE-MASTER]: ble_timer_scan_fail_count failed, esp_restart()
I (4538624) [BLE-MASTER]: Get BLE TX power for PWR_TYPE_SCAN : 5
I (4538634) [BLE-MASTER]: BT controller status : 2
I (4538634) [BLE-MASTER]: vhci host check for sending packet to controller, status : 1
I (4538634) [BLE-MASTER]: Bluedroid Status : 2
I (1055) cpu_start: ESP-IDF:          v3.3.5-dirty
I (6104) [APP-MAIN]: Version: v2.3.2-beta8-ohirel2_3   MacID: 24:62:ab:ef:b2:88

I have included sdkconfig parameters suggestion mentioned by @GianlucaLoi, but didn’t helped in resolving this issue.

vbvchauthmal on May 16, 2021

Hi @chhajedji I used patch from repo esptool which resolved my issue of firmware build and flashing it. Patch link for fixing sign_data error argument : 5b8c2f1b02d0e4b9bac0756b3ead66e8beea428a.

I flashed my firmware with only update of IDF version v3.3.5 and I still observe that this issue of stopping of BLE scan at random interval. I flashed this update on multiple ESP32 based hubs for testing and all are showing this BLE scan stopping issue at random interval.

vbvchauthmal on May 11, 2021