esp-idf: NimBLE scan stops generating results, then crashes when resumed. (IDFGH-2001)

Environment

  • Development Kit: TinyPICO
  • Module or chip used: ESP32-PICO-D4
  • IDF version (run git describe --tags to find it): v4.0-beta1, v4.1-dev
  • Build System: MicroPython
  • Compiler version (run xtensa-esp32-elf-gcc --version to find it): (crosstool-NG crosstool-ng-1.22.0-80-g6c4433a) 5.2.0
  • Operating System: Linux
  • Power Supply: USB

Problem Description

Using NimBLE, running a ble_gap_disc, the stack eventually stops generating new scan results (i.e. the ble_gap_event_fn passed to ble_gap_disc no longer gets called).

If I stop scanning (i.e. ble_gap_disc_cancel), then start again, it fails with:

ASSERT_PARAM(64 0), in rwble.c at line 235
Guru Meditation Error: Core  0 panic'ed (Interrupt wdt timeout on CPU0)
Core 0 register dump:
PC      : 0x40087aad  PS      : 0x00060034  A0      : 0x8008b964  A1      : 0x3ffbe4d0  
A2      : 0x00000001  A3      : 0x00000000  A4      : 0x00000000  A5      : 0x60008054  
A6      : 0x3ffbdb60  A7      : 0x60008050  A8      : 0x80087aad  A9      : 0x3ffbe4b0  
A10     : 0x00000004  A11     : 0x00000000  A12     : 0x6000804c  A13     : 0xffffffff  
A14     : 0x00000000  A15     : 0xfffffffc  SAR     : 0x00000004  EXCCAUSE: 0x00000005  
EXCVADDR: 0x00000000  LBEG    : 0x400879e5  LEND    : 0x400879ec  LCOUNT  : 0x00000000  
Core 0 was running in ISR context:
EPC1    : 0x401c2722  EPC2    : 0x00000000  EPC3    : 0x00000000  EPC4    : 0x40087aad

ELF file SHA256: 0000000000000000000000000000000000000000000000000000000000000000

Backtrace: 0x40087aaa:0x3ffbe4d0 0x4008b961:0x3ffbe4f0 0x4008c54f:0x3ffbe510 0x40085ea5:0x3ffbe530 0x401c271f:0x3ffbbcb0 0x40152d2f:0x3ffbbcd0 0x40095495:0x3ffbbcf0 0x40093fe9:0x3ffbbd20

Core 1 register dump:
PC      : 0x401c2722  PS      : 0x00060f34  A0      : 0x80152d32  A1      : 0x3ffbc430  
A2      : 0x3ffc289d  A3      : 0x00060f20  A4      : 0x800941df  A5      : 0x3ffcfee0  
A6      : 0x00000003  A7      : 0x00060023  A8      : 0x8010f168  A9      : 0x3ffbc420  
A10     : 0x00000003  A11     : 0x00060f23  A12     : 0x00060f20  A13     : 0x3ffcfec0  
A14     : 0x00000003  A15     : 0x00060023  SAR     : 0x0000001d  EXCCAUSE: 0x00000005  
EXCVADDR: 0x00000000  LBEG    : 0x00000000  LEND    : 0x00000000  LCOUNT  : 0x00000000  

ELF file SHA256: 0000000000000000000000000000000000000000000000000000000000000000

Backtrace: 0x401c271f:0x3ffbc430 0x40152d2f:0x3ffbc450 0x40095495:0x3ffbc470 0x40093fe9:0x3ffbc4a0

I have been investigating this issue that was reported by MicroPython users. The typical time to failure was several hours, but I was able to repro it much faster with some devices sending advertisements at a high rate (50 per second). Usually takes about 10-15 mins for the issue to occur.

The most important detail about my configuration is that it’s running inside MicroPython, but I can repro this if I turn gap_scan_cb into a no-op, so that should hopefully exclude any issues with what we’re doing in the callback. No other BLE operations were active. It’s a passive scan, with interval=30ms, window=30ms, duration=forever. Another detail is that NimBLE is pinned to core 1 (same core as the MicroPython task), but I saw this problem when it was on the default core (0). The main MicroPython task is blocked on the UART (for the REPL) via ulTaskNotifyTake.

MicroPython doesn’t use IDF’s makefiles or cmake, but it does use sdkconfig. I’ve attached the input and generated sdkconfig.h below.

I wonder if this might be related to https://github.com/espressif/esp-idf/issues/4001 (but that was on Bluedroid, not NimBLE). I can repro with both v4.0-beta1 and v4.1-dev.

Here’s a disassembly of r_rwble_isr which seems to be raising the assert above https://gist.github.com/jimmo/43ba1da440fcbebfab89c70e345f368b (I’ve marked the relevant line with ********) I’ve attached the elf below. Built from https://github.com/jimmo/micropython/tree/ble-fixes

Other items if possible

application.elf.zip sdkconfig.zip

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 1
  • Comments: 34 (26 by maintainers)

Commits related to this issue

Most upvoted comments

Hi @jimmo, apologies for late reply. FYI, I was able to reproduce the issue with high advertisement rate. Though I am still looking into the root-cause, can you try disabling Bluetooth Modem Sleep ? From menuconfig just follow this path : Component config > Bluetooth > Bluetooth controller > MODEM SLEEP Options. Alternatively you can comment out CONFIG_BTDM_MODEM_SLEEP and CONFIG_BTDM_MODEM_SLEEP_MODE_ORIG from your sdkconfig.

Please share your valuable observations.

Hi @prasad-alatkar, we also observe this issue in 4.0-rc1. We urgently need a workaround for this. Can you give some more details on how to avoid this issue or how we could detect and recover from it?

Hi @jimmo,

That’s right, we don’t move the v4.0 release tag. When @prasad-alatkar talks about “backported to v4.0”, he means that the fix was merged into the release/v4.0 branch and will be part of the v4.0.1 bugfix release once that’s available.

We use these “release branches” to maintain bugfix updates for each major & minor version, so in this case release/v4.0 is the branch that contains the v4.0 tag, and future commits on this branch will be tagged for v4.0.1, v4.0.2, etc.

This page in the ESP-IDF docs has more details about our versioning scheme: https://docs.espressif.com/projects/esp-idf/en/stable/versions.html

At this stage, if you plan to stick with stable releases then the only option is to wait for the v4.0.1 release. However if you want to check out the latest commit on the release/v4.0 branch then you can test this fix before it’s released.

Hi @jimmo @0neblock apologies for delayed response. @csushantk is debugging this issue internally. I will update you once the issue is fixed.

@prasad-alatkar Thanks for looking into this!

Can you please confirm if you have not set disc_params.filter_duplicates = 1?

We set filter_duplicates to 0.

Can you please tell me where exactly you are calling ble_gap_dic_cancel?

We call cancel from the main MicroPython task.

To elaborate – the MicroPython firmware gives me a REPL prompt on the device, from which I call a Python function which internally calls ble_gap_disc, with interval and window both set to 30ms.

When the disc callback is invoked, it copies the data into a ringbuffer and notifies the MicroPython task, which then schedules a Python callback that prints out the scan data. (When I referred to making the callback a no-op, this is the code I’m talking about)

After some time interval (usually 10-15 mins in my testing at ~50 results/s, but I’ve had reports from users that this can take many hours at a slower rate), I stop seeing the print output from my Python callback.

So that’s where I interactively stop the scan from the REPL, then re-start it, at which point I see the crash described in rwble.c.

The main MicroPython task is created from app_main().

    xTaskCreatePinnedToCore(mp_task, "mp_task", MP_TASK_STACK_LEN, NULL, MP_TASK_PRIORITY, &mp_main_task_handle, MP_TASK_COREID);

MP_TASK_COREID is 1. We also pin NimBLE to core 1.