esp-idf: [TW#20016] "user code done" failures (IDFGH-974)

We are deploying a bunch of esp32’s on an industrial setting, and we are noticing a very worrying lockup that sometimes happens when the power fails, resulting in the dreaded “user code done” message.

After browsing the bootloader codebase I’ve come to the conclusion that this message it is in fact, on the stage 1 bootloader, which poses the issue: I don’t want to have my esp32s to just say “meh we did what we could” I need them to retry after a flash failure or other stuff.

I’ve tried modifying the stage 2 bootloader but I don’t see how could I trigger a software reset from there.

Here’s the log of the failure in question:

[Mar  8 10:43:11.817]
[Mar  8 10:43:11.817] rst:0xc (SW_CPU_RESET),boot:0x17 (SPI_FAST_FLASH_BOOT)
[Mar  8 10:43:11.832] configsip: 0, SPIWP:0xee
[Mar  8 10:43:11.832] clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
[Mar  8 10:43:11.832] mode:DIO, clock div:2
[Mar  8 10:43:11.848] load:0x3fff0018,len:4
[Mar  8 10:43:11.848] load:0x3fff001c,len:5336
[Mar  8 10:43:11.848] load:0x40078000,len:0
[Mar  8 10:43:11.848] ho 12 tail 0 room 4
[Mar  8 10:43:11.863] load:0x40078000,len:14528
[Mar  8 10:43:11.863] entry 0x4007903c
[Mar  8 10:43:11.863] [0;32mI (30) boot: ESP-IDF v3.0-rc1-r4 2nd stage bootloader[0m
[Mar  8 10:43:11.879] [0;32mI (30) boot: compile time 14:24:19[0m
[Mar  8 10:43:11.879] [0;32mI (30) boot: Enabling RNG early entropy source...[0m
[Mar  8 10:43:11.895] [0;32mI (35) boot: SPI Speed      : 40MHz[0m
[Mar  8 10:43:11.895] [0;32mI (40) boot: SPI Mode       : DIO[0m
[Mar  8 10:43:11.910] [0;32mI (44) boot: SPI Flash Size : 4MB[0m
[Mar  8 10:43:11.910] [0;32mI (48) boot: Partition Table:[0m
[Mar  8 10:43:11.926] [0;32mI (51) boot: ## Label            Usage          Type ST Offset   Length   Flags[0m
[Mar  8 10:43:11.926] [0;32mI (59) boot:  0 nvs              WiFi data        01 02 00009000 00004000 00000000[0m
[Mar  8 10:43:11.942] [0;32mI (67) boot:  1 otadata          OTA data         01 00 0000d000 00002000 00000000[0m
[Mar  8 10:43:11.957] [0;32mI (76) boot:  2 app_0            OTA app          00 10 00010000 00180000 00000000[0m
[Mar  8 10:43:11.957] [0;32mI (84) boot:  3 fs_0             SPIFFS           01 82 00190000 00040000 00000000[0m
[Mar  8 10:43:11.973] [0;32mI (92) boot:  4 app_1            OTA app          00 11 001d0000 00180000 00000000[0m
[Mar  8 10:43:11.988] [0;32mI (100) boot:  5 fs_1             SPIFFS           01 82 00350000 00040000 00000000[0m
[Mar  8 10:43:12.004] [0;32mI (109) boot: End of partition table[0m
[Mar  8 10:43:12.004] [0;32mI (113) boot: OTA data 0: seq 0x00000001, st 0x10, CRC 0x157a2b85, valid? 1[0m
[Mar  8 10:43:12.020] [0;32mI (121) boot: OTA data 1: seq 0x00000000, st 0x00, CRC 0x00000000, valid? 0[0m
[Mar  8 10:43:12.020] [0;32mI (128) boot: Disabling RNG early entropy source...[0m
[Mar  8 10:43:12.035] [0;32mI (134) esp_image: segment 0: paddr=0x00010020 vaddr=0x3f400020 size=0x4bad4 (309972) map[0m
[Mar  8 10:43:12.067] [0;32mI (251) esp_image: segment 1: paddr=0x0005bafc vaddr=0x3ffc0000 size=0x02d54 ( 11604) load[0m
[Mar  8 10:43:12.082] [0;32mI (256) esp_image: segment 2: paddr=0x0005e858 vaddr=0x40080000 size=0x00400 (  1024) load[0m
[Mar  8 10:43:12.098] [0;32mI (259) esp_image: segment 3: paddr=0x0005ec60 vaddr=0x40080400 size=0x013b0 (  5040) load[0m
[Mar  8 10:43:12.113] [0;32mI (269) esp_image: segment 4: paddr=0x00060018 vaddr=0x400d0018 size=0x104424 (1066020) map[0m
[Mar  8 10:43:12.434] [0;32mI (649) esp_image: segment 5: paddr=0x00164444 vaddr=0x400817b0 size=0x1263c ( 75324) load[0m
[Mar  8 10:43:12.480] [0;32mI (680) esp_image: segment 6: paddr=0x00176a88 vaddr=0x400c0000 size=0x00068 (   104) load[0m
[Mar  8 10:43:12.958] [0;31mE (680) esp_image: Checksum failed. Calculated 0x8 read 0x64[0m
[Mar  8 10:43:12.958] [0;31mE (684) boot: Failed to verify app image @ 0x10000 (8194)[0m
[Mar  8 10:43:12.974] user code done```

I just want a more reliable boot sequence to what we have right now

About this issue

Original URL
State: closed
Created 6 years ago
Comments: 15 (4 by maintainers)

Commits related to this issue

bootloader: Ensure bootloader never returns to caller * Fixes some "noreturn" functions in bootloader utils which did return (causing fatal CPU exceptions). * Marks bootloader entry as "noreturn", ... — committed to espressif/esp-idf by projectgus 6 years ago
bootloader: Ensure bootloader never returns to caller * Fixes some "noreturn" functions in bootloader utils which did return (causing fatal CPU exceptions). * Marks bootloader entry as "noreturn", ... — committed to catalinio/pycom-esp-idf by projectgus 6 years ago
bootloader_support: Fix enable rtc_wdt for resolve issue with varying supply Eliminates the issue with the lock up in the bootloader due to a power drawdown during its operation. Closes https://gith... — committed to catalinio/pycom-esp-idf by KonstantinKondrashov 6 years ago

Most upvoted comments

We are also planning to use a HW watchdog on our next revision, but we have existing hardware so we’ve been working on a software solution. I’ll share our progress here for anyone facing the same issue.

In bootloader_main, the RTC watchdog is disabled (I assume this is for the case where it was enabled before a software reset.) So we re-enabled it right after the disable with a 5 second timeout to catch any lockups in the bootloader. If you don’t know how to enable it, see code in the function esp_panic_wdt_start() - that code can be copied directly. The main_task() in cpu_start.c disables the watchdog again, so you don’t have to do anything with it in your application code. With this change we haven’t seen any lockups in the second stage bootloader that are not recoverable.

We also found a lockup after a software reset triggered by the brownout detector. We are using the RTC watchdog in our application, but there is also code in the software reset to enable. I don’t understand why, but enabling it twice caused it to no longer work. Once we added a watchdog disable at the beginning of the software reset, the RTC watchdog started to work properly and caught any lockups after brownout but before the second stage bootloader.

Lastly, removing the esp_cpu_stall from the rtc_brownout_isr_handler(), and lowering the brownout level to the lowest (2.43V) made the brownout reset more reliable.

peterwoodman on Jul 12, 2018