zephyr: ESP32C3 S3 MCUBoot seg faults/hangs - suspect cache corruption

Describe the bug Apologies for cross posting from discord, but I am not getting much traction there, so I am widening the net 😃

When enabling the MCUBoot bootloader and using sysbuild for the esp32c3 and esp32s3 SOC the application segfaults/or hangs on entry to main. The same application built to use the espressif IDF 2nd stage bootloader runs fine. I can reproduce the issue minimally by building the z-bus hello_world sample.

My strong suspicion is the the cache is not being left in the correct state at the end of the function map_rom_segments(void) in soc/<arch>/espressif_esp32/<soc>/loader.c. Both SOC have very similar code, which surprised me due to the fact one is dual core and there are on different arch, but maybe the HAL sorts this…

Also suspicious in the fn is the clearing of bits in EXTMEM_DCACHE_CTRL1_REG which for both resolves to 0x600c4004, in both cases according to the reference manuals this register address either is marked ‘reserved’ or the bit(s) in question have no documented function.

The other clue that this a cache issue is that in the case of the C3 I can resolve the problem by log printing autoload after the ‘cache_resume’. I think that somehow this triggers a cache reset/reload and the application will run fine. Also issues #64297 and #62172 describe the same behaviour, with the same flaky sometimes works, mostly doesn’t thing.

Also the ROM functions to suspend and resume the cache are sparsely documented so I am not sure what they do, but the autoload value returned from suspend is 0, which also seems strange as I suspect that the cache needs to be in autoload, unless that doesn’t mean what I think it means… also passing true to the resume doesn’t fix things.

To get further clues I have tried delays, setting/clearing cache clk and reset bits as per the manuals (I did think this worked for a bit, except it was the log print I added!). None of these made any difference.

I guess that the loader code has been copied between SOCs and works naively, and intermittently, so has never really been examined. Also the IDF 2nd stage bootloader works as expected. I am happy to work on and contribute fixes, but without some clearer doc for the SOC and/or guidance on how cache setup should happen I am a bit stuck. Also, it may not be a cache issue at all …

To Reproduce west build --sysbuild -p -b esp32c3_devkitm samples/subsys/zbus/hello_world -- -DCONFIG_BOOTLOADER_MCUBOOT=y

west flash

west espressif monitor

Substitute esp32s3_devkitm to do the same for s3

Expected behavior

Sample to run as expected.

ESP-ROM:esp32c3-api1-20210207
Build:Feb  7 2021
rst:0x1 (POWERON),boot:0xd (SPI_FAST_FLASH_BOOT)
SPIWP:0xee
mode:DIO, clock div:2
load:0x3fcd8000,len:0x1018
load:0x403cad98,len:0x3d8c
load:0x403d0000,len:0x1a24
entry 0x403cc724
[esp32c3] [INF] MCUboot 2nd stage bootloader
[esp32c3] [INF] compiled at Jan 24 2024 09:08:18
[esp32c3] [INF] Chip revision: 3
[esp32c3] [INF] SPI Flash speed: 40MHz, mode: DIO, size: 4MB
[esp32c3] [INF] Image index: 0, Swap type: none
[esp32c3] [INF] Loading image 0 - slot 0 from flash, area id: 1
[esp32c3] [INF] Application start=40381E66h
[esp32c3] [INF] DRAM segment: paddr=00005A68h, vaddr=3FC87010h, size=003AAh (   938) load
[esp32c3] [INF] IRAM segment: paddr=000016F0h, vaddr=40380000h, size=04378h ( 17272) load
[esp32c3] [INF] DROM segment: paddr=00010040h, vaddr=3C000040h, size=016B0h (  5808) (al=00000000) map
[esp32c3] [INF] IROM segment: paddr=00020000h, vaddr=42020000h, size=039B8h ( 14776) map
I: Sensor sample started raw reading, version 0.1-2!
I: Channel list:
I: 0 - Channel acc_data_chan:
I:       Message size: 12
I:       Observers:
I:       - foo_lis
I:       - bar_sub
I: 1 - Channel simple_chan:
I:       Message size: 4
I:       Observers:
I: 2 - Channel version_chan:
I:       Message size: 4
I:       Observers:
I: Observers list:
I: 0 - Subscriber bar_sub
I: 1 - Listener foo_lis
I: From listener -> Acc x=1, y=1, z=1
I: From subscriber -> Acc x=1, y=1, z=1
I: From listener -> Acc x=2, y=2, z=2
I: From subscriber -> Acc x=2, y=2, z=2
I: Pub a valid value to a channel with validator successfully.
I: Pub an invalid value to a channel with validator successfully.

Impact This is a showstopper for our application. We have a product in last stage hardware revision, hoping to get some pilot devices in the field soon. This issue has cropped up as we were merging together some modules for the final app. The z-bus is used in our IPC chain and the MCUBoot for our firmware and deployment, it was putting them both together that surfaced the issue.

Our board is C3 based but I happened to have an S3 and decided to try it to see if that helped find the issue. So my priority is getting the C3 working, but I can help out with the S3 as well if it needs similar repairs.

Logs and console output

ESP-ROM:esp32c3-api1-20210207
Build:Feb  7 2021
rst:0x1 (POWERON),boot:0xd (SPI_FAST_FLASH_BOOT)
SPIWP:0xee
mode:DIO, clock div:2
load:0x3fcd8000,len:0x1018
load:0x403cad98,len:0x3d8c
load:0x403d0000,len:0x1a24
entry 0x403cc724
[esp32c3] [INF] MCUboot 2nd stage bootloader
[esp32c3] [INF] compiled at Jan 24 2024 08:58:42
[esp32c3] [INF] Chip revision: 3
[esp32c3] [INF] SPI Flash speed: 40MHz, mode: DIO, size: 4MB
[esp32c3] [INF] Image index: 0, Swap type: none
[esp32c3] [INF] Loading image 0 - slot 0 from flash, area id: 1
[esp32c3] [INF] Application start=40381E66h
[esp32c3] [INF] DRAM segment: paddr=00005A5Ch, vaddr=3FC87010h, size=003AAh (   938) load
[esp32c3] [INF] IRAM segment: paddr=000016E4h, vaddr=40380000h, size=04378h ( 17272) load
[esp32c3] [INF] DROM segment: paddr=00010040h, vaddr=3C000040h, size=016A4h (  5796) map
[esp32c3] [INF] IROM segment: paddr=00020000h, vaddr=42020000h, size=039B8h ( 14776) map
E: 
E:  mcause: 7, Store/AMO access fault
E:   mtval: 3c001670
E:      a0: 00000014    t0: bae01309
E:      a1: 3c001670    t1: 0000000f
E:      a2: 3c0016e8    t2: 690ae800
E:      a3: 00000000    t3: 3c394a1e
E:      a4: 3c0016c8    t4: f191ad25
E:      a5: 3c0016d8    t5: 2c6a2555
E:      a6: 005b7e31    t6: f526b1e7
E:      a7: 1ddc51d1
E:      ra: 403808ba
E:    mepc: 42022816
E: mstatus: 00001880
E: 
E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
E: Current thread: 0x3fc844f8 (unknown)
E: Halting system
(.venv) you@yours:~/zephyrproject/zephyr$ addr2line -e build/hello_world/zephyr/zephyr.elf 3c001670
/home/you/zephyrproject/zephyr/samples/subsys/zbus/hello_world/src/main.c:35

main.c:35, statically defined data, accessed by a LOG_ early in main.

ZBUS_CHAN_DEFINE(acc_data_chan,  /* Name */
		 struct acc_msg, /* Message type */

		 NULL,                                 /* Validator */
		 NULL,                                 /* User data */
		 ZBUS_OBSERVERS(foo_lis, bar_sub),     /* observers */
		 ZBUS_MSG_INIT(.x = 0, .y = 0, .z = 0) /* Initial value */
);

Environment (please complete the following information):

  • OS: GNU/Linux
  • Toolchain: zephyr 0.16.0
  • Main

About this issue

  • Original URL
  • State: closed
  • Created 5 months ago
  • Reactions: 2
  • Comments: 23 (10 by maintainers)

Commits related to this issue

Most upvoted comments

I’ll take a couple days to flush out some solid test cases here, and firm up what works best. I think this gives me enough to work around the problem, so that we can take time to find the ‘correct’ fix (whatever that ends up being). I’ll probably end up building a bunch of sample projects and seeing which ones work or not work then adjusting logging levels, and shell features until it breaks. Watching the readelf and counting segments is a nice shortcut. My understanding here is for a specific soc, the number of segments should always be the same? ie: I should always see 8 segments for zephyr applications compiled for the esp32s2?

@sylvioalves let me know anytime if you want me to test something, or gather information.

Finally thanks for the tool examples as well. Even the simple stuff like: -DCONFIG_COMPILER_SAVE_TEMPS=y -DCONFIG_DEBUG_OPTIMIZATIONS=y I never thought to look for.