esp-idf: ESP32 hangs on stack overflow (IDFGH-8665)
Answers checklist.
- I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
- I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
- I have searched the issue tracker for a similar issue and not found a similar issue.
IDF version.
v4.4
Operating System used.
Linux
How did you build your project?
VS Code IDE
If you are using Windows, please specify command line type.
No response
Development Kit.
ESP32-DEVKITS
Power Supply used.
USB
What is the expected behavior?
Print error and reboot
What is the actual behavior?
The ESP32 enters the panic handler, hangs there, and needs a power cycle to recover.
Steps to reproduce.
Unfortunately we cannot share the code here.
- Initialize a task with low stack space.
- When a stack overflow occurs, the code enters the panic_handler() function but does not run it to completion.
Note: this does not happen on every stack overflow. Most of the time, the ESP32 prints the name of the task whose stack overflowed and reboots normally.
sdkconfig.txt
Debug Logs.
No response
More Information.
Module : ESP32-WROOM-32E-8MB
External Libraries : esp-aws-iot v3.1.x
What we have noticed is that the ESP32 halts and enters a state similar to reset: all GPIOs are in their reset state and stay there.
This happens on a stack overflow. Instead of rebooting, the ESP32 appears to get stuck somewhere in the panic_handler() function; we infer this from the GPIO states described below.
We added two gpio_set_level() calls to panic_handler() in panic_handler.c. When the panic ends in a normal reset, both GPIO 4 and GPIO 2 are driven high (3.3 V). When the ESP32 hangs, however, only GPIO 4 is set and GPIO 2 stays low.
static void panic_handler(void *frame, bool pseudo_excause)
{
    panic_info_t info = { 0 };
    gpio_set_level(4, 1);

    /*
     * Setup environment and perform necessary architecture/chip specific
     * steps here prior to the system panic handler.
     * */
    int core_id = cpu_hal_get_core_id();

    // If multiple cores arrive at panic handler, save frames for all of them
    g_exc_frames[core_id] = frame;

#if !CONFIG_ESP_SYSTEM_SINGLE_CORE_MODE
    // These are cases where both CPUs both go into panic handler. The following code ensures
    // only one core proceeds to the system panic handler.
    if (pseudo_excause) {
#define BUSY_WAIT_IF_TRUE(b) { if (b) while(1); }
        // For WDT expiry, pause the non-offending core - offending core handles panic
        BUSY_WAIT_IF_TRUE(panic_get_cause(frame) == PANIC_RSN_INTWDT_CPU0 && core_id == 1);
        BUSY_WAIT_IF_TRUE(panic_get_cause(frame) == PANIC_RSN_INTWDT_CPU1 && core_id == 0);

        // For cache error, pause the non-offending core - offending core handles panic
        if (panic_get_cause(frame) == PANIC_RSN_CACHEERR && core_id != esp_cache_err_get_cpuid()) {
            // Only print the backtrace for the offending core in case of the cache error
            g_exc_frames[core_id] = NULL;
            while (1) {
                ;
            }
        }
    }

    // Need to reconfigure WDTs before we stall any other CPU
    esp_panic_handler_reconfigure_wdts();

    esp_rom_delay_us(1);
    SOC_HAL_STALL_OTHER_CORES();
#endif

    esp_ipc_isr_stall_abort();

    if (esp_cpu_in_ocd_debug_mode()) {
#if __XTENSA__
        if (!(esp_ptr_executable(cpu_ll_pc_to_ptr(panic_get_address(frame))) && (panic_get_address(frame) & 0xC0000000U))) {
            /* Xtensa ABI sets the 2 MSBs of the PC according to the windowed call size
             * Incase the PC is invalid, GDB will fail to translate addresses to function names
             * Hence replacing the PC to a placeholder address in case of invalid PC
             */
            panic_set_address(frame, (uint32_t)&_invalid_pc_placeholder);
        }
#endif
        if (panic_get_cause(frame) == PANIC_RSN_INTWDT_CPU0
#if !CONFIG_ESP_SYSTEM_SINGLE_CORE_MODE
                || panic_get_cause(frame) == PANIC_RSN_INTWDT_CPU1
#endif
           ) {
            wdt_hal_write_protect_disable(&wdt0_context);
            wdt_hal_handle_intr(&wdt0_context);
            wdt_hal_write_protect_enable(&wdt0_context);
        }
    }

    // Convert architecture exception frame into abstracted panic info
    frame_to_panic_info(frame, &info, pseudo_excause);
    gpio_set_level(2, 1);

    // Call the system panic handler
    esp_panic_handler(&info);
}
As for logs, the ESP32 just stops printing data over the UART and never reaches the point where the backtrace is printed.
However, after recreating the issue multiple times, we did observe the backtrace printed once, as shown below…

Also, here are some more verbose logs

About this issue
- State: open
- Created 2 years ago
- Comments: 20 (6 by maintainers)
@kewelspintly @brunohpg Here’s a small update about our findings. We’ve been able to recreate the issue consistently, and it seems like SOC_HAL_STALL_OTHER_CORES() is indeed causing the CPUs to get stuck. I’ve attached a small example (run on v4.4.3) that can instantly recreate the issue. Steps to create:
- git checkout tags/v4.4.3
- CONFIG_ESP_INT_WDT=n
- CONFIG_ESP_TASK_WDT=n
- All other configurations can be left as default
What’s strange is that when the CPUs get stuck, WDT_STAGE_ACTION_RESET_SYSTEM has no effect, but WDT_STAGE_ACTION_RESET_RTC is still able to reset the system. I’ll check with the hardware team as to why esp_cpu_stall() can cause the calling CPU to get stuck.
As for a quick workaround, commenting out SOC_HAL_STALL_OTHER_CORES in the panic handling code should be fine in most cases. Not stalling the other core during a panic should only be an issue in the edge case where both cores panic at the same time.
Does someone know if ESP-IDF V5.0.1 is also affected?
@Dazza0 any news related to this topic? Is there any patch available?
@Dazza0 @igrr
This issue has been marked as “Status: Selected for Development” for 1 year. What is the status now?
@Dazza0 @igrr Still no fix for this issue?
@vshymanskyy There is this diagram in ESP32 TRM which tries to explain the difference:
There is a bit of a terminology mixup here, WDT “System” reset refers to “Core system”, labelled as “Core” in the diagram above. RTC reset resets both RTC and Core. The inline comment refers to “Core system” as “main system”…
So, the difference is that the power management part of the chip and the RTC domain are reset when RTC_WDT_STAGE_ACTION_RESET_RTC is used.
The likely reason why using RTC_WDT_STAGE_ACTION_RESET_RTC fixes CPU reset is that the registers controlling CPU stall are in the RTC domain, so resetting the RTC domain has a side effect of un-stalling the CPU.
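To make the reset-domain argument concrete, here is a minimal host-side toy model in C. It is only a sketch of the explanation above, not ESP-IDF code: the struct fields, enum values, and function names are all hypothetical, and it assumes (as suggested above) that the CPU stall requests live in the RTC domain while everything else stands in for the core domain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: names are illustrative, not real ESP32 registers. */
typedef struct {
    bool stall_cpu0;     /* stall request, assumed to live in the RTC domain */
    bool stall_cpu1;     /* stall request, assumed to live in the RTC domain */
    unsigned core_regs;  /* stands in for all state in the core domain */
} chip_model_t;

typedef enum {
    MODEL_RESET_CORE,    /* scope of the WDT "system" (core) reset */
    MODEL_RESET_RTC      /* scope of the RTC reset: RTC domain plus core */
} model_reset_t;

/* Apply a reset of the given scope to the modelled chip. */
void model_reset(chip_model_t *chip, model_reset_t scope)
{
    assert(chip != NULL);
    chip->core_regs = 0;              /* both scopes clear the core domain */
    if (scope == MODEL_RESET_RTC) {
        chip->stall_cpu0 = false;     /* only the RTC reset clears the stalls */
        chip->stall_cpu1 = false;
    }
}

/* A CPU can fetch instructions only if its stall request is clear. */
bool model_cpu_runs(const chip_model_t *chip, int cpu)
{
    assert(chip != NULL);
    return cpu == 0 ? !chip->stall_cpu0 : !chip->stall_cpu1;
}
```

In this model, a chip whose stall bits are set stays stalled after model_reset(&chip, MODEL_RESET_CORE) even though its core state is cleared, and only model_reset(&chip, MODEL_RESET_RTC) lets the CPUs run again. That matches the observation that WDT_STAGE_ACTION_RESET_SYSTEM has no visible effect while WDT_STAGE_ACTION_RESET_RTC recovers the chip.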