esp-idf: ESP32 hangs on stack overflow (IDFGH-8665)

Answers checklist.

  • I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
  • I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

IDF version.

v4.4

Operating System used.

Linux

How did you build your project?

VS Code IDE

If you are using Windows, please specify command line type.

No response

Development Kit.

ESP32-DEVKITS

Power Supply used.

USB

What is the expected behavior?

Print error and reboot

What is the actual behavior?

The chip enters the panic handler, hangs there, and needs a power cycle to recover.

Steps to reproduce.

Unfortunately we cannot share the code here.

  1. Initialize a task with low stack space.
  2. When a stack overflow occurs, the code enters the panic_handler() function but does not execute it completely.

Note: This does not happen on every stack overflow. Most of the time, the ESP32 prints the name of the task whose stack overflowed and reboots normally.

Attached: sdkconfig.txt

Debug Logs.

No response

More Information.

Module: ESP32-WROOM-32E-8MB
External Libraries: esp-aws-iot v3.1.x

What we have noticed is that the ESP32 halts and stays in a state similar to reset, with all GPIOs in their reset state. This happens on a stack overflow, but instead of rebooting, the ESP32 gets stuck somewhere in the panic_handler() function. We inferred this from the GPIO states described below.

We added two gpio_set_level() calls to the panic_handler() function in panic_handler.c. On a normal reset, both GPIO 4 and GPIO 2 are driven high (3.3 V). When the ESP32 hangs, only GPIO 4 is set and GPIO 2 stays low.

static void panic_handler(void *frame, bool pseudo_excause)
{
    panic_info_t info = { 0 };

    gpio_set_level(4, 1);
    /*
     * Setup environment and perform necessary architecture/chip specific
     * steps here prior to the system panic handler.
     * */
    int core_id = cpu_hal_get_core_id();

    // If multiple cores arrive at panic handler, save frames for all of them
    g_exc_frames[core_id] = frame;

#if !CONFIG_ESP_SYSTEM_SINGLE_CORE_MODE
    // These are cases where both CPUs both go into panic handler. The following code ensures
    // only one core proceeds to the system panic handler.
    if (pseudo_excause) {
#define BUSY_WAIT_IF_TRUE(b)                { if (b) while(1); }
        // For WDT expiry, pause the non-offending core - offending core handles panic
        BUSY_WAIT_IF_TRUE(panic_get_cause(frame) == PANIC_RSN_INTWDT_CPU0 && core_id == 1);
        BUSY_WAIT_IF_TRUE(panic_get_cause(frame) == PANIC_RSN_INTWDT_CPU1 && core_id == 0);

        // For cache error, pause the non-offending core - offending core handles panic
        if (panic_get_cause(frame) == PANIC_RSN_CACHEERR && core_id != esp_cache_err_get_cpuid()) {
            // Only print the backtrace for the offending core in case of the cache error
            g_exc_frames[core_id] = NULL;
            while (1) {
                ;
            }
        }
    }

    // Need to reconfigure WDTs before we stall any other CPU
    esp_panic_handler_reconfigure_wdts();

    esp_rom_delay_us(1);
    SOC_HAL_STALL_OTHER_CORES();
#endif

    esp_ipc_isr_stall_abort();

    if (esp_cpu_in_ocd_debug_mode()) {
#if __XTENSA__
        if (!(esp_ptr_executable(cpu_ll_pc_to_ptr(panic_get_address(frame))) && (panic_get_address(frame) & 0xC0000000U))) {
            /* Xtensa ABI sets the 2 MSBs of the PC according to the windowed call size
             * Incase the PC is invalid, GDB will fail to translate addresses to function names
             * Hence replacing the PC to a placeholder address in case of invalid PC
             */
            panic_set_address(frame, (uint32_t)&_invalid_pc_placeholder);
        }
#endif
        if (panic_get_cause(frame) == PANIC_RSN_INTWDT_CPU0
#if !CONFIG_ESP_SYSTEM_SINGLE_CORE_MODE
                || panic_get_cause(frame) == PANIC_RSN_INTWDT_CPU1
#endif
           ) {
            wdt_hal_write_protect_disable(&wdt0_context);
            wdt_hal_handle_intr(&wdt0_context);
            wdt_hal_write_protect_enable(&wdt0_context);
        }
    }

    // Convert architecture exception frame into abstracted panic info
    frame_to_panic_info(frame, &info, pseudo_excause);

    gpio_set_level(2, 1);
    // Call the system panic handler
    esp_panic_handler(&info);
}

As for logs, the ESP32 just stops printing over the UART and does not reach the point where the backtrace is printed. However, after recreating the issue multiple times, we did see the backtrace printed once, as shown in the attached screenshot (Screenshot from 2022-11-01 19-02-17).

Also, here are some more verbose logs (Screenshot from 2022-11-03 14-11-47).

About this issue

  • State: open
  • Created 2 years ago
  • Comments: 20 (6 by maintainers)

Most upvoted comments

@kewelspintly @brunohpg Here’s a small update about our findings. We’ve been able to recreate the issue consistently, and it seems that SOC_HAL_STALL_OTHER_CORES() is indeed causing the CPUs to get stuck. I’ve attached a small example (run on v4.4.3) that can instantly recreate the issue. Steps to recreate:

  1. git checkout tags/v4.4.3
  2. Copy the code snippet below into any project
  3. Disable the following configurations (all other configurations can be left as default):
     a. CONFIG_ESP_INT_WDT=n
     b. CONFIG_ESP_TASK_WDT=n
  4. Build and flash the project to an ESP32. It might take a couple of restarts but the CPUs should eventually get stuck.
Code Snippet
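#include <assert.h>   // assert()
#include <limits.h>   // INT_MAX
#include <stdbool.h>  // bool
#include <stdint.h>   // uint32_t
#include <stdio.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "soc/rtc.h"
#include "hal/wdt_hal.h"
#include "esp_cpu.h"
#include "esp_log.h"
#include "esp_heap_caps.h"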

#define CORE0_TAG "core0"
#define CORE1_TAG "core1"

static volatile bool sync_flag_start = false;

static void core0_task(void *arg)
{
    volatile uint32_t *some_array = heap_caps_malloc(sizeof(uint32_t) * 16, MALLOC_CAP_8BIT);
    assert(some_array);

    while (1) {
        sync_flag_start = true;

        // Just loop access the allocated memory
        for (int i = 0; i < INT_MAX; i++) {
            for (int j = 0; j < 16; j++) {
                some_array[j] = j;
            }
        }
        ESP_LOGI(CORE0_TAG, "loop");
    }
}

static void core1_task(void *arg)
{
    volatile uint32_t *some_array = heap_caps_malloc(sizeof(uint32_t) * 16, MALLOC_CAP_8BIT);
    assert(some_array);

    // Wait for core0 to start
    while (!sync_flag_start) {
        ;
    }

    while (1) {
        for (int i = 0; i < 100; i++) {
            // Stall the other core
            esp_cpu_stall(0);
            // Access the allocated memory
            for (int j = 0; j < 16; j++) {
                some_array[j] = j;
            }
            // Unstall the other core
            esp_cpu_unstall(0);
        }
        ESP_LOGI(CORE1_TAG, "loop");
    }
}

void app_main(void)
{
    vTaskDelay(10);
    printf("Hello World\n");

    // Initialize RTC WDT to reset system after 20 seconds
    wdt_hal_context_t rwdt_context;
    wdt_hal_init(&rwdt_context, WDT_RWDT, 0, false);
    wdt_hal_write_protect_disable(&rwdt_context);
    wdt_hal_config_stage(&rwdt_context,
                         WDT_STAGE0,
                         20000 * rtc_clk_slow_freq_get_hz() / 1000,
                         WDT_STAGE_ACTION_RESET_SYSTEM);
    wdt_hal_enable(&rwdt_context);
    wdt_hal_write_protect_enable(&rwdt_context);

    xTaskCreatePinnedToCore(core1_task, "core1", 4096, NULL, 10, NULL, 1);
    xTaskCreatePinnedToCore(core0_task, "core0", 4096, NULL, 10, NULL, 0);
}
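
The RWDT stage timeout in app_main() is passed in slow-clock ticks, computed as 20000 * rtc_clk_slow_freq_get_hz() / 1000. As a worked example of that arithmetic, here is a small host-side sketch. The helper name rwdt_timeout_ticks and the two nominal clock values (150 kHz and 32768 Hz) are assumptions for illustration; on the device the frequency comes from rtc_clk_slow_freq_get_hz() at run time.

```c
#include <stdint.h>

/* Hypothetical helper (not part of ESP-IDF): convert a timeout in
 * milliseconds to watchdog stage ticks using the same formula as
 * app_main(): ms * slow_clk_hz / 1000.
 * The 64-bit intermediate keeps long timeouts from overflowing. */
uint32_t rwdt_timeout_ticks(uint32_t timeout_ms, uint32_t slow_clk_hz)
{
    return (uint32_t)(((uint64_t)timeout_ms * slow_clk_hz) / 1000u);
}

/* Assumed nominal slow-clock frequencies, for illustration only. */
enum {
    ASSUMED_SLOW_CLK_RC_HZ   = 150000, /* internal RC oscillator, nominal */
    ASSUMED_SLOW_CLK_XTAL_HZ = 32768   /* external 32.768 kHz crystal */
};
```

With these assumed values, a 20 s timeout is 3,000,000 ticks at 150 kHz and 655,360 ticks at 32768 Hz. The original expression multiplies in 32 bits, which is fine for 20 s but would wrap for much longer timeouts at 150 kHz.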

What’s strange is that when the CPUs get stuck, WDT_STAGE_ACTION_RESET_SYSTEM has no effect, but WDT_STAGE_ACTION_RESET_RTC is still able to reset the system. I’ll check with the hardware team as to why esp_cpu_stall() can cause the calling CPU to get stuck.

As for a quick workaround, commenting out SOC_HAL_STALL_OTHER_CORES in the panic handling code should be fine in most cases. Not stalling the other core during a panic should only be an issue in the edge case where both cores panic at the same time.

Does someone know if ESP-IDF V5.0.1 is also affected?

@Dazza0 any news related to this topic? Is there any patch available?

@Dazza0 @igrr

This issue has been marked as “Status: Selected for Development” for 1 year. What is the status now?

@Dazza0 @igrr Still no fix for this issue?

@vshymanskyy There is this diagram in the ESP32 TRM which tries to explain the difference:

[image: diagram from the ESP32 TRM]

There is a bit of a terminology mix-up here: WDT “System” reset refers to the “Core system”, labelled as “Core” in the diagram above. RTC reset resets both RTC and Core. The inline comment refers to the “Core system” as the “main system”…

So, the difference is that the power management part of the chip and the RTC domain are reset when RTC_WDT_STAGE_ACTION_RESET_RTC is used.

The likely reason why using RTC_WDT_STAGE_ACTION_RESET_RTC fixes CPU reset is that the registers controlling CPU stall are in the RTC domain, so resetting the RTC domain has a side effect of un-stalling the CPU.
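
To make that explanation concrete, here is a toy host-side model of the two reset scopes. It only illustrates the reasoning in the comment above, it is not verified hardware behavior. Every type and function name in it (chip_t, reset_core, reset_rtc, and so on) is hypothetical and does not correspond to real registers or ESP-IDF APIs.

```c
#include <stdbool.h>

/* Toy model only: hypothetical structures, not real ESP32 registers. */
typedef struct {
    bool cpu_stall_request[2]; /* stall control is assumed to live in the RTC domain */
} rtc_domain_t;

typedef struct {
    bool cpu_running[2];
} core_domain_t;

typedef struct {
    rtc_domain_t rtc;
    core_domain_t core;
} chip_t;

/* A CPU runs only while no stall is requested for it. */
static void apply_stall(chip_t *chip)
{
    for (int i = 0; i < 2; i++) {
        chip->core.cpu_running[i] = !chip->rtc.cpu_stall_request[i];
    }
}

void chip_init(chip_t *chip)
{
    chip->rtc.cpu_stall_request[0] = false;
    chip->rtc.cpu_stall_request[1] = false;
    apply_stall(chip);
}

void chip_stall_cpu(chip_t *chip, int cpu)
{
    chip->rtc.cpu_stall_request[cpu] = true;
    apply_stall(chip);
}

/* "System" (Core) reset: restarts the Core domain but leaves RTC-domain
 * state untouched, so a pending stall request survives. */
void reset_core(chip_t *chip)
{
    chip->core.cpu_running[0] = true;
    chip->core.cpu_running[1] = true;
    apply_stall(chip);
}

/* RTC reset: clears RTC-domain state, then resets the Core domain too. */
void reset_rtc(chip_t *chip)
{
    chip->rtc.cpu_stall_request[0] = false;
    chip->rtc.cpu_stall_request[1] = false;
    reset_core(chip);
}
```

In this model, a CPU stalled before reset_core() is still stalled afterwards, while reset_rtc() clears the stall request and both CPUs run again. That matches the observation that only the RTC-level reset action recovers the chip.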