nodemcu-firmware: SDK 3.0 crashes during file.format()

@EgorOA @NicolSpies reported in PR https://github.com/nodemcu/nodemcu-firmware/pull/2757#issuecomment-493603965 that the build craters when trying to format the SPIFFS.

Test code

Boot image with preformatted SPIFFS and then execute file.format(). I’ve added some diagnostic code to (temporarily) remove the start-up automatic format and entry /exit bookends around the call to spi_flash_erase_sector(sector_id) in the platform_flash_erase_sector() wrapper. This shows that the cpu resets during the spi_flash_erase_sector() call typically after 10-30 erase calls.

NodeMCU version

Current dev

Hardware

Wemos D1 Mini

Discussion

I have reverse compiled the spi_flash_erase_sector() in libmain.a(user_interface.o) and this does what I would expect:

int spi_flash_erase_sector(int n) {
  if (!protect_flag || !spi_flash_erase_sector_check(n))
    return SPI_FLASH_RESULT_ERR;
  spi_flash_check_wr_protect();
  system_soft_wdt_feed();
  Cache_Read_Disable_2();
  int s = SPIEraseSector(n);
  Cache_Read_Enable_2();
  return s;
} 

except that it uses the libmain.a(spi_flash.o) Cache_Read_Disable_2() and Cache_Read_Enable_2() instead of the equivalent BootROM Cache_Read_Disable() and Cache_Read_Enable(). These seem to be more limited in scope in that they apparently disable/enable the Icache without flushing it.

The stochastic nature of the crash strongly suggests to me that we have an ISR firing which includes Icached code. More investigation is needed

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 22 (3 by maintainers)

Most upvoted comments

Thanks @marcelstoer @jmattsson for staying on my case. I’ve found it. Here is my MCV

static int test_bulk_erase (lua_State *L) {
  uint32_t i, j, sa, ss;
  ss = platform_flash_get_partition (NODEMCU_SPIFFS0_PARTITION, &sa);
  sa /= INTERNAL_FLASH_SECTOR_SIZE; 
  ss /= INTERNAL_FLASH_SECTOR_SIZE;
  for (j=1; j<10;j++) {
    // print prolog and wait until UART empty 
    system_set_os_print(1);
    os_printf ("Try %u: erasing sectors %u to %u\n", j, sa, sa+ss-1);
    os_delay_us(50000); system_soft_wdt_feed();
    // now erase the spiffs partition
    for (i = sa; i < sa+ss; i++) {
      spi_flash_erase_sector(i);
    }
  }
  return 0;
}

Running test stub creates the crash with the fatal exception reported.

> test.bulk_erase() -- This will crash with a ~50% chance in any one loop
Try 1: erasing sectors 116 to 1023
Fatal exception 0(IllegalInstructionCause):
�pc1=0x4023b774, epc2=0x00000000, epc3=0x00000000, excvaddr=0x402711a8, depc=0x00000000
 ets Jan  8 2013,rst cause:2, boot mode:(3,6)

And rebooting into GDB with an x/i 0x4023b380 shows that this is chm_get_current_channel is a .text psect entry in libnet80211.a(wl_chm.o). One of the things that I had to do was to regress the eagle.app.v6.ld changes into nodemcu.ld. This moved libnet80211.a into irom0.

Well not quite – in that it actually moved .literal.* .text.* into irom0, but there is a single .text entry in libnet80211.a for chm_get_current_channel which is clearly called by one of the libphy or libpp ISRs.

Fix this one-line in nodemcu.ld and my test.bulk_erase() now executes solidly. I’ll do the push. Sorry guys.

dev is currently broken. I will do an MCV test case later today, which is just a flash erase loop. AFAIK, the only think that could cause this to crash is if some ISR code is invoking iRom0 routines. But I can’t see where any of our ISRs are firing in this loop and I’ve reviewed our ISR code, so I can only guess that the problem lies somewhere in the SDK code base.

If this is the case then dev so stay broken whilst it is based on SDK 3.x.

And free heap size is dramatically low … only 41456 free bytes

When I first got involved with the project we had ~15Kb heap for data and code. 🤣

With each SDK up-issue Espressif add features and functionality. These constantly generate code growth within the SDK libraries. Everyone is bouncing against the iRAM limit so they’ve been systematically moving code from iRAM to iROM to keep applications linkable. This is undoubtedly impacting system stability either because of SDK internal ISR-triggered code being hoisted into iROM or breaking simple time-critical assumptions because of the slower execution when moving instruction fetch from iRAM to iROM.

As the Espressif codebase is closed source, we as a community have no ability to work with the Espressif maintainers on this.

From a project PoV, moving from 2.2.1 to 3.0 seems to have introduced a step drop in stability for no functional gain. I do wonder if the wise course here would be to move dev to dev-3.0 and regress the mainline dev to a 2.2.1 baseline. It would be trivial to backport a simple partition table for our partitions (the SDK effectively ignores these anyway) so that we could still offer the extra runtime / provision-time configuration of LFS, SPIFFS, etc.

OK, I’ve done the PR for this bugfix. I’ve also merged it, since the previous dev was broken.

One aside issue is that now there are two malloc variants dictated by the extra parameter to pvPortMalloc(); if this is true then iRAM memory is allocated preferentially before dRAM. This does beg the Q about the behaviour of os_realloc() which doesn’t pass the parameter, so another bit of reverse compilation gives:

#define os_realloc(p, s)  pvPortRealloc(p, s, "", __LINE__)

void *pvPortRealloc (void *p, size_t n, const char *srcFile, unsigned srcLine) {
  if (n == 0)
    vPortFree(p, srcFile, srcLine);
    return 0;
  }
  void *q = pvPortMalloc(n, srcFile, srcLine, false);
  if (q && n) {
    ets_memcpy(q, p, n);
    vPortFree(p, srcFile, srcLine);
    }
  return q;
}

in other words it forces dRAM allocation which is what I want. We have about 5K of iRAM free in standard builds, but including gdbstub uses all of this up, so I can’t really rely on being able to allocating data in iRAM statically. However we might be able to use explicit os_malloc_iram() during startup to allocate what are effectively static structures in iRAM if memory is available.