esp-idf: WROVER-B Flash Corrupted in Field (IDFGH-2932)
Environment
- Development Kit: none
- Module or chip used: ESP32-WROVER-B
- IDF version: v3.3.1
- Build System: Make
- Compiler version: xtensa-esp32-elf-gcc (crosstool-NG crosstool-ng-1.22.0-80-g6c4433a) 5.2.0
- Operating System: macOS
- Power Supply: Battery
Problem Description
Short version: A small percentage of boards in the field have had their flash corrupted so that they are unable to boot. It appears the first half or more of the flash has been overwritten with random data.
Long version: I started this as a forum thread (https://esp32.com/viewtopic.php?f=2&t=14719), which contains significant additional detail.
Customers report a boot loop. Upon receipt of the device the console shows a loop of:
rst:0x10 (RTCWDT_RTC_RESET),boot:0x3b (SPI_FAST_FLASH_BOOT)
flash read err, 1000
ets_main.c 371
ets Jun 8 2016 00:22:57
I used esptool.py to pull the flash image off the board and found that the bootloader, partition table, ota_data, first application image and part of the second application image are overwritten with what appears to be random data. The only flash writing that my firmware does is via NVS and OTA APIs. I do not access the flash directly.
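One way to map the extent of the damage is to compare the dump pulled from a failed board against a dump from a known-good board running the same firmware, one 4 KiB flash sector at a time. The following is a minimal Python sketch of that comparison, not the procedure actually used here; the function name and the file names in the usage comment are hypothetical.

```python
SECTOR_SIZE = 0x1000  # 4 KiB, the SPI flash erase-sector size on ESP32 modules


def differing_sector_ranges(good: bytes, bad: bytes, sector_size: int = SECTOR_SIZE):
    """Return (start, end) byte offsets of contiguous runs of sectors that differ.

    `end` is exclusive. Only the overlapping length of the two images is compared.
    """
    length = min(len(good), len(bad))
    ranges = []
    run_start = None
    for offset in range(0, length, sector_size):
        same = good[offset:offset + sector_size] == bad[offset:offset + sector_size]
        if not same and run_start is None:
            run_start = offset          # a differing run begins here
        elif same and run_start is not None:
            ranges.append((run_start, offset))
            run_start = None
    if run_start is not None:           # run extends to the end of the image
        ranges.append((run_start, length))
    return ranges


# Hypothetical usage, with both dumps read off the boards beforehand:
#   with open("good.bin", "rb") as g, open("bad.bin", "rb") as b:
#       for start, end in differing_sector_ranges(g.read(), b.read()):
#           print(f"0x{start:06x}-0x{end:06x}")
```

Regions such as NVS and ota_data legitimately differ between devices, so the reported ranges need to be read against the partition table rather than treated as corruption on their own.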
My hunch is that the corruption happens during rapid power cycles, perhaps caused by low-battery brownout. I have not been able to confirm that in person yet, but I see some evidence of it.
Reflashing the board via UART recovers it just fine and it operates normally.
A few findings that might be of significance:
- I noticed that even the first 0x1000 bytes of flash contain the random data. On good boards I’ve seen that this is instead 0xff. I don’t flash anything to that area, but I don’t know if the ESP uses it internally for anything. If it doesn’t, it seems odd to me that it would contain data.
- The 3.3 V eFuse is set during provisioning, and I verified that it was still set on the affected board. We use MTDI for other purposes, so this is required in our use case.
- The entropy (calculated with ent command line tool) of the bad image is twice that of a corresponding good image.
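The first and last checks above can be reproduced without extra tools. The sketch below computes Shannon entropy in bits per byte, which is assumed to be the same figure that `ent` reports, and tests whether a region still holds the erased value 0xff. The function names and the file name in the usage comment are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy_bits_per_byte(data: bytes) -> float:
    """Shannon entropy of the byte distribution: 0.0 (constant) to 8.0 (uniform)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # occurrences of each byte value
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def is_erased(data: bytes) -> bool:
    """True if every byte is 0xff, the value NOR flash holds after an erase."""
    return data.count(0xFF) == len(data)


# Hypothetical usage on a dump read off the board:
#   with open("bad.bin", "rb") as f:
#       image = f.read()
#   print(shannon_entropy_bits_per_byte(image))
#   print(is_erased(image[:0x1000]))   # the first 0x1000 bytes
```

An image that is mostly erased space and compiled code sits well below 8 bits per byte, while data that is close to random approaches 8, so a large jump in this figure is consistent with the overwrite described above.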
This issue has affected a small but significant number of devices in the field. It results in a completely bricked device that requires return. I’d really appreciate some help or ideas on how this could be happening.
Thanks, Jason
About this issue
- State: closed
- Created 4 years ago
- Comments: 16 (13 by maintainers)
We went through this with ~3000 chips in the field 😦 It cost us a ton of money. The fix and explanation are on our website (link below). Good news: you can integrate it into your own next firmware update. Bad news: you have to do it before your modules fail.
https://en.hoerbert.com/technology/esp32-critical-fatal-problem-source-in-some-wrover-e-modules/