littlefs: Superblock corruption

Two devices in our test fleet experienced a weird corruption where the filesystem is not obviously corrupt but the superblock seems to have lost its superness - missing the littlefs tag, for example. I have two flash dumps; here are the first two blocks from one of them: https://gist.github.com/rojer/6a9fe5f2947a12b660570534252474f8

I can share full dumps privately, please feel free to reach out.

I have examined console logs from the devices at the time the failure occurred and it doesn’t seem to be associated with anything unusual - no power loss event, for example, though both failures happened shortly after a soft reboot. The devices remained responsive, though they seemingly lost the ability to read files (file not found errors). Upon rebooting, both were unable to mount their filesystems (unsurprisingly) and thus became bricks.

About this issue

  • State: open
  • Created 4 months ago
  • Comments: 28 (27 by maintainers)

Most upvoted comments

Following an out-of-band discussion, and thanks to work from Nikola Kosturski and @rojer, a reproducible test case for the missing magic was found.

What’s interesting is that this turned out not to be an erroneous case as far as lfs_mount is concerned: lfs_mount is happy as long as a valid superblock entry exists anywhere in the superblock chain.
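For illustration, here is a rough sketch of that acceptance rule (illustrative pseudologic, not littlefs’s actual implementation; fetch_mdir and mdir_has_superblock_entry are hypothetical helpers):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t lfs_block_t;

struct mdir {
    lfs_block_t tail[2]; /* next metadata pair in the chain */
    bool has_tail;
};

/* hypothetical: fetch and validate the metadata pair at `pair` */
extern bool fetch_mdir(const lfs_block_t pair[2], struct mdir *out);
/* hypothetical: does this metadata pair carry a superblock entry? */
extern bool mdir_has_superblock_entry(const struct mdir *m);

/* Mount succeeds if *any* metadata pair in the chain rooted at {0, 1}
 * carries a valid superblock entry - the magic does not have to live in
 * blocks 0/1 themselves, which is why a dump can look "de-supered"
 * while still mounting. */
bool chain_has_superblock(void) {
    lfs_block_t pair[2] = {0, 1};
    struct mdir m;
    while (fetch_mdir(pair, &m)) {
        if (mdir_has_superblock_entry(&m)) {
            return true;
        }
        if (!m.has_tail) {
            break;
        }
        pair[0] = m.tail[0];
        pair[1] = m.tail[1];
    }
    return false;
}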

But this is an issue in terms of documentation and expected behavior. I’ve put up https://github.com/littlefs-project/littlefs/pull/959 to try to fix the state of things.

There is still an issue with the superblock entry disappearing entirely, which shouldn’t happen and whose cause is unknown.

we have an early indication that it is caused by a change between 2.7.0 and 2.8.2 - we managed to isolate what we think is a reliable repro for this, and going back to the 2.7.0 code (with the same fs state) makes it go away (at least the lfs_prog assertion). this needs more work to fully confirm (the week ended, leaving us with a bit of a cliffhanger there 😃) but if you focus on the changes between 2.7.0 and 2.8.2, it might be there somewhere.

no, we’re ok wrt stack: 6K and 4K free out of 8K total. additionally, we have a stack canary watchpoint at the end of the stack (CONFIG_FREERTOS_WATCHPOINT_END_OF_STACK=y, if you are familiar with idf) - it causes an exception in case of stack overflow.

What exact commit hash are you on? The line numbers don’t seem to quite line up with 2.8.2.

it’s 2.8.2 + assertion + lfs_probe from https://github.com/littlefs-project/littlefs/issues/947. Here it is, for reference: https://gist.github.com/rojer/cc5fd75cd65a7c64df85c73287616bcb

Do you ever move the lfs_file_t struct?

no, it is allocated on open and never moved while the file is open.

Does the lfs_file_t’s .type field make sense?

i think it does. at least it doesn’t look obviously stomped over:

(gdb) frame 16
#16 0x420445ea in lfs_file_close (lfs=lfs@entry=0x3fca9538, file=file@entry=0x3fcc2ad0) at lfs.c:6067
6067    lfs.c: No such file or directory.
(gdb) print *file
$1 = {next = 0x0, id = 23, type = 1 '\001', m = {pair = {1, 0}, rev = 512, off = 4096, etag = 1343224952, count = 26, erased = false, split = false, tail = {4294967295, 4294967295}}, ctz = {head = 116, size = 136}, flags = 66818, pos = 136, block = 116, off = 136, cache = {block = 4294967295, off = 128, size = 8, 
    buffer = 0x3fcc2c64 '\377' <repeats 128 times>, "T"}, cfg = 0x3c170d38 <defaults>}

(gdb) print *lfs
$1 = {rcache = {block = 4294967295, off = 128, size = 128, buffer = 0x3fca95c0 '\377' <repeats 128 times>, "\200"}, pcache = {block = 0, off = 0, size = 24, buffer = 0x3fca9644 "\001\002"}, root = {117, 118}, mlist = 0x3fcc2ad0, seed = 596460605, gstate = {tag = 0, pair = {0, 0}}, gdisk = {tag = 0, pair = {0, 0}}, 
  gdelta = {tag = 0, pair = {0, 0}}, free = {off = 116, size = 256, i = 3, ack = 254, buffer = 0x3fca96c8}, cfg = 0x3fca94e8, block_count = 256, name_max = 255, file_max = 2147483647, attr_max = 1022}
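
For reference (this check is hypothetical, not part of littlefs): in lfs.h, LFS_TYPE_REG is 0x001, so the type = 1 '\001' in the *file dump above is consistent with a regular open file. A sanity check along those lines might look like:

#include "lfs.h"

/* hypothetical sanity check mirroring the gdb inspection above */
static void check_open_file(const lfs_file_t *file) {
    /* LFS_TYPE_REG == 0x001 in lfs.h; anything else would suggest
     * the handle has been stomped on */
    LFS_ASSERT(file->type == LFS_TYPE_REG);
}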

re-sent with a link instead

assert that it never returns block 0 or 1 - this would only work if you don’t format

we only format externally, when preparing an image, so i can differentiate between usage by the device vs mklfs (our image creation tool). i added the assertion to lfs_alloc:

diff --git a/littlefs/lfs.c b/littlefs/lfs.c
index 105915b..d9ad91f 100644
--- a/littlefs/lfs.c
+++ b/littlefs/lfs.c
@@ -662,7 +662,11 @@ static int lfs_alloc(lfs_t *lfs, lfs_block_t *block) {
                     lfs->free.i += 1;
                     lfs->free.ack -= 1;
                 }
-
+// Debug LFS corruption
+// https://github.com/littlefs-project/littlefs/issues/953#issuecomment-1984394787
+#ifndef LFS_TOOLS_BUILD
+                LFS_ASSERT(!(*block == 0 || *block == 1));
+#endif
                 return 0;
             }
         }

also dropped LFS_NO_ASSERT (which we had enabled to save space). will try over the next few days and report.
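
For context: LFS_ASSERT in lfs_util.h compiles to a no-op when LFS_NO_ASSERT is defined, so the new assertion in lfs_alloc only fires with that flag removed. The mechanism is roughly:

/* from lfs_util.h (paraphrased) */
#ifndef LFS_ASSERT
#ifndef LFS_NO_ASSERT
#define LFS_ASSERT(test) assert(test)
#else
#define LFS_ASSERT(test)
#endif
#endif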

Is the log_090-20240301-205905.log file stored in the root dir?

yes (everything is in the root dir in our case)

Has anything changed hardware-wise?

nope, we’ve been using the same block layer for ages

Is it possible that, because of the increased writing, this is the first time littlefs has wrapped around the storage due to wear-leveling?

that is certainly a possibility.

How large is your storage?

in this case - 448 KB (we have devices with varying fs sizes). about half of it is one big file (index.html.gz), and apart from that it’s a dozen or so small files - config, some storage… and the recent addition is the log files, a set of at most 4, at most 4K each.

see if the last blocks on the storage contain any interesting data

no, the last 2 blocks appear to be data blocks (they contain what appear to be valid files - a log file and a config json file respectively).

if any other superblocks (blocks with “littlefs” at offset=8 bytes) exist on the system

no, in fact the string “littlefs” does not appear anywhere in the entire image
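
A minimal sketch of that scan, assuming a raw flash image and a 4096-byte block size (the block size is an assumption for illustration; adjust to the device’s actual geometry):

#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096
#define MAGIC_OFF  8

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s <image>\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "rb");
    if (!f) {
        perror("fopen");
        return 1;
    }

    unsigned char block[BLOCK_SIZE];
    long n = 0, found = 0;
    /* check every block for the "littlefs" magic at byte offset 8 */
    while (fread(block, 1, BLOCK_SIZE, f) == BLOCK_SIZE) {
        if (memcmp(block + MAGIC_OFF, "littlefs", 8) == 0) {
            printf("superblock magic at block %ld\n", n);
            found++;
        }
        n++;
    }
    fclose(f);
    if (!found) {
        printf("no \"littlefs\" magic found in %ld blocks\n", n);
    }
    return 0;
}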