lima: VM disk corruption with Apple Silicon
[!TIP]
EDIT by @AkihiroSuda
For --vm-type=vz, this issue seems to have been solved in Lima v0.19 (https://github.com/lima-vm/lima/pull/2026)
Description
Lima version: 0.18.0
macOS: 14.0 (23A344)
VM: AlmaLinux 9
I was trying to do a big compile, using a VM with the attached configuration (vz):
NAME           STATUS     SSH                VMTYPE    ARCH       CPUS    MEMORY    DISK      DIR
myalma9        Running    127.0.0.1:49434    vz        aarch64    4       16GiB     100GiB    ~/.lima/myalma9
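For anyone trying to reproduce a similar setup, a roughly equivalent invocation might look like the sketch below (the flag names and the `template://almalinux-9` template are assumptions based on recent limactl; the author actually used an attached template):

```console
# Create and start a vz-backed AlmaLinux 9 instance with similar resources
limactl start --name=myalma9 --vm-type=vz --cpus=4 --memory=16 --disk=100 template://almalinux-9
```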
The build aborted with:
from /Volumes/Lima/build/build/AthenaExternals/src/Geant4/source/processes/hadronic/models/lend/src/xDataTOM_LegendreSeries.cc:7:
/usr/include/bits/types.h:142:10: fatal error: /usr/include/bits/time64.h: Input/output error
And afterwards, even in a different terminal, I see:
[emoyse@lima-myalma9 emoyse]$ ls
bash: /usr/bin/ls: Input/output error
I was also logged into a display, and there I saw e.g.
If I try to log in again with:
limactl shell myalma9
each time I see something like the following appear in the display window:
[56247.6427031] Core dump to |/usr/lib/systemd/systemd-coredump pipe failed
Edit: there has been a lot of discussion below. The corruption can happen with both vz and qemu, and on disks both external and internal to the VM; some permutations seem more likely to provoke a corruption than others. My experiments are summarized in the table in a comment below.
This may fix the issue for vz:
(Thanks to @wpiekutowski https://github.com/utmapp/UTM/issues/4840#issuecomment-1824340975 and @wdormann https://github.com/utmapp/UTM/issues/4840#issuecomment-1824542732)
My apologies for the delay in replying, but I have been looking into this. The workflow is the same - compile https://gitlab.cern.ch/atlas/atlasexternals using the attached template with various configurations of host, qemu/vz, cores and memory.
TL;DR: updating to 6.5.10-1 was more stable on M2 (even on the 'shared' volume /tmp/lima), but apparently worse on M1 Pro (though the M1 Pro has more cores and we pushed it a lot harder). Updating to 6.6.1 was better on M1 Pro (have not tested M2 yet), but I got xfs corruption at the very end.
With 6.6.1 I also disabled sleeping on the guest (from a hint here).
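The exact commands aren't quoted above; a minimal sketch of the usual systemd way to disable sleep on a guest (an assumption on my part - the guest would need to run systemd, as AlmaLinux 9 does):

```console
# Mask the sleep/suspend targets so the guest never tries to sleep during long builds
sudo systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target
```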
Notes:
xfs means xfs corruption was reported.
In /var/log/messages I see:
And in the display I see:
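One way to catch these errors as they happen (not quoted from the thread, just a suggestion assuming the systemd journal is available in the guest) is to follow the kernel log during the build:

```console
# Follow kernel messages in the guest and flag I/O or XFS errors as they appear
sudo journalctl -kf | grep -Ei 'i/o error|xfs|corrupt'
```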
So there seem to be a lot of references to issues involving external disks and non-APFS filesystems. I am using the internal disk on my M2 mini with the default APFS filesystem, and I've experienced disk corruption once. I haven't been able to force it to reproduce (though, to be honest, I haven't tried very hard), but I wanted to point out that external disks and other filesystems may not be the specific cause - they may just make the corruption easier to trigger than internal APFS does.
I run Debian Bookworm, and after repairing the filesystem with fsck I also upgraded my kernel from linux-image-cloud-arm64 6.1.55-1 to 6.5.3-1~bpo12+1 in backports.
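For reference, a hedged sketch of pulling that kernel from bookworm-backports (the package name is taken from the comment above; the sources.list line assumes a standard backports setup):

```console
# Enable bookworm-backports (if not already enabled), then install the newer cloud kernel
echo 'deb http://deb.debian.org/debian bookworm-backports main' | sudo tee /etc/apt/sources.list.d/backports.list
sudo apt update
sudo apt install -t bookworm-backports linux-image-cloud-arm64
sudo reboot
```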
And for me, I'm not using external (to the VM) disks any more - if you look at the table I posted here, you will see that in the Where column I'm mostly using /tmp, i.e. working completely inside the VM. Using an external disk might provoke the corruption earlier, but it's certainly not the only route to it (though later kernels seem quite a bit more stable).
Hey @afbjorklund, I've been running some more tests, and I just had corruption from /tmp, so it doesn't cure it (but perhaps it makes it slightly less likely). Updating the original post.
ARM64 atomics were broken until last year, when I found the issue and got it fixed (it was breaking workqueues, which was causing problems with TTYs for me, but who knows what else). 5.14 (released 2021) is definitely broken unless it's a branch with all the required backports.
Try 6.4, that should work. 6.5.0 was a very recent regression. I would not put much faith in older kernels, especially anything older than 5.18 which is where we started. All bets are off if you’re running kernels that old on bleeding edge hardware like this. Lots of bugfixes don’t get properly backported into stable branches either. Apple CPUs are excellent at triggering all kinds of nasty memory ordering bugs that no other CPUs do, because they speculate/reorder across ridiculous numbers of instructions and even things like IRQs (yes really).
I’ll admit I’m not familiar with Lima. When you say “make it mountable from within the VM”, what does that mean?
Perhaps Lima does this all for you under the hood, but I suppose that I’d need to know exactly what it’s doing to have any hope of understanding what’s going on.
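For context, the "external (to the VM)" disks in this thread appear to refer to Lima's additional-disk mechanism; a rough sketch based on Lima's documented commands (not on anything quoted here, so treat the details as assumptions):

```console
# Create a named data disk; the image is kept under ~/.lima/_disks/
limactl disk create data --size 100GiB
# The instance template then references it:
#   additionalDisks:
#   - name: "data"
# Lima attaches it to the guest as a virtio-blk device and mounts it inside the VM,
# separately from the virtiofs/9p/reverse-sshfs host directory mounts.
```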
Is this relevant?
(UTM uses vz too)
Looks like people began to hit this issue in September, so I wonder whether Apple introduced a regression around that time?
I still can't repro the issue locally though. (macOS 14.1 on Intel MacBook Pro 2020, macOS 13.5.2 on EC2 mac2-m2pro)