firecracker: [Bug] Processes get stuck after resuming VM from snapshot

Describe the bug

After resuming a VM from a snapshot, processes occasionally get stuck. A minimal example is an init binary that just runs while (true) { sleep 100ms ; print 'hello' }. After resuming from a snapshot, the loop only sometimes continues; other times it gets stuck and prints nothing at all.
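For illustration, the loop described above could look roughly like this in C. This is a sketch, not the code from the linked repo: the helper names sleep_ms and run_loop are hypothetical, and the printed string is arbitrary.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <time.h>

/* Sleep for `ms` milliseconds using nanosleep(2), restarting the call if a
 * signal interrupts it. Returns 0 on success, -1 on any other error. */
static int sleep_ms(long ms)
{
    assert(ms >= 0);
    struct timespec req = { .tv_sec = ms / 1000,
                            .tv_nsec = (ms % 1000) * 1000000L };
    struct timespec rem;

    while (nanosleep(&req, &rem) == -1) {
        if (errno != EINTR)
            return -1;
        req = rem; /* continue with the time that was left */
    }
    return 0;
}

/* Print a line, then sleep 100 ms, `iterations` times.
 * A negative count loops forever, as an init process would.
 * Returns the number of completed iterations. */
static long run_loop(long iterations)
{
    long done = 0;

    while (iterations < 0 || done < iterations) {
        puts("hello");
        fflush(stdout); /* make the line visible immediately */
        if (sleep_ms(100) != 0)
            break;
        done++;
    }
    return done;
}
```

An init binary built from this sketch would call run_loop(-1) from main and be linked statically, which is why the steps below ask for glibc-static.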

To Reproduce

  1. Clone the following repo, which has the minimal code needed to reproduce: https://github.com/bduffany/firecracker-sleep-issue
  2. Make sure you have make, cc (C compiler, from gcc package), glibc-static (for statically linking the init binary), and jq (for parsing firecracker API output)
  3. Run make test

make test is doing the following:

  • Fetches the firecracker v1.4.0 release binary from GitHub as well as vmlinux 4.14 from the https://github.com/firecracker-microvm/firecracker-demo repo
  • Runs firecracker with an initrd where the init binary just loops infinitely, printing “running” then sleeping for 100ms.
  • Streams the VM logs to the terminal (using tail -f in the background)
  • Every 1s, pauses the VM, takes a snapshot, then kills firecracker (SIGTERM), then restarts firecracker, resuming the VM from snapshot.

Expected behaviour

When running make test in that repo, the init binary should print running several times after each resume. But on some resumes, it appears stuck, and does not print anything until the next resume.

Interesting details:

  • The issue does NOT reproduce if I replace the nanosleep syscalls with an NOP loop of 1e9 iterations (for (int i=0; i<1e9; i++) continue;)
  • The issue seems specific to snapshotting, not just pausing and resuming the VM. I tried pausing and resuming alone, without taking a snapshot or restarting the firecracker binary in between, and could not reproduce the issue.
  • I could not reproduce this on an Intel CPU so far, only AMD (have tried 2 different Intel machines and 2 different AMD machines).
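The busy-wait variant from the first bullet can be sketched as below. This is an assumption-laden illustration rather than the repo's code: the function names are hypothetical, and the volatile counter is an addition so that an optimizing compiler does not delete the empty loop.

```c
#include <assert.h>
#include <stdio.h>

/* Spin for `iterations` loop iterations without making any syscall.
 * The counter is volatile so the compiler cannot remove the loop.
 * Returns the final counter value. */
static unsigned long busy_wait(unsigned long iterations)
{
    volatile unsigned long i;

    for (i = 0; i < iterations; i++)
        ; /* no-op */
    return i;
}

/* Same shape as the sleep-based loop, but spins instead of sleeping.
 * A negative `rounds` loops forever. Returns the completed round count. */
static long run_busy_loop(long rounds, unsigned long spins_per_round)
{
    long done = 0;

    assert(spins_per_round > 0); /* zero spins would print in a tight loop */
    while (rounds < 0 || done < rounds) {
        puts("running");
        fflush(stdout);
        busy_wait(spins_per_round);
        done++;
    }
    return done;
}
```

Calling run_busy_loop(-1, 1000000000UL) would correspond to the 1e9-iteration loop mentioned above. Because it never enters the kernel to sleep, it does not depend on a timer firing to wake the process.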

Environment

  • Firecracker v1.4.0
  • Host kernel v6.2.0 (Ubuntu 22.04)
    • UPDATE: also reproduced on host kernel v5.10 - m6a.metal instance
  • Guest kernel 4.14.55-84.37.amzn2.x86_64, from https://github.com/firecracker-microvm/firecracker-demo
    • Also tried compiling 5.10 using the recommended guest config - the issue still reproduces.
  • Rootfs: none (initrd only)
  • Architecture: x86_64 (AMD Ryzen CPU)
    • UPDATE: Also reproduced with AMD EPYC (m6a.metal instance)
  • GLIBC 2.35 (for init binary)

Additional context

The repro above is a minimal example of a much more troublesome issue: we are having trouble reconnecting to microVMs after resuming them from snapshots. We run a server inside the VM and have trouble connecting to it over vsock. I suspected that the guest process was “stuck” somehow, since when running sleep 1 && print('hello') in a background loop, it sometimes doesn’t print anything. I came up with this minimal reproducer for that behavior.

Checks

  • Have you searched the Firecracker Issues database for similar problems?
  • Have you read the existing relevant Firecracker documentation?
    • I have read the FAQ about guest clock drift / NTP, but this appears more significant than just clock drift since I think nanosleep should work based on relative timing? I could be wrong, though.
  • Are you certain the bug being reported is a Firecracker issue?
    • Not 100% certain, but given that it happens only when loading a snapshot, it seems like it could be Firecracker-related
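One way to check the relative-timing assumption is to measure how long each sleep actually takes. The sketch below is hypothetical and not part of the repro repo; the names elapsed_ms and timed_sleep_ms are made up. It relies on the Linux nanosleep(2) man page's statement that Linux measures the sleep against CLOCK_MONOTONIC, and it reads the same clock before and after the call.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Difference b - a, in whole milliseconds (truncated). */
static long elapsed_ms(struct timespec a, struct timespec b)
{
    long long ns = (long long)(b.tv_sec - a.tv_sec) * 1000000000LL
                 + (long long)(b.tv_nsec - a.tv_nsec);
    return (long)(ns / 1000000LL);
}

/* Sleep for `ms` milliseconds and return how long the call actually took
 * according to CLOCK_MONOTONIC, in milliseconds, or -1 on error. */
static long timed_sleep_ms(long ms)
{
    assert(ms >= 0);
    struct timespec req = { .tv_sec = ms / 1000,
                            .tv_nsec = (ms % 1000) * 1000000L };
    struct timespec start, end;

    if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
        return -1;
    if (nanosleep(&req, NULL) != 0)
        return -1; /* includes interruption by a signal */
    if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
        return -1;
    return elapsed_ms(start, end);
}
```

Printing the value returned by timed_sleep_ms(100) on every iteration would show whether a sleep that spans a snapshot resume returns late or does not return at all.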

About this issue

  • State: open
  • Created 10 months ago
  • Reactions: 2
  • Comments: 23 (20 by maintainers)

Most upvoted comments

Hi @bduffany. We are still working on the root cause of this issue. We will update you as soon as possible.

Hi @bduffany,

Sorry for the delay. Unfortunately, we were not able to make progress on this issue. We are still tracking the problem and will provide an update once we have more data to share.

Hi @bduffany, I am currently looking at this. I found that passing lapic=notscdeadline in the kernel boot command line also seems to fix it, so I think that narrows it down a bit. I will continue investigating, but could you confirm?

Hi @bduffany, thanks for providing the repro steps. We are working on this issue. To provide an update:

  • I was able to reproduce the issue on an AMD instance
  • I modified your steps to run 5 times instead of 100 and saved the snapshots from each run. When I loaded the snapshots manually, I could see the prints that were not seen when running the script. This means the guest takes more time to start in the failing scenario, but it is always able to boot when tried manually.
  • In some cases the prints appeared when a sleep of 5 seconds was used instead of 1, but this was not consistent. We’ll continue investigating this and let you know if we have an update.

Ah OK, I will actually just spin one of those up on AWS then 😃 Thanks

Does Firecracker have any test suites that run against AMD machines? I would be happy to contribute a test that demonstrates the issue.

Yes. m6a.metal is an AMD instance that Firecracker is tested on. Thanks for your help and effort!