firecracker: Occasional KVM_EXIT_SHUTDOWN and bad syscall (14) during VM shutdown

Firecracker-containerd recently updated a test case that previously ran 5 VMs to now run 100, and we started seeing the following errors occur in ~0-3 VMs per test case (issue).

As part of terminating VMs, we tell systemd in the guest to reboot the system. Most of the time this works fine: we see reboot: machine restart in the systemd logs, and the VMM exits with 0 as expected right after.

Occasionally though, the guest reaches that reboot: machine restart line and then we see a few log lines from Firecracker like:

2019-12-06T20:29:30.007995741 [anonymous-instance:INFO] Received KVM_EXIT_SHUTDOWN signal
2019-12-06T20:29:30.008087608 [anonymous-instance:ERROR] Shutting down VM after intercepting a bad syscall (14).

Followed by the VMM exiting with code 148. The KVM_EXIT_SHUTDOWN case and bad syscall (which appears to be rt_sigprocmask) ~seem to only ever occur together (never just one or the other)~ and only right after the reboot call should be occurring.

  • EDIT: I’ve now very rarely seen KVM_EXIT_SHUTDOWN occur without an accompanying rt_sigprocmask call that results in the seccomp violation. When KVM_EXIT_SHUTDOWN occurs by itself, the VMM exits with 0 with no apparent issues.

Additionally, I don’t see this for every VM that exits like that, but I am occasionally getting a line like this in dmesg after a run of the test case where failures occur:

[10548769.845897] traps: fc_vcpu0[52462] trap invalid opcode ip:527d36 sp:7f7ee67f0590 error:0 in firecracker[400000+1ab000]

No idea if it’s related or, if it is, why it wouldn’t be written every time a VM exits with KVM_EXIT_SHUTDOWN+the bad syscall, but figured it was worth mentioning.

This same behavior occurs in both 0.19 (w/ rust1.35+musl) and a 0.20 binary I built from here (w/ rust1.38+musl).

Let us know what, if anything, would be helpful for further reproducing or debugging, as it seems to be a pretty rare issue. We can run our 100 VM test case with debugging binaries and/or see if it’s possible to make a Docker container reproducer for you to use, or something like that.

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 16 (12 by maintainers)

Most upvoted comments

I’m not sure if it’s helpful, but I ran the test case with strace attached to all firecracker processes, the occurrence of the error w/ full rt_sigprocmask args looks like:

ioctl(56, KVM_RUN, 0)                   = 0
ioctl(56, KVM_RUN, 0)                   = 0
ioctl(56, KVM_RUN, 0)                   = 0
write(22, "2019-12-11T01:34:59.969149820 [anonymous-instance:INFO] Received KVM_EXIT_SHUTDOWN signal\n", 90) = 90
write(58, "\1\0\0\0\0\0\0\0", 8)        = 8
close(58)                               = 0
close(56)                               = 0
munmap(0x7f5032e45000, 12288)           = 0
sigaltstack({ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=8192}, NULL) = 0
munmap(0x7f5032c41000, 8192)            = 0
rt_sigprocmask(SIG_BLOCK, ~[], [QUIT ILL BUS FPE KILL USR2 PIPE TERM CHLD CONT STOP TSTP TTIN TTOU URG], 8) = 14
--- SIGSYS {si_signo=SIGSYS, si_code=SYS_SECCOMP, si_call_addr=0x59dff0, si_syscall=__NR_rt_sigprocmask, si_arch=AUDIT_ARCH_X86_64} ---
write(22, "2019-12-11T01:34:59.969586223 [anonymous-instance:ERROR] Shutting down VM after intercepting a bad syscall (14).\n", 113) = 113
write(24, "{\"utc_timestamp_ms\":19550254,\"api_server\":{\"process_startup_time_us\":0,\"process_startup_time_cpu_us\":0,\"sync_response_fails\":0,\"sync_vmm_send_timeout_count\":0},\"block\":{\"activate_fails\":0,\"cfg_fails\":0,\"event_fails\":0,\"execute_fails\":0,\"invalid_reqs_count\":0,\"flush_count\":0,\"queue_event_count\":0,\"rate_limiter_event_count\":0,\"update_count\":0,\"update_fails\":0,\"read_bytes\":0,\"write_bytes\":0,\"read_count\":0,\"write_count\":0},\"get_api_requests\":{\"instance_info_count\":0,\"instance_info_fails\":0,\"machine_cfg_count\":0,\"machine_cfg_fails\":0},\"i8042\":{\"error_count\":0,\"missed_read_count\":0,\"missed_write_count\":0,\"read_count\":10,\"reset_count\":10,\"write_count\":10},\"logger\":{\"missed_metrics_count\":0,\"metrics_fails\":0,\"missed_log_count\":0,\"log_fails\":0},\"mmds\":{\"rx_accepted\":0,\"rx_accepted_err\":0,\"rx_accepted_unusual\":0,\"rx_bad_eth\":0,\"rx_count\":0,\"tx_bytes\":0,\"tx_count\":0,\"tx_errors\":0,\"tx_frames\":0,\"connections_created\":0,\"connections_destroyed\":0},\"net\":{\"activate_fails\":0,\"cfg_fails\":0,\"event_fails\":0,\"rx_queue_event_count\":0,\"rx_event_rate_limiter_count\":0,\"rx_tap_event_count\":0,\"rx_bytes_count\":0,\"rx_packets_count\":0,\"rx_fails\":0,\"rx_count\":0,\"tx_bytes_count\":0,\"tx_fails\":0,\"tx_count\":0,\"tx_packets_count\":0,\"tx_queue_event_count\":0,\"tx_rate_limiter_event_count\":0,\"tx_spoofed_mac_count\":0},\"patch_api_requests\":{\"drive_count\":0,\"drive_fails\":0,\"network_count\":0,\"network_fails\":0,\"machine_cfg_count\":0,\"machine_cfg_fails\":0},\"put_api_requests\":{\"actions_count\":0,\"actions_fails\":0,\"boot_source_count\":0,\"boot_source_fails\":0,\"drive_count\":0,\"drive_fails\":0,\"logger_count\":0,\"logger_fails\":0,\"machine_cfg_count\":0,\"machine_cfg_fails\":0,\"network_count\":0,\"network_fails\":0},\"rtc\":{\"error_count\":0,\"missed_read_count\":0,\"missed_write_count\":0},\"seccomp\":{\"num_faults\":1},\"vcpu\":{\"exit_io_in\":
65,\"exit_io_out\":67,\"exit_mmio_read\":0,\"exit_mmio_write\":0,\"failures\":0,\"filter_cpuid\":0},\"vmm\":{\"device_events\":0,\"panic_count\":0},\"uart\":{\"error_count\":0,\"flush_count\":52,\"missed_read_count\":0,\"missed_write_count\":0,\"read_count\":0,\"write_count\":52},\"memory\":{\"dirty_pages\":0},\"signals\":{\"sigbus\":0,\"sigsegv\":0}}\n", 2111) = 2111
exit_group(148)                         = ?
+++ exited with 148 +++

What’s kind of interesting is that I can also see some strace logs for different VMs showing fc_vcpu0 threads that apparently receive KVM_EXIT_SHUTDOWN but then don’t call rt_sigprocmask, allowing them to exit with 0:

ioctl(56, KVM_RUN, 0)                   = 0
ioctl(56, KVM_RUN, 0)                   = 0
ioctl(56, KVM_RUN, 0)                   = 0
futex(0xb56184, FUTEX_WAIT_PRIVATE, 2147483664, NULL) = 0
write(22, "2019-12-11T01:35:00.658604112 [anonymous-instance:INFO] Received KVM_EXIT_SHUTDOWN signal\n", 90) = 90
write(58, "\1\0\0\0\0\0\0\0", 8)        = 8
close(58)                               = 0
close(56)                               = 0
+++ exited with 0 +++

so I think my previous statement that KVM_EXIT_SHUTDOWN and the seccomp violation always occur together actually isn’t true; very rarely, KVM_EXIT_SHUTDOWN will happen without the bad syscall.

Background

On reboot each vcpu will:

  • on x86_64:
    • write CMD_RESET_CPU to the i8042 which will generate a vmm-exit-event.
  • on both x86_64 and aarch64, on more recent host kernels (I need to crawl the linux changelog to see where exactly this was introduced):
    • eventually return from cpu emulation with kvm exit reason KVM_EXIT_SHUTDOWN which also generates a vcpu-exit-event which is interpreted as a vmm-exit-event.

On a vmm-exit-event the vmm will cleanup some things then exit firecracker with exit code 0.

Bug root cause

The problem described in this issue is that firecracker vcpu threads finish executing shortly after issuing the vcpu-exit-event. As part of cleaning up a finished thread, musl invokes sigprocmask, and rt_sigprocmask is not on firecracker's seccomp whitelist, so the call raises SIGSYS. Usually the vcpu thread's cleanup doesn't get that far, because the vmm thread quickly handles the vmm-exit-event and exits the process cleanly; but on a system under high load the vmm thread might be scheduled out for a while, and this bug in the firecracker code is exposed.

Proposed solution

The vcpu main loop should not end after issuing the vcpu-exit-event; instead it should move to the Paused state and wait for further commands from the Vmm. This gives the Vmm thread full control over what happens next, and also fixes the incorrect behavior reported in this issue.

KVM_EXIT_SHUTDOWN is not something we need to whitelist. It’s just a type of VcpuExit, and we handle it in the vcpu’s emulation loop.

The weird thing is this:

What’s kind of interesting is that I can also see some strace logs for different VMs showing fc_vcpu0 threads that apparently receive KVM_EXIT_SHUTDOWN but then don’t call rt_sigprocmask, allowing them to exit with 0:

rt_sigprocmask is called by libc, originating in pthread_exit. See https://github.com/firecracker-microvm/firecracker/issues/1088#issuecomment-539974241

After more digging, it looks like the rt_sigprocmask is inescapable. It’s also called from __block_all_sigs, which is in turn called by pthread_exit. Basically musl took every precaution and blocks signals whenever something signal-unsafe is underway.

So, when the vcpu threads exit, they should be calling sigprocmask and causing a SIGSYS.

@sipsma this looks like a bug on our side. We should gracefully handle KVM_EXIT_SHUTDOWN, thanks for the deep-dive and help on finding the root cause 👍

Do you need to send that command from the guest? Why not use the Ctrl+Alt+Del functionality instead?

We decided to go with just having the guest init system do a reboot because Ctrl+Alt+Del doesn’t work on all kernels and isn’t supported on aarch64 (which we’d like to support someday), so avoiding it keeps our dependencies to a minimum.