kubevirt: VMs cannot reboot when emulation is used

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

With useEmulation: true, VMs are unable to soft reboot.

What you expected to happen:

Rebooting may be slow under emulation, but the VM should eventually come back up.

How to reproduce it (as minimally and precisely as possible):

Deploy KubeVirt with:

---
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  certificateRotateStrategy: {}
  configuration:
    developerConfiguration:
      featureGates: []
      useEmulation: true
  customizeComponents: {}
  imagePullPolicy: IfNotPresent
  workloadUpdateStrategy: {}
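
If KubeVirt is already deployed, the same setting can probably be toggled with a merge patch instead of re-applying the whole resource. The following is only a sketch: it assumes the CR is named kubevirt in the kubevirt namespace, as in the manifest above, and the variable and function names are illustrative. On kubevirtci, ./cluster-up/kubectl.sh would take the place of kubectl.

```shell
# Merge patch that touches only the useEmulation field from the manifest above.
EMULATION_PATCH='{"spec":{"configuration":{"developerConfiguration":{"useEmulation":true}}}}'

# Hypothetical helper: apply the patch to the KubeVirt CR.
# The resource name and namespace are assumptions taken from the manifest above.
enable_emulation() {
  kubectl patch kubevirt kubevirt -n kubevirt --type merge -p "$EMULATION_PATCH"
}
```

Calling enable_emulation against a running cluster should leave the rest of the spec untouched.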

Create a cirros VM (although it is reproducible with Fedora too):

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  labels:
    kubevirt.io/vm: vm-cirros
  name: vm-cirros
spec:
  running: true
  template:
    metadata:
      labels:
        kubevirt.io/vm: vm-cirros
    spec:
      domain:
        devices:
          interfaces:
          - masquerade: {}
            name: default
          disks:
          - disk:
              bus: virtio
            name: containerdisk
          - disk:
              bus: virtio
            name: cloudinitdisk
        resources:
          requests:
            memory: 2Gi
      terminationGracePeriodSeconds: 0
      networks:
      - name: default
        pod: {}
      volumes:
      - containerDisk:
          image: quay.io/kubevirt/cirros-container-disk-demo:latest
        name: containerdisk
      - cloudInitNoCloud:
          userData: |
            #!/bin/sh
            echo 'printed from cloud-init userdata'
        name: cloudinitdisk

Connect using console and reboot:

virtctl console vm-cirros
reboot

The VM shuts down but never comes back up:

The system is going down NOW!
Sent SIGTERM to all processes
Sent SIGKILL to all processes
Requesting system reboot
[  215.325881] reboot: Restarting system
[  215.326396] reboot: machine restart
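
To script this check rather than watching the console by hand, one option is to scan captured console output for a kernel timestamp that resets to zero after the restart message. This is a rough sketch: the function name is hypothetical, and it assumes the console output has been saved to a file or pipe and that the guest kernel prints timestamps in the format shown above.

```shell
# Hypothetical helper: reads captured console output on stdin and succeeds
# only if a kernel timestamp of zero appears after "reboot: machine restart".
guest_rebooted() {
  awk '
    /reboot: machine restart/ { restarted = 1; next }   # guest asked for a restart
    restarted && /\[ +0\.0+\]/ { found = 1 }            # fresh boot: timestamp back at 0
    END { exit (found ? 0 : 1) }
  '
}
```

For example, `guest_rebooted < console.log` (with console.log being a hypothetical capture file) would return non-zero for the hung output above, since nothing follows the restart line.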

Environment:

  • KubeVirt version (use virtctl version):
    [kubevirt] virtctl version
    Client Version: version.Info{GitVersion:"v0.43.0", GitCommit:"7c7a2f4ace9ce3a88b164d4d282db55f08b6dc5e", GitTreeState:"clean", BuildDate:"2021-07-09T15:54:26Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{GitVersion:"v0.41.4-dirty", GitCommit:"0b3c97aed8051d2985b3ee2dee944ffa656822bc", GitTreeState:"dirty", BuildDate:"2021-10-20T07:29:44Z", GoVersion:"go1.13.14", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v0.21.0-beta.1", GitCommit:"96e95cef877ba04872b88e4e2597eabb0174d182", GitTreeState:"clean", BuildDate:"2021-09-10T13:09:35Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.17", GitCommit:"68b4e26caf6ede7af577db4af62fb405b4dd47e6", GitTreeState:"clean", BuildDate:"2021-03-18T00:54:02Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
    WARNING: version difference between client (0.21) and server (1.18) exceeds the supported minor version skew of +/-1
  • VM or VMI specifications: In description above
  • Others: using kubevirtci on KubeVirt’s release-0.41

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments: 21 (18 by maintainers)

Most upvoted comments

I did some testing, and it looks like rebooting emulated VMs last worked with KUBEVIRT_VERSION=v0.35.0, which used QEMU 4.2.0 (qemu-kvm-4.2.0-27.fc31). It stopped working in v0.36.0, which updated QEMU to 5.1.0 (qemu-kvm-5.1.0-16.fc32). Even when it worked in v0.35.0, the guest still hung for about 35 seconds before continuing:

$ ./cluster-up/virtctl.sh console vm-cirros
selecting docker as container runtime
Successfully connected to vm-cirros console. The escape sequence is ^]

login as 'cirros' user. default password: 'gocubsgo'. use 'sudo' for root.
vm-cirros login: cirros
Password: 
$ sudo reboot
The system is going down NOW!
Sent SIGTERM to all processes
Sent SIGKILL to all processes
Requesting system reboot
[   81.296397] reboot: Restarting system
[   81.296677] reboot: machine restart    <-------------35 second pause here before restarting
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Initializing cgroup subsys cpuacct

After upgrading to v0.36 and using virtctl to stop/start the VMI (to recreate the pod), the reboot command stops working and results in the qemu-kvm process pegging a CPU core on the node. Even trying to use virsh commands inside the compute container causes the same result:

# Connect to VM console to see output
# then run virsh commands from inside the compute pod
$ ./cluster-up/kubectl.sh get pods -A -l kubevirt.io=virt-launcher,kubevirt.io/vm=vm-cirros -o name
virt-launcher-vm-cirros-r4xkr 
$ ./cluster-up/kubectl.sh exec -it virt-launcher-vm-cirros-r4xkr -- /bin/bash
selecting docker as container runtime
Defaulting container name to compute.
Use 'kubectl describe pod/virt-launcher-vm-cirros-r4xkr -n default' to see all of the containers in this pod.
[root@vm-cirros /]# virsh list
 Id   Name                State
-----------------------------------
 1    default_vm-cirros   running

[root@vm-cirros /]# virsh reboot default_vm-cirros
Domain default_vm-cirros is being rebooted

# On the console you should see:
login as 'cirros' user. default password: 'gocubsgo'. use 'sudo' for root.
The system is going down NOW!
Sent SIGTERM to all processes
Sent SIGKILL to all processes
Requesting system poweroff
[  156.661090] reboot: Power down      <-- Not sure why this doesn't say restart, but regardless it hangs here

When checking QEMU issues I did see one that looked somewhat similar (reboot loop for emulated q35 machines), so I tried to reproduce the issue outside of kubevirt:

# Debian 11 testing system
qemu-system-x86_64 -h
QEMU emulator version 6.1.0 (Debian 1:6.1+dfsg-8+build1)

cd /tmp
wget https://download.cirros-cloud.net/0.4.0/cirros-0.4.0-x86_64-disk.img
sudo virt-install --connect=qemu:///system --name=cirros-emu --ram=512 --vcpus=1 \
  --disk path=cirros-0.4.0-x86_64-disk.img,format=qcow2 --import --graphics none --nonetworks \
  --virt-type=qemu --machine q35

# Cleanup when finished
virsh destroy cirros-emu
virsh undefine cirros-emu

This reproduction exhibits the same behavior when rebooting the VM (and shows a similar increase in QEMU CPU usage). If you omit the --virt-type=qemu argument, the VM reboots with no delay (using KVM instead of emulation). If you omit only the --machine q35 argument, virt-install creates a QEMU process with -machine pc-i440fx-6.1 instead of -machine pc-q35-6.1, and the reboot also works as expected (although with a 65 second delay instead of 35 seconds).
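
The combinations described here can be laid out side by side. The sketch below only prints the virt-install command for each combination, so each one can be reviewed and run manually. The helper name and the generated domain names are hypothetical, and passing --virt-type=kvm explicitly is an assumption standing in for omitting the flag as described above.

```shell
# Hypothetical helper: print (not run) the virt-install command for one combination.
# $1 = virt type (qemu or kvm), $2 = machine type (q35, or empty for the default).
virt_install_cmd() {
  virt=$1
  machine=${2:-}
  name="cirros-$virt-${machine:-default}"   # illustrative domain name
  cmd="virt-install --connect=qemu:///system --name=$name --ram=512 --vcpus=1"
  cmd="$cmd --disk path=cirros-0.4.0-x86_64-disk.img,format=qcow2 --import --graphics none --nonetworks"
  cmd="$cmd --virt-type=$virt"
  if [ -n "$machine" ]; then
    cmd="$cmd --machine $machine"
  fi
  printf '%s\n' "$cmd"
}

# Print all four combinations; nothing is executed.
for virt in qemu kvm; do
  for machine in q35 ""; do
    virt_install_cmd "$virt" "$machine"
  done
done
```

Going by the results above, only the qemu plus q35 combination is expected to hang on reboot. Since all variants point at the same disk image, they would need to be run one at a time.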

So this looks like it could be a long-lived regression in QEMU emulation. Using a different machine type would be a workaround, but it looks like KubeVirt only allows q35.