kubevirt: OOM crashes in v0.20.1

Is this a BUG REPORT or FEATURE REQUEST?: /kind bug

What happened:

Using 0.20.1 on Azure, all VMIs failed. The volumecontainerdisk container was hitting the OOM killer 100% of the time.
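
For anyone trying to confirm the same symptom, here is a minimal sketch of how the OOM kills could be picked out of a pod's status. It assumes the pod has been dumped as JSON (for example with `kubectl get pod <name> -o json`) and relies on the standard Kubernetes `containerStatuses` layout, where a killed container reports `reason: OOMKilled` under `state.terminated` or `lastState.terminated`. The function name and the sample data are hypothetical and are not output from the affected cluster.

```python
from typing import Any


def oom_killed_containers(pod: dict[str, Any]) -> list[str]:
    """Return names of containers whose current or last termination reason is OOMKilled."""
    killed = []
    status = pod.get("status") or {}
    statuses = (status.get("initContainerStatuses") or []) + (
        status.get("containerStatuses") or []
    )
    for cs in statuses:
        # Check both the current state and the previous one, since a
        # restarting container reports the kill under lastState.
        for key in ("state", "lastState"):
            terminated = (cs.get(key) or {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                killed.append(cs["name"])
                break
    return killed


# Hypothetical, trimmed pod status for illustration only.
sample_pod = {
    "status": {
        "containerStatuses": [
            {"name": "compute", "state": {"running": {}}, "lastState": {}},
            {
                "name": "volumecontainerdisk",
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            },
        ]
    }
}

print(oom_killed_containers(sample_pod))
```

Running this against the real virt-launcher pod JSON instead of `sample_pod` should show which container is being killed.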

What you expected to happen:

VMIs to start!

How to reproduce it (as minimally and precisely as possible):

Take a vanilla Ubuntu 18.04 image and create a VMI with a minimal config. It will fail if you are running in Azure (AKS). Dockerfiles and VMI spec included below for reference.

Anything else we need to know?:

  • Problem is specific to Azure; not seen elsewhere. Not a clue why, frankly, though if we are just running out of memory there are various possibilities.

  • Problem is specific to v0.20.1; does not repro in v0.19.

  • Problem manifests with either scratch containers or those based on kubevirt/container-disk-v1alpha

  • If you create a VMI, then copy the pod spec and hack it to remove the resource limits, the OOM killer does not kick in; however, the compute pod doesn’t work either. This is not a supported thing to do, but it suggests that the underlying problem can be resolved with different limits for the volumecontainerdisk container.
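
The unsupported workaround in the last bullet (copying the pod spec and removing the resource limits) can be sketched roughly as follows. This is an illustration only, assuming the pod has been exported as JSON and loaded into a dict. The function name is hypothetical and the limit values in the sample are made up; they are not the limits KubeVirt actually sets.

```python
import copy
from typing import Any


def strip_resource_limits(pod: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the pod with resources.limits removed from every container."""
    stripped = copy.deepcopy(pod)  # leave the caller's dict untouched
    spec = stripped.get("spec") or {}
    for key in ("initContainers", "containers"):
        for container in spec.get(key) or []:
            # Requests are kept; only the hard limits are dropped.
            (container.get("resources") or {}).pop("limits", None)
    return stripped


# Hypothetical pod fragment; the memory values are placeholders.
sample_pod = {
    "spec": {
        "containers": [
            {
                "name": "compute",
                "resources": {"requests": {"memory": "1024M"}, "limits": {"memory": "2G"}},
            },
            {
                "name": "volumecontainerdisk",
                "resources": {"limits": {"memory": "40M"}},
            },
        ]
    }
}

for c in strip_resource_limits(sample_pod)["spec"]["containers"]:
    print(c["name"], c["resources"])
```

As noted above, a pod edited this way is not something KubeVirt supports; the sketch is only meant to make the experiment reproducible.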

Environment:

  • KubeVirt version (use virtctl version): v0.20.1
  • Kubernetes version (use kubectl version): 1.13.9
  • VM or VMI specifications: See below
  • Cloud provider or hardware configuration: Azure AKS, using nodes with 16/32 cores and 64/132GB RAM
  • OS (e.g. from /etc/os-release): Ubuntu 16.04 as installed by AKS
  • Kernel (e.g. uname -a): Don’t know - getting access to the boxes is a pain, but could find out if it helps.
  • Install tools: AKS

Dockerfile (using completely vanilla downloaded ubuntu image)

FROM kubevirt/container-disk-v1alpha
# qcow2
ADD bionic-server-cloudimg-amd64.img /disk

Also tried this Dockerfile:

FROM scratch
# qcow2
ADD bionic-server-cloudimg-amd64.img /disk

VMI spec I used

apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstance
metadata:
  name: ubuntuvmi
  labels:
    billy: bob
spec:
  domain:
    resources:
      requests:
        memory: 1024M
        cpu: 2
    devices:
      disks:
      - name: containerdisk
        disk: {}
      - name: cloudinitdisk
        disk: {}
  volumes:
  - name: containerdisk
    containerDisk:
      image: metaswitchglobal.azurecr.io/plw/ubuntu:1804
  - name: cloudinitdisk
    cloudInitNoCloud:
      userData: |-
        #cloud-config
        password: ubuntu
        chpasswd: { expire: False }
        ssh_pwauth: True

I’ve discussed on slack with @slintes, @fabiand and @rmohr. Many thanks to all of you for your help.

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 16 (15 by maintainers)

Most upvoted comments

We saw this in 0.21.0. Was resolved after upgrading to 0.22.0.

Thanks - let me have a shot at that. I should get to that later in the week.