kubevirt: OOM crashes in v0.20.1
Is this a BUG REPORT or FEATURE REQUEST?: /kind bug
What happened:
Using 0.20.1 on Azure, all VMIs failed. The volumecontainerdisk container was hitting the OOM killer 100% of the time.
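To confirm which container the OOM killer is terminating, the pod status can be inspected. The following is a minimal sketch, assuming the pod JSON has been obtained with `kubectl get pod <name> -o json`; the `sample_pod` data is hypothetical and trimmed for illustration, not output captured from the affected cluster.

```python
import json


def oom_killed_containers(pod):
    """Return names of containers whose current or last state is
    'terminated' with reason 'OOMKilled'."""
    names = []
    status = pod.get("status") or {}
    for cs in status.get("containerStatuses") or []:
        for key in ("state", "lastState"):
            terminated = (cs.get(key) or {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                names.append(cs["name"])
                break  # report each container at most once
    return names


# Hypothetical, trimmed pod status (illustration only).
sample_pod = json.loads("""
{
  "status": {
    "containerStatuses": [
      {"name": "compute",
       "state": {"running": {}},
       "lastState": {}},
      {"name": "volumecontainerdisk",
       "state": {"waiting": {"reason": "CrashLoopBackOff"}},
       "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}
    ]
  }
}
""")

print(oom_killed_containers(sample_pod))
```

An exit code of 137 (128 + SIGKILL) alongside the `OOMKilled` reason is the usual signature of a container killed for exceeding its memory limit.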
What you expected to happen:
VMIs to start!
How to reproduce it (as minimally and precisely as possible):
Take a vanilla ubuntu 18.04 image, and create a VMI with a minimal config. It will fail if you are running in Azure (AKS). Dockerfiles and VMI spec included below for reference.
Anything else we need to know?:
- Problem is specific to Azure; not seen elsewhere. Not a clue why, frankly, though if we are just running out of memory there are various possibilities.
- Problem is specific to v0.20.1; does not repro in v0.19.
- Problem manifests with either scratch containers or those based on kubevirt/container-disk-v1alpha
- If you create a VMI, then copy and hack the pod spec to remove the resource limits, then the OOM killer does not kick in; however the compute pod doesn’t work either. This is not a supported thing to do but is suggestive that the underlying problem can be resolved with different limits for the volumecontainerimage pod.
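The "copy and hack the pod spec" experiment in the last bullet can be sketched roughly as follows. This is an illustrative, unsupported debugging aid, not a fix: the helper name `strip_limits`, the `new_name` parameter, and the sample pod are all hypothetical, and the input is assumed to be the dict form of `kubectl get pod <name> -o json`.

```python
import copy

# Metadata fields populated by the API server that do not belong in a
# manifest being submitted as a new object.
SERVER_SET_METADATA = ("uid", "resourceVersion", "creationTimestamp", "selfLink")


def strip_limits(pod, new_name):
    """Return a copy of `pod` renamed to `new_name`, with every container's
    resource limits removed. Requests are left untouched."""
    clone = copy.deepcopy(pod)
    clone.pop("status", None)  # status belongs to the running pod only
    meta = clone.setdefault("metadata", {})
    for field in SERVER_SET_METADATA:
        meta.pop(field, None)
    meta["name"] = new_name
    spec = clone.get("spec") or {}
    containers = (spec.get("containers") or []) + (spec.get("initContainers") or [])
    for container in containers:
        (container.get("resources") or {}).pop("limits", None)
    return clone


# Hypothetical, trimmed pod for illustration only.
original = {
    "metadata": {"name": "virt-launcher-example", "uid": "abc", "resourceVersion": "1"},
    "spec": {"containers": [
        {"name": "compute",
         "resources": {"requests": {"memory": "1G"}, "limits": {"memory": "1G"}}},
        {"name": "volumecontainerdisk",
         "resources": {"limits": {"memory": "40M"}}},
    ]},
    "status": {"phase": "Running"},
}

patched = strip_limits(original, "virt-launcher-example-nolimits")
```

The result could be written out with `json.dumps(patched)` and submitted with `kubectl create -f`. As noted above, a pod created this way avoided the OOM killer but the compute pod still did not work, so it is only useful for narrowing down the cause.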
Environment:
- KubeVirt version (use virtctl version): v0.20.1
- Kubernetes version (use kubectl version): 1.13.9
- VM or VMI specifications: See below
- Cloud provider or hardware configuration: Azure AKS, using nodes with 16/32 cores and 64/132GB RAM
- OS (e.g. from /etc/os-release): Ubuntu 16.04 as installed by AKS
- Kernel (e.g. uname -a): Don’t know - getting access to the boxes is a pain, but could find out if it helps.
- Install tools: AKS
Dockerfile (using completely vanilla downloaded ubuntu image)
FROM kubevirt/container-disk-v1alpha
# qcow2
ADD bionic-server-cloudimg-amd64.img /disk
Also tried this Dockerfile:
FROM scratch
# qcow2
ADD bionic-server-cloudimg-amd64.img /disk
VMI spec I used
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstance
metadata:
  name: ubuntuvmi
  labels:
    billy: bob
spec:
  domain:
    resources:
      requests:
        memory: 1024M
        cpu: 2
    devices:
      disks:
      - name: containerdisk
        disk: {}
      - name: cloudinitdisk
        disk: {}
  volumes:
  - name: containerdisk
    containerDisk:
      image: metaswitchglobal.azurecr.io/plw/ubuntu:1804
  - name: cloudinitdisk
    cloudInitNoCloud:
      userData: |-
        #cloud-config
        password: ubuntu
        chpasswd: { expire: False }
        ssh_pwauth: True
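A side note on reading the spec: the memory request uses the decimal suffix `M` rather than the binary `Mi`. The sketch below shows the difference in bytes. It handles only a small subset of Kubernetes quantity suffixes, the helper name `quantity_to_bytes` is hypothetical, and nothing here is claimed to be the cause of the OOM kills.

```python
from decimal import Decimal

# Subset of Kubernetes quantity suffixes (illustration only).
BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
DECIMAL_SUFFIXES = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def quantity_to_bytes(quantity):
    """Convert a memory quantity such as '1024M' or '1024Mi' to bytes."""
    for suffixes in (BINARY_SUFFIXES, DECIMAL_SUFFIXES):
        for suffix, factor in suffixes.items():
            if quantity.endswith(suffix):
                return int(Decimal(quantity[: -len(suffix)]) * factor)
    return int(Decimal(quantity))  # plain number of bytes


print(quantity_to_bytes("1024M"))   # 1024000000
print(quantity_to_bytes("1024Mi"))  # 1073741824
```

So `1024M` requests about 47 MiB less than `1024Mi` would.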
I’ve discussed on slack with @slintes, @fabiand and @rmohr. Many thanks to all of you for your help.
About this issue
- State: closed
- Created 5 years ago
- Comments: 16 (15 by maintainers)
We saw this in 0.21.0. Was resolved after upgrading to 0.22.0.
Thanks - let me have a shot at that. I should get to that later in the week.