kubernetes: User can easily crash an entire cluster by creating pods with memory-backed EmptyDirs

What happened?

The following phenomena were observed:

  • Pods that mount a memory-backed EmptyDir with the default size (node allocatable) were created on a node, and the app in each Pod quickly filled its EmptyDir.
  • Since the EmptyDir is not counted as memory usage until it is actually consumed, multiple such Pods can be scheduled on the same node.
  • As a result, multiple tmpfs mounts, each with roughly the same capacity as the node’s memory, are created (see the check sketched after this list).
    • This leads to a memory-overcommitted situation.
  • When multiple EmptyDirs were filled quickly enough, K8s tried to delete Pods and EmptyDirs because of the excessive memory usage.
    • Pod status becomes OOMKilled
    • Other Pods (e.g. nginx) can no longer be scheduled
  • The tmpfs could not be deleted in time, so memory was exhausted and the node crashed.
    • If the number of replicas in the Deployment is set high enough, all schedulable nodes crash.
    • Pod status becomes ContainerStatusUnknown
    • Pods stay in Terminating and are never deleted
    • The memory-backed tmpfs on the node is never deleted
    • The node becomes NotReady and you can’t even log in with SSH
    • This could be a security issue…
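
As a rough check (a sketch only; the `memvol` name matches the repro Deployment further below), you can confirm from inside one of the Pods that the tmpfs capacity defaults to roughly the node's allocatable memory rather than anything the Pod requested:

```console
# Inspect the tmpfs from inside one of the memvol Pods; the reported size
# should be close to the node's memory, not a per-Pod limit.
$ kubectl exec deploy/memvol -- df -h /memvol

# On the worker node itself, each such Pod adds another tmpfs of that size
# under the kubelet's empty-dir volume directory.
$ findmnt -t tmpfs | grep empty-dir
```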

Operations and logs

  • kubelet log (deployment-memvol-replica-6.log)
    • 10:08 Worker VMs were restarted
    • Deployed the memvol Deployment with 6 replicas.
    • Two Pods were deployed on each worker.
    • The Pods stayed in the Running state for about 5 minutes, although they consumed most of the memory and CPU.
    • The nodes appeared to be NotReady, so I deleted the Deployment.
    • Stopped the worker VMs because the Pods remained in the Terminating state.
    • Started the worker VMs again.

What did you expect to happen?

  • Memory usage by EmptyDir should be strictly managed by k8s.
    • Memory overcommit should be something the cluster admin explicitly allows, not the default.
    • Or allow the total size of memory-backed EmptyDirs per worker node to be controlled.
    • For example, add a taint such as memory-overcommit-allowed or memory-backed-emptydir-allowed to the node.
  • The default size of a memory-backed EmptyDir should be a smaller value that is less likely to cause problems.
    • e.g. 50Mi.

I hope that cluster admins can prevent situations in which this kind of failure can so easily occur, whether by mistake or with malicious intent, so that they can safely provide RAM disks to users.

How can we reproduce it (as minimally and precisely as possible)?

Deploy the following Deployment with the number of replicas set to twice the number of nodes:

apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: memvol
  name: memvol
spec:
  replicas: 6
  selector:
    matchLabels:
      app: memvol
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: memvol
    spec:
      containers:
      - image: busybox
        name: busybox
        command: ["sh"]
        args:
        - "-c"
        - "sleep 5 && dd if=/dev/zero of=/memvol/dump bs=1M count=4000"
        volumeMounts:
        - name: memvol
          mountPath: /memvol
      volumes:
      - name: memvol
        emptyDir:
          medium: Memory
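
To reproduce, apply the manifest and watch the Pods and nodes (a sketch; `memvol.yaml` is just an assumed file name for the manifest above):

```console
# Apply the Deployment above, saved as memvol.yaml (assumed file name).
$ kubectl apply -f memvol.yaml

# Watch the Pods fill the tmpfs and the nodes eventually go NotReady.
$ kubectl get pods -o wide -w
$ kubectl get nodes -w
```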

Anything else we need to know?

The key point is how k8s manages memory usage, especially in a case like this where memory is overcommitted. In such cases, the kube-reserved and system-reserved settings are not very effective.
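
For reference, this is the kind of kubelet reservation that would normally be expected to protect the node; the values below are illustrative assumptions, not the exact settings of this cluster. Even with such settings, the memory-backed EmptyDirs can outgrow the reserved headroom before eviction reacts.

```yaml
# Illustrative KubeletConfiguration snippet (values are assumptions).
# kubeReserved/systemReserved carve memory out of node allocatable, but a
# memory-backed EmptyDir can still grow to roughly the full allocatable size
# before eviction catches up.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:
  memory: 256Mi
systemReserved:
  memory: 256Mi
evictionHard:
  memory.available: "200Mi"
```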

Current workaround and its problems

  • Limiting per-namespace memory usage with a ResourceQuota seems to meet the requirements of a cluster admin (a sketch of such a quota follows this list). However, I think it would be difficult to get users to accept this, as it would require them to add memory requests/limits to all of their manifests.
  • I understand that this can be achieved by adding your own admission webhook, but I think it would be better if K8s supported it natively.
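
A minimal sketch of that workaround, assuming a per-team namespace (the namespace name and values are illustrative): the quota caps the namespace's total memory, but it only works if every Pod declares memory requests/limits, which is exactly the adoption problem mentioned above.

```yaml
# Illustrative ResourceQuota (namespace and values are assumptions).
# Once this quota exists, Pods in the namespace must declare memory
# requests/limits or they will be rejected.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: mem-quota
  namespace: team-a
spec:
  hard:
    requests.memory: 2Gi
    limits.memory: 2Gi
```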

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.4", GitCommit:"fa3d7990104d7c1f16943a67f11b154b71f6a132", GitTreeState:"clean", BuildDate:"2023-07-19T12:20:54Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.4", GitCommit:"fa3d7990104d7c1f16943a67f11b154b71f6a132", GitTreeState:"clean", BuildDate:"2023-07-19T12:14:49Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

$ libvirtd --version
libvirtd (libvirt) 8.0.0

control plane * 1: 2 CPU, 4 GB memory, 25 GB disk
worker * 3: 2 CPU, 4 GB memory, 25 GB disk

OS version

# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.2 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.2 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
$ uname -a
Linux master01 5.15.0-76-generic #83-Ubuntu SMP Thu Jun 15 19:16:32 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux



Install tools

# kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.4", GitCommit:"fa3d7990104d7c1f16943a67f11b154b71f6a132", GitTreeState:"clean", BuildDate:"2023-07-19T12:19:40Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"linux/amd64"}

Container runtime (CRI) and version (if applicable)

root@master01:~# crictl --version
crictl version 1.26.0
root@master01:~# containerd --version
containerd github.com/containerd/containerd v1.7.2 0cae528dd6cb557f7201036e9f43420650207b58

Related plugins (CNI, CSI, …) and versions (if applicable)


Most upvoted comments

You can specify a sizeLimit on the volume, see https://kubernetes.io/docs/concepts/storage/volumes/#emptydir-configuration-example which will limit the amount of memory that is consumed.
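
Applied to the repro Deployment's volume, that suggestion would look like the fragment below (the 100Mi value is just an example); the sizeLimit bounds how much memory the volume can consume instead of defaulting to the node's allocatable memory.

```yaml
# Sketch of the suggested mitigation in the repro Deployment's volumes section
# (the 100Mi value is an example, not a recommendation).
volumes:
- name: memvol
  emptyDir:
    medium: Memory
    sizeLimit: 100Mi
```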

Even without the limits I couldn’t exactly reproduce your scenario. In my case, the containers would get OOMKilled and then the pod would be evicted due to memory pressure.

Can you share the output of kubectl describe no?

@shu-mutou

The discussion here is fine, of course, but I think you can also try joining the SIG Storage meeting and get direct feedback from more folks. Or, if it’s a bad time zone for you, you can message them in #sig-storage on K8s Slack. https://github.com/kubernetes/community/tree/master/sig-storage

Can you supply kubelet logs for when this occurs?

/triage accepted /cc tzneal