kubernetes: Ephemeral storage doesn't account for deleted files with open handles

What happened:

A pod that creates large files in ephemeral storage (either in an explicit emptyDir volume or by writing to the container’s filesystem) and then deletes them while keeping a file handle open causes serious stability issues on the node, and makes it very hard for an operator to identify that pod as the root cause. The space is not freed until the last handle is closed, but from kubelet’s perspective the pod’s ephemeral-storage usage is close to zero, so kubelet evicts other pods first while the offending pod keeps consuming more and more disk space. None of the reported metrics (e.g. cAdvisor’s container_fs_usage_bytes) reflect this pod’s usage correctly, so the only way to find it is to run lsof -a +L1 manually on the node. This also prevents ephemeral-storage limits from working correctly, so cluster operators cannot use them to protect nodes from faulty applications.
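For operators who want to script the manual check, the following is a minimal sketch of roughly what lsof +L1 reports for a single process, assuming a Linux node with /proc mounted. The function name deleted_open_files and its proc_root parameter are hypothetical, chosen for illustration; it is not part of kubelet or cAdvisor, and it may also list unlinked files that do not live on the node’s disk (for example on tmpfs).

```python
import os
import stat

def deleted_open_files(pid, proc_root="/proc"):
    """Return (fd, target, size_bytes) for regular files that `pid` still
    holds open although their last directory entry has been removed."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    found = []
    try:
        fds = os.listdir(fd_dir)
    except OSError:
        return found  # process has exited, or we lack permission
    for fd in fds:
        link = os.path.join(fd_dir, fd)
        try:
            info = os.stat(link)        # follows the link to the open file itself
            target = os.readlink(link)  # the kernel appends " (deleted)" to unlinked paths
        except OSError:
            continue  # descriptor was closed while we were scanning
        # A link count of zero means no directory entry refers to the file any more.
        if stat.S_ISREG(info.st_mode) and info.st_nlink == 0:
            found.append((int(fd), target, info.st_size))
    return found
```

Calling this for every numeric directory under /proc (as root) and summing the sizes per process gives an estimate of the space that directory-based accounting cannot see.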

What you expected to happen:

A pod’s filesystem usage is accounted for correctly, even if disk space is used by deleted files. When the node runs out of disk space, the offending pod is evicted.

How to reproduce it (as minimally and precisely as possible):

Deploy a pod that creates large files and deletes them while keeping the handles open, and that writes no logs or other data. Wait until the node runs out of disk space. Observe that kubelet tries to evict all other pods instead (since they consume non-zero space for their logs), then gives up, and the node dies completely.
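The core of the reproduction can be sketched in a few lines. This is an illustrative example, not the workload from the original report: the helper names hold_deleted_file and directory_usage are hypothetical, and it assumes a POSIX filesystem where an open file may be unlinked. It shows that a du-style walk of the directory sees nothing while the open handle still pins the data.

```python
import os
import tempfile

def directory_usage(path):
    """Sum the sizes of all files reachable under `path`, roughly what a
    du-style walk of a volume would see."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.lstat(os.path.join(root, name)).st_size
    return total

def hold_deleted_file(directory, size_bytes):
    """Write `size_bytes` to a file in `directory`, unlink it, and return the
    still-open handle. The space stays allocated until the handle is closed."""
    f = open(os.path.join(directory, "big.bin"), "wb")
    f.write(b"\0" * size_bytes)
    f.flush()
    os.unlink(f.name)
    return f

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        handle = hold_deleted_file(scratch, 8 * 1024 * 1024)
        info = os.fstat(handle.fileno())
        print("visible to a directory walk:", directory_usage(scratch))
        print("held by the open handle:", info.st_size, "bytes, links:", info.st_nlink)
        handle.close()  # only now can the filesystem reclaim the space
```

Running the same pattern in a loop inside a container, without ever closing the handles, would be one way to fill the node’s disk while the pod’s reported usage stays near zero.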

Environment:

  • Kubernetes version (use kubectl version): 1.14.6
  • Cloud provider or hardware configuration: kubernetes-on-aws
  • OS (e.g: cat /etc/os-release): Ubuntu 18.04.3 LTS
  • Kernel (e.g. uname -a): 4.15.0-1048-aws

About this issue

  • State: open
  • Created 5 years ago
  • Reactions: 8
  • Comments: 33 (17 by maintainers)

Most upvoted comments

I created a PoC for this bug: https://github.com/Ottovsky/keep-open-deleted. I am surprised by how severe the impact on the node can be; the pods are actually evicted from it at random.

See https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/20180906-quotas-for-ephemeral-storage.md (which is currently in alpha). This uses XFS quotas (which also work on suitably enabled ext4 filesystems) to monitor storage for emptyDir volumes. I’d like to graduate this to beta, if for no other reason than this.

Extending this beyond emptyDir volumes will probably require similar changes to cAdvisor.