kubernetes: Removing terminated pod's Cgroup
For reference: this relates to work in #27204, which introduces pod-level cgroups into Kubernetes. As part of that work we create a cgroup for each pod, and all of the pod's containers are brought under its pod cgroup.
We need to clean up a pod's cgroup once the pod is terminated, but we need to be careful about how we delete it.
Some points to consider:
- Once the pod is terminated we can reduce its cgroup's CpuShares to ensure that it doesn't take up significant CPU resources.
- We cannot set the memory limit to 0, as there might be anonymous pages (tmpfs) that cannot be swapped out.
- Hence, for memory we need to ensure that all pages are reclaimed before deleting the pod's cgroups; otherwise they would simply be moved to the parent cgroup, which is highly undesirable and causes system memory pressure.
- We should delete the pod's cgroup only after the pod's volumes have been cleaned up, for the same reason as above.
- We need to ensure that no processes are attached to the cgroup. If any processes are found, they must be killed in order for the cgroup to be deletable.
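To make the last point concrete, here is a minimal Go sketch of killing any processes still attached to a cgroup by reading its `cgroup.procs` file (one PID per line). The helper names are invented for illustration; this is not the actual kubelet code:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
	"syscall"
)

// parsePids parses the contents of a cgroup.procs file, which lists
// one PID per line (blank lines are ignored).
func parsePids(contents string) ([]int, error) {
	var pids []int
	scanner := bufio.NewScanner(strings.NewReader(contents))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" {
			continue
		}
		pid, err := strconv.Atoi(line)
		if err != nil {
			return nil, fmt.Errorf("bad pid %q: %v", line, err)
		}
		pids = append(pids, pid)
	}
	return pids, nil
}

// killCgroupProcs sends SIGKILL to every process still attached to the
// cgroup at cgroupPath and reports whether any were found. A caller
// would retry until it returns (false, nil), i.e. the cgroup is empty.
func killCgroupProcs(cgroupPath string) (found bool, err error) {
	data, err := os.ReadFile(cgroupPath + "/cgroup.procs")
	if err != nil {
		return false, err
	}
	pids, err := parsePids(string(data))
	if err != nil {
		return false, err
	}
	for _, pid := range pids {
		// ESRCH means the process already exited, which is fine.
		if err := syscall.Kill(pid, syscall.SIGKILL); err != nil && err != syscall.ESRCH {
			return true, err
		}
	}
	return len(pids) > 0, nil
}
```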
Based on @vishh's suggestions, we came up with the following logic for destroying pod cgroups:
- Once the pod is terminated we reduce its cgroup's CpuShares to 2 (the lowest possible value).
- Once a pod's volumes have been successfully cleaned up, we spawn a separate goroutine to destroy the pod's now-unwanted cgroups.
- The goroutine is per pod cgroup and runs until the cgroups are successfully deleted.
- As part of the goroutine we attempt the following steps:
- SIGKILL all processes attached to the cgroup. We repeatedly attempt to kill the processes until we succeed; if still unsuccessful after 10 iterations we start logging failure messages and generating events.
- Use the cgroup's memory.force_empty interface to force the kernel to reclaim all pages.
- We repeatedly reduce the cgroup's memory usage until it falls below a small threshold (0.1 MB seems to be a reasonable value).
- In each iteration, if the current memory usage is x bytes, we first cap the cgroup's memory usage by writing x to memory.limit_in_bytes, then write 0 to memory.force_empty to reclaim further, and we repeat until the usage falls below the threshold.
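The reclaim loop in the last two steps could be sketched as below. The three callbacks stand in for cgroupfs reads and writes so the loop's logic is visible on its own; all names here are invented for illustration, not kubelet or libcontainer APIs:

```go
package main

import "fmt"

// Threshold below which we consider the cgroup's memory fully reclaimed
// (~0.1 MB, the value suggested above).
const reclaimThreshold = 100 * 1024

// reclaimMemory sketches the reclaim loop. The callbacks stand in for
// cgroupfs operations:
//   readUsage  - read memory.usage_in_bytes
//   setLimit   - write memory.limit_in_bytes
//   forceEmpty - write "0" to memory.force_empty
// Each iteration caps the limit at the current usage and asks the kernel
// to reclaim pages, until usage drops below the threshold or we give up.
func reclaimMemory(readUsage func() (int64, error), setLimit func(int64) error, forceEmpty func() error, maxIterations int) error {
	for i := 0; i < maxIterations; i++ {
		usage, err := readUsage()
		if err != nil {
			return err
		}
		if usage < reclaimThreshold {
			return nil // all pages reclaimed; safe to remove the cgroup
		}
		if err := setLimit(usage); err != nil {
			return err
		}
		if err := forceEmpty(); err != nil {
			return err
		}
	}
	return fmt.Errorf("memory usage still above threshold after %d iterations", maxIterations)
}
```

Passing closures keeps the loop testable without a real cgroup hierarchy; the real implementation would wire these to files under the pod's cgroup directory.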
Thanks to @vishh for the suggestions. This should provide a really safe method of deleting cgroups.
To implement this we would first need to add support for memory.force_empty in libcontainer, which doesn't currently expose it.
Once we finalize the logic here I will add this note to the proposal. @derekwaynecarr @Random-Liu @vishh PTAL. Please cc anyone else who can provide feedback.
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Comments: 23 (8 by maintainers)
@dubstack This is an important issue, as it makes cadvisor report tons of unnecessary information after a while (in my case the cadvisor report reached 2.6 MB!). It also makes housekeeping very costly. Can you please re-open this issue?
I took the liberty of editing the proposal directly. LGTM. Thanks for the write up!