kubernetes: Kubelet memory leak and "top nodes" showing UNKNOWN, "crictl stats" stuck

What happened?

Kubelet memory keeps increasing until the worker is completely out of memory, as the OOM killer avoids killing kubelet due to its high priority. The worker eventually goes into the NotReady state and in many cases becomes completely unreachable because key processes get killed or there is not enough memory for basic operation (e.g. ssh).

Other observations:

  • kubectl top node fails to display metrics for the affected node, long before kubelet memory consumption grows to a high level

kubectl logs -n kube-system metrics-server-5d96bcd897-rh6w8 shows:

"Failed to scrape node" err="Get \"https://214.6.10.222:10250/stats/summary?only_cpu_and_memory=true\": context deadline exceeded" node="flex-sc-cdd-aat-3608-worker8"

  • Containerd memory consumption also increases, although to a lesser extent than kubelet's. The issue seems to be related to the interaction between containerd and kubelet

  • When kubelet is manually killed we see hundreds of thousands of hung goroutines logged in syslog, many of them involved in metrics requests made to kubelet. Containerd also logs many canceled requests from kubelet that had been hanging: level=error msg="collecting metrics for …" error="context canceled"

  • Killing/restarting kubelet clears the memory consumption but not the underlying issue, so memory starts growing again and metrics are still not visible. This points to containerd as the trigger, but kubelet should be able to handle this without taking down the worker node.

  • The issue is cleared when the containerd parent process is restarted

  • Container tasks are shown as UNKNOWN on the worker node: ctr --namespace k8s.io task ls

flex-sc-cdd-aat-3608-master1:~> kubectl top node
NAME                           CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
flex-sc-cdd-aat-3608-master1   1412m        35%    4729Mi          60%
flex-sc-cdd-aat-3608-master2   1419m        35%    4666Mi          59%
flex-sc-cdd-aat-3608-master3   1396m        34%    4759Mi          60%
flex-sc-cdd-aat-3608-worker1   2800m        17%    14503Mi         22%
flex-sc-cdd-aat-3608-worker2   1461m        9%     13438Mi         20%
flex-sc-cdd-aat-3608-worker3   1344m        8%     9975Mi          15%
flex-sc-cdd-aat-3608-worker4   954m         5%     13246Mi         20%
flex-sc-cdd-aat-3608-worker5   940m         5%     11475Mi         17%
flex-sc-cdd-aat-3608-worker6   980m         6%     14511Mi         22%
flex-sc-cdd-aat-3608-worker7   1962m        12%    15616Mi         24%
flex-sc-cdd-aat-3608-worker8   <unknown>                           <unknown>               <unknown>               <unknown>

The issue appears to be caused by the interaction between kubelet and containerd.

What did you expect to happen?

  1. kubectl top node should not show <unknown>
  2. crictl stats should not get stuck
  3. The node should not go OOM; avoiding that should not require restarting kubelet/containerd

How can we reproduce it (as minimally and precisely as possible)?

The issue is not easily reproducible; it happens only occasionally, at random.

Anything else we need to know?

There are two separate issues here: one with kubelet stats requests/collection, and a second with containerd getting into a state where a phantom container is left behind, which blocks certain operations.

The trigger seems to be containerd failing to stop a container of a pod running DocumentDB (Postgres) as part of the user's application. The container is actually gone, but a reference to it still exists in containerd's memory. I suspect a timeout issue related to something odd this pod is doing.

Containerd has two phantom containers whose tasks show PID 0 and status UNKNOWN:

orch-do-eoom3-7123-worker14:~ # ctr --namespace k8s.io task ls
TASK                                                                PID       STATUS
079cb9d4d798da82e8e2525fedb6f3a9413de1c191a6f9c87e46a31f25f38850    0         UNKNOWN
bd939da92cbc0b0eb56247817a75bc5f3082fd56628799b350e808ab9ec798a2    0         UNKNOWN

crictl shows the containers as running:

orch-do-eoom3-7123-worker14:~ # crictl ps | grep -E '(079cb9d4d7|bd939da92c)'
bd939da92cbc0       6f0ed8a6419fb       7 weeks ago         Running             eric-fh-alarm-handler-db-pg    0                   dc4f2fd58f4c8
079cb9d4d798d       6f0ed8a6419fb       7 weeks ago         Running             eric-sec-access-mgmt-db-pg     0                   fb3c5f672cbda

But if we check the PIDs, the processes are not actually running:

orch-do-eoom3-7123-worker14:~ # crictl inspect bd939da92cbc0b0eb56247817a75bc5f3082fd56628799b350e808ab9ec798a2  | grep pid
    "pid": 83055,

orch-do-eoom3-7123-worker14:~ # crictl inspect 079cb9d4d798da82e8e2525fedb6f3a9413de1c191a6f9c87e46a31f25f38850 | grep pid
    "pid": 82218,

orch-do-eoom3-7123-worker14:~ # ps -up 83055
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
orch-do-eoom3-7123-worker14:~ # ps -up 82218
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
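As a side note, the same cross-check can be scripted. The sketch below is only illustrative and makes several assumptions (containerd 1.x Go client at github.com/containerd/containerd, socket at /run/containerd/containerd.sock, the k8s.io namespace): it walks containerd's containers and flags any task whose status is UNKNOWN or whose reported PID no longer exists under /proc.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	// Assumed defaults: adjust the socket path and namespace for your setup.
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	ctx := namespaces.WithNamespace(context.Background(), "k8s.io")
	containers, err := client.Containers(ctx)
	if err != nil {
		log.Fatal(err)
	}

	for _, c := range containers {
		task, err := c.Task(ctx, nil)
		if err != nil {
			continue // container has no task at all
		}
		status, err := task.Status(ctx)
		if err != nil || status.Status == containerd.Unknown {
			fmt.Printf("PHANTOM? %s status=%v err=%v\n", c.ID(), status.Status, err)
			continue
		}
		// Task claims to be alive: verify the PID actually exists on the host.
		if _, statErr := os.Stat(fmt.Sprintf("/proc/%d", task.Pid())); os.IsNotExist(statErr) {
			fmt.Printf("PHANTOM? %s pid=%d not present in /proc\n", c.ID(), task.Pid())
		}
	}
}
```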

The crictl stats command hangs forever:

orch-do-eoom3-7123-worker14:~ # crictl stats
^C

The Kubernetes metrics-server pod in the kube-system namespace periodically (every 15s) queries the kubelet on every node to get memory and CPU stats. Kubelet in turn depends on containerd to get container stats. When containerd is in this state, kubelet keeps spawning a goroutine for each incoming request, which increases memory consumption indefinitely (a sketch of this pattern follows the log excerpt below):

orch-do-eoom3-7123-master1:~> kubectl logs -n kube-system metrics-server-5d96bcd897-rh6w8 | less
E1211 23:43:21.904029       1 scraper.go:139] "Failed to scrape node" err="Get \"https://214.6.10.214:10250/stats/summary?only_cpu_and_memory=true\": context deadline exceeded" node="orch-do-eoom3-7123-worker14"
E1211 23:43:36.905091       1 scraper.go:139] "Failed to scrape node" err="Get \"https://214.6.10.214:10250/stats/summary?only_cpu_and_memory=true\": context deadline exceeded" node="orch-do-eoom3-7123-worker14"
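To illustrate the pattern (this is not kubelet's actual code, just a minimal sketch in which the hypothetical fetchStats stands in for the kubelet→containerd stats call): if each HTTP request blocks on a backend call that has no timeout and ignores cancellation, every scrape that times out on the client side still leaves one goroutine, and everything it references, stuck in the server forever.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"runtime"
	"time"
)

// fetchStats is a hypothetical stand-in for kubelet's call into the container
// runtime. It blocks until ctx is cancelled, simulating a containerd that
// never answers a stats request.
func fetchStats(ctx context.Context) ([]byte, error) {
	<-ctx.Done() // with context.Background() this blocks forever
	return nil, ctx.Err()
}

func main() {
	// Leaky pattern: the runtime call is made with a background context, so
	// the handler goroutine stays blocked even after metrics-server hits its
	// own deadline and gives up. Every 15s scrape strands one more goroutine.
	http.HandleFunc("/stats/summary", func(w http.ResponseWriter, r *http.Request) {
		stats, err := fetchStats(context.Background())
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.Write(stats)
	})

	// Bounded pattern: tie the runtime call to the request context plus an
	// explicit timeout, so hung calls are abandoned and the goroutine exits.
	http.HandleFunc("/stats/summary/bounded", func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 10*time.Second)
		defer cancel()
		stats, err := fetchStats(ctx)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.Write(stats)
	})

	// Watch the goroutine count grow as the leaky endpoint is scraped.
	http.HandleFunc("/debug/goroutines", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "goroutines: %d\n", runtime.NumGoroutine())
	})

	http.ListenAndServe(":8080", nil)
}
```

This also matches the observation above that restarting kubelet only resets the count: as long as containerd keeps hanging, each new scrape of the unbounded endpoint strands another goroutine.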

If we kill kubelet in this state we see hundreds of thousands of goroutines being closed (logged to syslog/journald); many of them were serving the stats requests above and had been waiting for many days.

orch-do-eoom2-7106-worker23:~ # grep -c  goroutine /var/log/messages
199464
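Goroutine dumps like this are standard Go runtime behaviour: when a Go process receives SIGQUIT (or crashes), the runtime writes goroutine stack traces to stderr before exiting, and systemd forwards them to syslog/journald. A standalone toy program, unrelated to kubelet's code, reproduces the same kind of dump:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	fmt.Println("pid:", os.Getpid())

	// Leak a few goroutines that block forever, mimicking hung stats requests.
	for i := 0; i < 5; i++ {
		go func() {
			block := make(chan struct{})
			<-block // blocks forever, like a request waiting on an unresponsive containerd
		}()
	}

	// Keep the process alive; now run `kill -QUIT <pid>` from another shell.
	// The Go runtime prints goroutine stack traces to stderr and exits, which
	// is what produces the huge dumps seen in syslog when kubelet is killed.
	time.Sleep(time.Hour)
}
```

Run it, then send `kill -QUIT <pid>` from another shell, and the blocked goroutines appear in the dump with their full stacks.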

A non-disruptive workaround is to send a SIGTERM signal to the parent containerd process. Systemd restarts it without affecting the containers running on the worker node, and the restart clears the phantom containers from containerd's memory. The kubelet process does not need to be restarted: Go's garbage collector slowly brings kubelet memory consumption back down over about an hour. If the worker is in a critical state, kubelet can be restarted directly to recover memory quickly.

Find the parent containerd PID and terminate it gracefully; systemd will automatically restart the process:

orch-do-eoom3-7123-worker14:~> sudo systemctl status containerd | grep 'Main PID'
 Main PID: 49955 (containerd)
orch-do-eoom3-7123-worker14:~> sudo kill -SIGTERM 49955

We see that the kubelet goroutines serving the metrics-server requests are finally closed with an error. In this case there were 38,781 of them, which is likely where all the memory was being consumed:

orch-do-eoom3-7123-worker14:~> sudo tail /var/log/messages
2022-12-14T13:01:16.491189+00:00 orch-do-eoom3-7123-worker14 kubelet[63904]: E1214 14:01:11.856946   63904 handler.go:182] "HTTP InternalServerError serving" err="failed to list pod stats: failed to list all container stats: rpc error: code = Unavailable desc = error reading from server: EOF" request="/stats/summary"
2022-12-14T13:01:16.491203+00:00 orch-do-eoom3-7123-worker14 kubelet[63904]: E1214 14:01:11.859349   63904 log_metrics.go:66] "Failed to get pod stats" err="failed to list all container stats: rpc error: code = Unavailable desc = error reading from server: EOF"

orch-do-eoom3-7123-worker14:~> sudo grep -c "Failed to get pod stats" /var/log/messages
38781

Now that the goroutines have exited, the memory they held can be garbage-collected and released back to the OS over time.
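The gradual (rather than immediate) drop in kubelet's RSS is expected: once the goroutines exit, the heap they pinned becomes garbage, but the Go runtime returns freed pages to the OS lazily via its background scavenger. The sketch below (a standalone demo, not kubelet code) shows the difference between heap that has merely been collected and heap that has actually been handed back, using runtime.ReadMemStats and debug.FreeOSMemory:

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func main() {
	// Allocate ~512 MiB, then drop the reference so it becomes garbage.
	garbage := make([][]byte, 512)
	for i := range garbage {
		garbage[i] = make([]byte, 1<<20)
	}
	garbage = nil

	var m runtime.MemStats

	runtime.GC() // collect the garbage; pages are now idle but mostly still held
	runtime.ReadMemStats(&m)
	fmt.Printf("after GC:           HeapIdle=%d MiB, HeapReleased=%d MiB\n",
		m.HeapIdle>>20, m.HeapReleased>>20)

	// FreeOSMemory forces the runtime to hand freed pages back to the OS
	// immediately; left alone, the background scavenger does this gradually.
	debug.FreeOSMemory()
	runtime.ReadMemStats(&m)
	fmt.Printf("after FreeOSMemory: HeapIdle=%d MiB, HeapReleased=%d MiB\n",
		m.HeapIdle>>20, m.HeapReleased>>20)
}
```

In a running kubelet there is no external way to force this, so the scavenger's gradual release is what produces the roughly one-hour recovery mentioned above.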

Kubernetes version

<details>

kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.3", GitCommit:"c37ac32c86e6cc346a92f9aeb62a0620c84ae047", GitTreeState:"clean", BuildDate:"2022-10-17T04:35:01Z", GoVersion:"go1.19.2", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.3", GitCommit:"c37ac32c86e6cc346a92f9aeb62a0620c84ae047", GitTreeState:"clean", BuildDate:"2022-10-17T04:31:37Z", GoVersion:"go1.19.2", Compiler:"gc", Platform:"linux/amd64"}

</details>


### Cloud provider

<details>
Openstack
</details>


### OS version

_No response_

### Install tools

<details>

</details>


### Container runtime (CRI) and version (if applicable)

_No response_

### Related plugins (CNI, CSI, ...) and versions (if applicable)

_No response_
