gardenlinux: K8s Cluster Nodes with GL 318.4 getting stuck/NotReady
What happened:
A node running GL 318.4 often gets stuck (i.e. becomes NotReady) once a certain number of idling pods are running on it.
Nodes with GL 184.0.0 under the same load/number of pods do not exhibit this behaviour.
When a node becomes NotReady, we could not SSH into it either, and existing console sessions were stalling.
From the Kubernetes point of view we noticed various events, most prominently `Kubelet stopped posting node status`, but sometimes also events from the "kernel-monitor":
```
Events:
  Type     Reason    Age    From            Message
  ----     ------    ----   ----            -------
  Warning  TaskHung  5m23s  kernel-monitor  INFO: task kswapd0:49 blocked for more than 120 seconds.
  Warning  TaskHung  5m22s  kernel-monitor  INFO: task jbd2/sda4-8:363 blocked for more than 121 seconds.
  Warning  TaskHung  5m21s  kernel-monitor  INFO: task dockerd:6500 blocked for more than 121 seconds.
  Warning  TaskHung  5m19s  kernel-monitor  INFO: task dockerd:9030 blocked for more than 122 seconds.
  Warning  TaskHung  5m17s  kernel-monitor  INFO: task dockerd:9551 blocked for more than 122 seconds.
  Warning  TaskHung  5m15s  kernel-monitor  INFO: task containerd:7378 blocked for more than 122 seconds.
  Warning  TaskHung  5m12s  kernel-monitor  INFO: task containerd:8414 blocked for more than 123 seconds.
  Warning  TaskHung  5m9s   kernel-monitor  INFO: task kubelet:3979 blocked for more than 123 seconds.
  Warning  TaskHung  5m6s   kernel-monitor  INFO: task kubelet:4211 blocked for more than 124 seconds.
  Warning  TaskHung  5m5s   kernel-monitor  (combined from similar events): INFO: task kubelet:6670 blocked for more than 124 seconds.
```
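These TaskHung events come from the kernel's hung-task detector. A minimal sketch of how the same symptom can be inspected directly on a node while it is still reachable (the sysctl names assume a standard Linux kernel with the hung-task watchdog enabled; 120s is the default threshold):

```bash
# Messages emitted by the hung-task watchdog (same text as in the events above)
dmesg -T | grep -i "blocked for more than"

# Current hung-task detector settings (timeout in seconds, panic behaviour)
sysctl kernel.hung_task_timeout_secs kernel.hung_task_panic

# Processes stuck in uninterruptible sleep (state D), i.e. the tasks the watchdog reports
ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'
```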
What you expected to happen:
GL 318.4 to not perform significantly worse than GL 184
How to reproduce it (as minimally and precisely as possible):
- Create a Gardener cluster with two worker pools, one with `318.4` and one with `184`.
- Deploy a simple `Deployment` with anti-affinity:
deployment.yaml
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: some-load
spec:
  replicas: 1
  selector:
    matchLabels:
      name: load
  template:
    metadata:
      labels:
        name: load
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: "kubernetes.io/hostname"
          # preferredDuringSchedulingIgnoredDuringExecution:
          # - podAffinityTerm:
          #     topologyKey: "kubernetes.io/hostname"
          #   weight: 100
      containers:
      - name: load
        image: nicolaka/netshoot
        command:
        - bash
        args:
        - -c
        - |
          sleep 3153600000
        resources:
          limits:
            cpu: 1000m
            memory: 100Mi
          requests:
            cpu: 10m
            memory: 10Mi
      terminationGracePeriodSeconds: 1
```
- Scale the deployment; I could see the kubelet/node start flapping with about 70 pods on each worker pool (140 pods in total), e.g. (a sketch of how to watch for the flapping follows after this list):
  `k scale deployment some-load --replicas 140`
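A minimal sketch of how the flapping can be observed from the cluster side (plain kubectl, nothing Gardener-specific; `<node-name>` is a placeholder):

```bash
# Node conditions flipping between Ready and NotReady
kubectl get nodes -w

# Node-level events, e.g. "Kubelet stopped posting node status" or the TaskHung warnings
kubectl get events --all-namespaces --field-selector involvedObject.kind=Node

# Details and conditions of a single affected node
kubectl describe node <node-name>
```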
cc @vpnachev
About this issue
- State: closed
- Created 3 years ago
- Comments: 17 (17 by maintainers)
Commits related to this issue
- Address several runc issues and security updates Security vulnerabilities: - runc: https://security-tracker.debian.org/tracker/CVE-2021-30465 - bind9: https://security-tracker.debian.org/tracker/CVE-... — committed to gardenlinux/gardenlinux by marwinski 3 years ago
TL;DR - an image built from main (4c7a19c5eede7ebd196426b947721185c89e8d31, May 26 2021) works fine and the above problem is not observed - I managed to create a deployment with 200 replicas without any issues.
Some more details
We are not alone with this problem; similar difficulties are reported by others running recent versions of the OS/Docker/containerd/runc stack, ref:
Basically, the issue is rooted in runc and in how the PLEG uses `docker ps`, which is timing out (a quick on-node check is sketched below); I cannot explain it better than https://github.com/kubernetes/kubernetes/issues/45419#issuecomment-823885293.
Related containerd/runc issues:
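A quick way to confirm the PLEG/`docker ps` symptom on an affected node, sketched under the assumption that kubelet and Docker run as the usual systemd units:

```bash
# kubelet logs this explicitly once PLEG relisting falls behind
journalctl -u kubelet | grep -i "PLEG is not healthy"

# When the runtime is wedged, listing containers hangs or takes unusually long
time docker ps
time docker ps -a
```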
Looking into these package versions, we see that building from the main branch today provides upgrades for all of them, and a fix for the above issue is included as well.
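For comparison, the versions of the affected components can be read directly off a node; a sketch assuming the Debian-based Garden Linux package set:

```bash
# Installed container-runtime packages and their versions
dpkg -l | grep -E 'runc|containerd|docker'

# Or ask the binaries themselves
runc --version
containerd --version
dockerd --version
```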
Frankly, I see no reason to continue with 318.4, so unless we hit a wall with 421.0/the latest version, I would always strive for the latest and only fall back if that wall turns out to be too high.
I vote for building a new major version and for stopping investment in 318.X, precisely because a lot of packages have received updates that should include security and bug fixes. Also, 318 is based on a Debian testing state that is by now 100+ days old, which is quite outdated. There are currently some issues with the build, for example https://github.com/gardenlinux/gardenlinux/issues/253, but that is out of scope for this issue.
While educating myself on high- and low-level CRIs, I tested one more configuration option: containerd+runc. It works reliably, and I do not observe the issue when creating a deployment with 200 replicas.
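For completeness, a rough sketch of what running containerd as the CRI (instead of going through dockershim) looks like; the kubelet flag names are those of Kubernetes releases current at the time, and the socket path assumes a default containerd installation:

```bash
# Relevant kubelet flags to use containerd directly:
#   --container-runtime=remote
#   --container-runtime-endpoint=unix:///run/containerd/containerd.sock

# Verify which runtime each node reports (CONTAINER-RUNTIME column)
kubectl get nodes -o wide

# Talk to containerd's CRI endpoint directly from the node
crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps
```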