gardenlinux: K8s Cluster Nodes with GL 318.4 getting stuck/NotReady

What happened: A node running GL 318.4 often gets stuck (i.e. becomes NotReady) once a certain number of idling pods are running on it. Nodes with GL 184.0.0 and the same load/number of pods do not exhibit this behaviour.

When a node became NotReady we could not SSH into it either, and existing console sessions stalled. From the Kubernetes point of view we noticed various events, most prominently Kubelet stopped posting node status, but sometimes also events from the “kernel-monitor” (a sketch for checking these reports directly on a node follows the event list):

Events:
  Type     Reason    Age    From            Message
  ----     ------    ----   ----            -------
  Warning  TaskHung  5m23s  kernel-monitor  INFO: task kswapd0:49 blocked for more than 120 seconds.
  Warning  TaskHung  5m22s  kernel-monitor  INFO: task jbd2/sda4-8:363 blocked for more than 121 seconds.
  Warning  TaskHung  5m21s  kernel-monitor  INFO: task dockerd:6500 blocked for more than 121 seconds.
  Warning  TaskHung  5m19s  kernel-monitor  INFO: task dockerd:9030 blocked for more than 122 seconds.
  Warning  TaskHung  5m17s  kernel-monitor  INFO: task dockerd:9551 blocked for more than 122 seconds.
  Warning  TaskHung  5m15s  kernel-monitor  INFO: task containerd:7378 blocked for more than 122 seconds.
  Warning  TaskHung  5m12s  kernel-monitor  INFO: task containerd:8414 blocked for more than 123 seconds.
  Warning  TaskHung  5m9s   kernel-monitor  INFO: task kubelet:3979 blocked for more than 123 seconds.
  Warning  TaskHung  5m6s   kernel-monitor  INFO: task kubelet:4211 blocked for more than 124 seconds.
  Warning  TaskHung  5m5s   kernel-monitor  (combined from similar events): INFO: task kubelet:6670 blocked for more than 124 seconds.
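
For completeness, a minimal sketch of how these hung-task reports can be checked directly on a node, assuming serial-console or early SSH access before the node wedges completely (standard Linux paths/tools, nothing GL-specific is implied):

journalctl -k | grep "blocked for more than"     # same messages kernel-monitor surfaces as TaskHung events
cat /proc/sys/kernel/hung_task_timeout_secs      # typically 120, matching the "more than 120 seconds" above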

What you expected to happen: GL 318.4 should not perform significantly worse than GL 184.

How to reproduce it (as minimally and precisely as possible):

  • Create a gardener cluster with two worker pools, one with 318.4 and one with 184.
  • Deploy a simple Deployment with anti-affinity:
deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: some-load
spec:
  replicas: 1
  selector:
    matchLabels:
      name: load
  template:
    metadata:
      labels:
        name: load
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: "kubernetes.io/hostname"
          # preferredDuringSchedulingIgnoredDuringExecution:
          # - podAffinityTerm:
          #     topologyKey: "kubernetes.io/hostname"
          #   weight: 100
      containers:
      - name: load
        image: nicolaka/netshoot
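        # the container only sleeps (~100 years), so each pod just idles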
        command:
        - bash
        args:
          - -c
          - |
            sleep 3153600000
        resources:
          limits:
            cpu: 1000m
            memory: 100Mi
          requests:
            cpu: 10m
            memory: 10Mi
      terminationGracePeriodSeconds: 1

  • Scale the deployment; I could see kubelet/node flapping with about 70 pods on each worker pool (140 pods in total), e.g. kubectl scale deployment some-load --replicas 140 (see the sketch below).
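
A minimal sketch of these reproduction steps, assuming kubectl points at the test cluster and the manifest above is saved as deployment.yaml:

kubectl apply -f deployment.yaml
kubectl scale deployment some-load --replicas 140
kubectl get nodes -w         # watch for nodes flapping between Ready and NotReady
kubectl get events -w        # watch for NodeNotReady / kernel-monitor events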

cc @vpnachev

Most upvoted comments

TL;DR: an image built from main (4c7a19c5eede7ebd196426b947721185c89e8d31, May 26 2021) works fine and the above problem is not observed; I managed to create a deployment with 200 replicas without any issues.

Some more details

We are not alone with this problem; similar difficulties have been reported by others running recent versions of the OS/Docker/containerd/runc stack, ref:

Basically, the issue is rooted in runc and in how the PLEG uses docker ps, which times out; I cannot explain it better than https://github.com/kubernetes/kubernetes/issues/45419#issuecomment-823885293. A rough way to observe the symptom is sketched below.
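
For reference, one way to see this on an affected node (assuming console access and that kubelet runs as a systemd unit, as in our setup):

time docker ps -a                                       # hangs or takes far longer than usual on an affected node
journalctl -u kubelet | grep -i "PLEG is not healthy"   # kubelet logs PLEG health when relisting falls behind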

Related containerd/runc issues:

Looking at these package versions, we see that building from the main branch today provides upgrades for all of them, and the fix for the above issue is included as well.

GL Version  Docker              Containerd      runC
184.0       19.03.13+dfsg1-2    1.4.1~ds1-1     1.0.0~rc92+dfsg1-5
318.4       20.10.2+dfsg1-2     1.4.3~ds1-1+b1  1.0.0~rc93+ds1-1
421.0       20.10.5+dfsg1-1+b1  1.4.4~ds1-2     1.0.0~rc93+ds1-5
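
To double-check which versions a given node actually runs, something along these lines should work on the node itself (the dpkg query is an assumption based on the Debian packaging; the individual version commands are standard):

dpkg -l | grep -E 'docker|containerd|runc'
docker version --format '{{.Server.Version}}'
containerd --version
runc --version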

Frankly, I see no reason to continue with 318.4, so unless we hit a wall with 421.0/the latest version, I would always aim for that one and go back only if that wall is too high.

I vote for building a new major version and stopping investment into 318.X, precisely because a lot of packages have received updates which should include security and bug fixes. Also, 318 is based on a Debian testing snapshot that is 100+ days old, which is quite dated. There are currently some issues with the build, for example https://github.com/gardenlinux/gardenlinux/issues/253, but that is out of scope for this issue.

While educating myself on high- and low-level container runtimes, I tested one more configuration option: containerd+runc, i.e. kubelet talking to containerd directly instead of going through Docker. It works reliably and I do not observe the issue when creating a deployment with 200 replicas. A sketch of the configuration follows.
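
Sketch of how the kubelet can be pointed at containerd directly for such a test; treat the exact flags and socket path as an illustration of the pre-1.24 CRI switch rather than the precise Gardener configuration used:

# kubelet flags: use the containerd CRI endpoint instead of dockershim
--container-runtime=remote
--container-runtime-endpoint=unix:///run/containerd/containerd.sock

# verify the runtime the kubelet talks to
crictl --runtime-endpoint unix:///run/containerd/containerd.sock info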