kubernetes: Many ConfigMaps and Pods slow down the cluster until it becomes unavailable (since 1.12)

What happened:

I schedule multiple jobs in my cluster. Each job uses a different ConfigMap which contains the configuration for that job.

This worked well on version 1.11 of Kubernetes. After upgrading to 1.12 or 1.13, I’ve noticed that doing this causes the cluster to slow down significantly, up to the point where nodes are marked NotReady and no new work is scheduled.

For example, consider a scenario in which I schedule 400 jobs on a single-node cluster, each with its own ConfigMap, where each job simply prints “Hello World”.

On v1.11, it takes about 10 minutes for the cluster to process all jobs, and new jobs can still be scheduled afterwards. On v1.12 and v1.13, it takes about 60 minutes to process all jobs, and after this no new jobs can be scheduled.

What you expected to happen:

I did not expect this scenario to make my nodes unavailable in Kubernetes 1.12 and 1.13; I would have expected the behavior I observe in 1.11.

How to reproduce it (as minimally and precisely as possible):

The easiest way seems to be to schedule, on a single-node cluster, about 300 jobs:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: job-%JOB_ID%
data:
  # Just some sample data
  game.properties: |
    enemies=aliens
---
apiVersion: batch/v1
kind: Job
metadata:
  name: job-%JOB_ID%
spec:
  template:
    spec:
      containers:
      - name: busybox
        image: busybox
        command: [ "/bin/echo" ]
        args: [ "Hello, World!" ]
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
      volumes:
        - name: config-volume
          configMap:
            name: job-%JOB_ID%
      restartPolicy: Never
  backoffLimit: 4

I can consistently reproduce this issue in a VM-based environment, which I configure using Vagrant. You can find the full setup here: https://github.com/qmfrederik/k8s-job-repro

Anything else we need to know?:

Happy to provide further information as needed.

Environment:

  • Kubernetes version (use kubectl version): v1.12 through v1.13
  • Cloud provider or hardware configuration: bare metal
  • OS (e.g. cat /etc/os-release): Ubuntu 18.04.1 LTS (Bionic Beaver)
  • Kernel (e.g. uname -a): Linux vagrant 4.15.0-29-generic #31-Ubuntu SMP Tue Jul 17 15:39:52 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: kubeadm
  • Others:

Most upvoted comments

to summarize:

There are two mitigations with current 1.12/1.13 versions:

  • start the apiserver with a higher --http2-max-streams-per-connection setting
  • start the kubelet with a config file that switches back to the pre-1.12 secret/configmap lookup method: configMapAndSecretChangeDetectionStrategy: "Cache"
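
For reference, here is a minimal sketch of what these two mitigations can look like. The kubeadm v1beta1 config format and the stream-limit value of 1000 are assumptions for illustration (kubeadm.k8s.io/v1beta1 shipped with 1.13; on other setups the flag can be set directly on the kube-apiserver command line):

---
# Sketch only: API versions and values are illustrative, adjust to your setup.
# Raise the per-connection HTTP/2 stream limit on the kube-apiserver
# (when the flag is unset, golang's default of 250 streams applies).
apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
apiServer:
  extraArgs:
    http2-max-streams-per-connection: "1000"
---
# Switch the kubelet back to the pre-1.12 TTL-cache based lookup instead of
# one watch per referenced secret/configmap.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
configMapAndSecretChangeDetectionStrategy: Cache

On kubeadm-provisioned nodes the kubelet setting can typically be added to /var/lib/kubelet/config.yaml (followed by a kubelet restart); the apiserver flag can equally be set directly in the kube-apiserver static pod manifest.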

There are two actual bugs to be fixed:

  • kubelet not cleaning up watches for secrets/configmaps for terminated pods (fix in progress in #74730, can be backported to 1.12/1.13)
  • the golang http/2 behavior of not opening new connections once the stream limit is reached (fixed in go1.12, unclear yet whether it will be possible to rebuild patch releases of k8s 1.12/1.13 with a new go version)

Heh… I think I know where the problem is. We delete the reference to the pod (and this is what stops the watch) when we UnregisterPod: https://github.com/kubernetes/kubernetes/blob/3478647333c91689cf4c737012a60e6d70a661e7/pkg/kubelet/util/manager/cache_based_manager.go#L244

And this one is triggered only by pod deletion: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/pod/pod_manager.go#L207

The problem is that pods owned by Jobs are not deleted (they are eventually garbage-collected). So what happens is that, over time, you end up with many more pods effectively registered on that node (even though a lot of them are already in the “Succeeded” state).

So it seems there are two problems here:

  • one is that we should probably unregister a pod that is no longer running (and won’t be restarted)
  • the second (and more serious in my opinion) is why a new connection is not created when we approach the stream limit

Also:

yes, to clarify, I mean we can probably limit the WATCH connections for the kubelet (to under 250) at the server side to “make room” for the PATCH calls at the client side. Will this help the case?

@yue9944882 - this won’t help in general, because it may be valid to have more than 250 watches (if there are more than that many different secrets/configmaps). Why don’t we create a new connection when we approach the stream limit in a single one?