kubernetes: Many ConfigMaps and Pods slow down cluster, until it becomes unavailable (since 1.12)
What happened:
I schedule multiple jobs in my cluster. Each job uses a different ConfigMap which contains the configuration for that job.
This worked well on Kubernetes 1.11. After upgrading to 1.12 or 1.13, I’ve noticed that doing this causes the cluster to slow down significantly, to the point where nodes are marked NotReady and no new work is scheduled.
For example, consider a scenario in which I schedule 400 jobs on a single-node cluster, each with its own ConfigMap and each simply printing “Hello World”.
On v1.11, the cluster processes all jobs in about 10 minutes, and new jobs can still be scheduled afterwards. On v1.12 and v1.13, processing all jobs takes about 60 minutes, and after that no new jobs can be scheduled.
What you expected to happen:
I did not expect this scenario to cause my nodes to become unavailable on Kubernetes 1.12 and 1.13; I would have expected the behavior I observe on 1.11.
How to reproduce it (as minimally and precisely as possible):
The easiest way seems to be to schedule, on a single-node cluster, about 300 jobs:
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: job-%JOB_ID%
data:
  # Just some sample data
  game.properties: |
    enemies=aliens
---
apiVersion: batch/v1
kind: Job
metadata:
  name: job-%JOB_ID%
spec:
  template:
    spec:
      containers:
      - name: busybox
        image: busybox
        command: [ "/bin/echo" ]
        args: [ "Hello, World!" ]
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
      volumes:
      - name: config-volume
        configMap:
          name: job-%JOB_ID%
      restartPolicy: Never
  backoffLimit: 4
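A minimal sketch for stamping out the jobs, assuming the manifest above is saved as job-template.yaml (a hypothetical filename), is to substitute the %JOB_ID% placeholder in a loop:

# Create ~300 jobs, each with its own ConfigMap, by filling in the placeholder
for i in $(seq 1 300); do
  sed "s/%JOB_ID%/${i}/g" job-template.yaml | kubectl apply -f -
done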
I can consistently reproduce this issue in a VM-based environment, which I configure using Vagrant. You can find the full setup here: https://github.com/qmfrederik/k8s-job-repro
Anything else we need to know?:
Happy to provide further information as needed
Environment:
- Kubernetes version (use kubectl version): v1.12 through v1.13
- Cloud provider or hardware configuration: bare metal
- OS (e.g. cat /etc/os-release): Ubuntu 18.04.1 LTS (Bionic Beaver)
- Kernel (e.g. uname -a): Linux vagrant 4.15.0-29-generic #31-Ubuntu SMP Tue Jul 17 15:39:52 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
- Install tools: kubeadm
- Others:
About this issue
- State: closed
- Created 5 years ago
- Comments: 38 (31 by maintainers)
Commits related to this issue
- kubelet: use cache configMap and secrets change strategy Watched based strategy has a couple bugs, 1) golang http2 max streams blocking when the stream limit is reached and 2) the kubelet not cleanin... — committed to rphillips/machine-config-operator by rphillips 5 years ago
- Fix application-apply of stx-openstack on simplex The application-apply of the stx-openstack application on simplex configurations has been failing since the barbican chart was added to the applicati... — committed to openstack-archive/stx-config by deleted user 5 years ago
- Update kubernetes config for 1.15 features. Upgrading from kubernetes 1.13.5 to 1.15.0 meant the config needed to be updated to handle whatever was deprecated or dropped in 1.14 and 1.15. 1) Removed... — committed to starlingx-staging/openstack-armada-app-test by albailey-wr 5 years ago
- Update kubernetes config for 1.15 features. Upgrading from kubernetes 1.13.5 to 1.15.0 meant the config needed to be updated to handle whatever was deprecated or dropped in 1.14 and 1.15. 1) Removed... — committed to starlingx-staging/platform-armada-app by albailey-wr 5 years ago
- Fix application-apply of stx-openstack on simplex The application-apply of the stx-openstack application on simplex configurations has been failing since the barbican chart was added to the applicati... — committed to starlingx-staging/puppet by deleted user 5 years ago
- Update kubernetes config for 1.15 features. Upgrading from kubernetes 1.13.5 to 1.15.0 meant the config needed to be updated to handle whatever was deprecated or dropped in 1.14 and 1.15. 1) Removed... — committed to starlingx-staging/puppet by albailey-wr 5 years ago
- Update kubernetes config for 1.15 features. Upgrading from kubernetes 1.13.5 to 1.15.0 meant the config needed to be updated to handle whatever was deprecated or dropped in 1.14 and 1.15. 1) Removed... — committed to starlingx-staging/stx-config by albailey-wr 5 years ago
To summarize:

There are two mitigations with the current 1.12/1.13 versions (a configuration sketch follows this list):
- the --http2-max-streams-per-connection setting (a kube-apiserver flag)
- configMapAndSecretChangeDetectionStrategy: "Cache" (a kubelet configuration field)
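A minimal sketch of what the two mitigations could look like on a kubeadm-based cluster (the kubeadm config API version and the stream-limit value of 1000 are illustrative assumptions, not taken from the thread):

# Raise the apiserver's HTTP/2 stream limit (illustrative value)
apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
apiServer:
  extraArgs:
    http2-max-streams-per-connection: "1000"
---
# Switch the kubelet back to the cache-based ConfigMap/Secret detection strategy
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
configMapAndSecretChangeDetectionStrategy: "Cache"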
There are two actual bugs to be fixed:
Heh… - I think I know where the problem is. The problem is that we delete the reference to the pod (and this is what stops the watch) when we UnregisterPod: https://github.com/kubernetes/kubernetes/blob/3478647333c91689cf4c737012a60e6d70a661e7/pkg/kubelet/util/manager/cache_based_manager.go#L244
And this one is triggered only by pod deletion: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/pod/pod_manager.go#L207
The problem is that pods owned by Jobs are not deleted (they are only eventually garbage-collected). So what happens is that, over time, you end up with many more pods effectively on that node (though a lot of them are already in the “Succeeded” state).
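As an aside (my own suggestion, not from the thread), the accumulation of completed pods can be observed with a field selector:

# Count pods that have finished but are still tracked by the cluster
kubectl get pods --field-selector=status.phase=Succeeded --no-headers | wc -l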
So it seems there are two problems here:
Also:
@yue9944882 - this won’t help in general, because it may be valid to have more than 250 connections (if there are more than that many different secrets/configmaps). Why don’t we create a new connection when we approach the limit of streams in a single one?