kubernetes: DaemonSet pods cannot be created quickly when nodes are added to the cluster, and kube-controller-manager OOM is triggered
What happened?
ref: https://github.com/kubernetes/kubernetes/issues/112319
[Question] Question 1: Our k8s cluster has 5k nodes. When we add nodes to the cluster, the DaemonSet pods for those nodes cannot be created quickly; it takes about 2 hours for a DaemonSet pod to be created.
[screenshot: DaemonSet pod startup time after adding a new node]
Question 2: Looking at the kube-controller-manager monitoring, memory usage reaches 900GB and triggers an OOM kill.
[screenshot: kube-controller-manager memory usage monitoring]
[Solution]
In a cluster with 5k nodes, we found that when the daemonset controller watches Node update events, the large volume of Node events leaves many entries queued in the NodeInformer's pendingNotifications ring buffer, so the Node EventHandler's Add/Update callbacks are processed slowly. This delays DaemonSet pod creation and causes the kube-controller-manager OOM issues.
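To make the failure mode concrete, here is a toy model (plain Go, not client-go code) of a producer outrunning a slow event handler: the pending buffer, like pendingNotifications, grows without bound, and memory grows with it.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var (
		mu      sync.Mutex
		pending []int // stand-in for the informer's pendingNotifications buffer
	)

	// Producer: Node update events arriving roughly every 1ms.
	go func() {
		for i := 0; ; i++ {
			mu.Lock()
			pending = append(pending, i)
			mu.Unlock()
			time.Sleep(time.Millisecond)
		}
	}()

	// Consumer: a "slow" event handler taking ~10ms per event.
	go func() {
		for {
			mu.Lock()
			if len(pending) > 0 {
				pending = pending[1:]
			}
			mu.Unlock()
			time.Sleep(10 * time.Millisecond)
		}
	}()

	// The backlog (and therefore memory) grows by roughly 900 events/second.
	for t := 1; t <= 5; t++ {
		time.Sleep(time.Second)
		mu.Lock()
		fmt.Printf("backlog after %ds: %d events\n", t, len(pending))
		mu.Unlock()
	}
}
```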
We optimized the shouldIgnoreNodeUpdate logic called from the updateNode handler that the daemonset controller registers on the NodeInformer, to reduce unnecessary Node event processing:
Original code:

```go
func shouldIgnoreNodeUpdate(oldNode, curNode v1.Node) bool {
	if !nodeInSameCondition(oldNode.Status.Conditions, curNode.Status.Conditions) {
		return false
	}
	oldNode.ResourceVersion = curNode.ResourceVersion
	oldNode.Status.Conditions = curNode.Status.Conditions
	return apiequality.Semantic.DeepEqual(oldNode, curNode)
}
```
Optimized code:

```diff
 func shouldIgnoreNodeUpdate(oldNode, curNode v1.Node) bool {
 	if !nodeInSameCondition(oldNode.Status.Conditions, curNode.Status.Conditions) {
 		return false
 	}
 	oldNode.ResourceVersion = curNode.ResourceVersion
 	oldNode.Status.Conditions = curNode.Status.Conditions
+	oldNode.Status.Images = curNode.Status.Images
+	oldNode.Status.NodeInfo = curNode.Status.NodeInfo
+	oldNode.Status.Capacity = curNode.Status.Capacity
+	oldNode.Status.Allocatable = curNode.Status.Allocatable
 	return apiequality.Semantic.DeepEqual(oldNode, curNode)
 }
```
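As a sanity check for the new exclusions, a test along these lines (hypothetical, written against the unexported function in pkg/controller/daemon) shows that an update touching only Status.Images is now ignored:

```go
package daemon

import (
	"testing"

	v1 "k8s.io/api/core/v1"
)

func TestShouldIgnoreImageOnlyNodeUpdate(t *testing.T) {
	oldNode := v1.Node{}
	oldNode.Name = "node-1"

	// Simulate a kubelet status update that only changes the image list.
	curNode := *oldNode.DeepCopy()
	curNode.ResourceVersion = "2"
	curNode.Status.Images = []v1.ContainerImage{{Names: []string{"nginx:latest"}}}

	if !shouldIgnoreNodeUpdate(oldNode, curNode) {
		t.Fatalf("expected an image-only Node update to be ignored")
	}
}
```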
```diff
 func (dsc *DaemonSetsController) updateNode(old, cur interface{}) {
 	oldNode := old.(*v1.Node)
 	curNode := cur.(*v1.Node)
+	oldNodeCopy := oldNode.DeepCopy()
+	curNodeCopy := curNode.DeepCopy()
-	if shouldIgnoreNodeUpdate(*oldNode, *curNode) {
+	if shouldIgnoreNodeUpdate(*oldNodeCopy, *curNodeCopy) {
 		return
 	}
 	dsList, err := dsc.dsLister.List(labels.Everything())
 	if err != nil {
 		klog.V(4).Infof("Error listing daemon sets: %v", err)
 		return
 	}
 	// TODO: it'd be nice to pass a hint with these enqueues, so that each ds would only examine the added node (unless it has other work to do, too).
 	for _, ds := range dsList {
 		_, oldShouldSchedule, oldShouldContinueRunning, err := dsc.nodeShouldRunDaemonPod(oldNode, ds)
 		if err != nil {
 			klog.Errorf("Error old node %s should run daemon %s/%s pod: %v", oldNode.Name, ds.Namespace, ds.Name, err)
 			continue
 		}
 		_, currentShouldSchedule, currentShouldContinueRunning, err := dsc.nodeShouldRunDaemonPod(curNode, ds)
 		if err != nil {
 			klog.Errorf("Error current node %s should run daemon %s/%s pod: %v", oldNode.Name, ds.Namespace, ds.Name, err)
 			continue
 		}
 		if (oldShouldSchedule != currentShouldSchedule) || (oldShouldContinueRunning != currentShouldContinueRunning) {
 			klog.V(4).Infof("enqueueing daemon set %s/%s for node %v.", ds.Namespace, ds.Name, curNode.Name)
 			dsc.enqueueDaemonSet(ds)
 		}
 	}
 }
```
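A note on the DeepCopy calls in the diff above: because shouldIgnoreNodeUpdate takes its parameters by value, Go already hands it a shallow copy, so reassigning top-level fields inside the function cannot corrupt the objects in the informer cache (a maintainer makes the same point near the end of this thread). The extra DeepCopy is defensive rather than strictly required. A toy program (not kubernetes code) demonstrating the pass-by-value behavior:

```go
package main

import "fmt"

type Status struct{ Conditions []string }

type Node struct {
	ResourceVersion string
	Status          Status
}

// mutate receives a shallow copy of the caller's struct, so reassigning
// its fields (including slice headers) does not affect the original.
func mutate(n Node) {
	n.ResourceVersion = "2"
	n.Status.Conditions = []string{"Ready"}
}

func main() {
	orig := Node{ResourceVersion: "1", Status: Status{Conditions: []string{"NotReady"}}}
	mutate(orig)
	fmt.Println(orig.ResourceVersion, orig.Status.Conditions) // 1 [NotReady] — unchanged
}
```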
[Effect]
After adding nodes to the cluster, DaemonSet pods are created quickly, and kube-controller-manager memory stays at around 20GB.
What did you expect to happen?
- When nodes are added to the k8s cluster, DaemonSet pods should be created quickly, because each new node needs its CNI daemonset pod.
- kube-controller-manager should use less memory, to prevent it from being frequently OOM-killed.
- We have optimized the shouldIgnoreNodeUpdate logic in the updateNode method of the daemonset controller's NodeInformer handler to reduce unnecessary Node event processing. Is this optimization logic acceptable?
How can we reproduce it (as minimally and precisely as possible)?
When the k8s cluster size reaches 5k nodes, add new nodes to the cluster.
Anything else we need to know?
No response
Kubernetes version
```
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"17+", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17+", Compiler:"gc", Platform:"linux/amd64"}
```
Cloud provider
OS version
```
# On Linux:
$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

$ uname -a
4.18.0-2.4.3.x86_64 #1 SMP Wed Apr 6 06:31:51 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
```
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, …) and versions (if applicable)
About this issue
- Original URL: https://github.com/kubernetes/kubernetes/issues/112319
- State: open
- Created 8 months ago
- Comments: 27 (27 by maintainers)
Definitely not, that would break most users of this library. If listeners want concurrency, they can make their add/update/delete handler add the item to a queue and process it asynchronously.
So this is one of the reasons people should not do heavy processing in the informer's EventHandlers: it blocks the DeltaFIFO queue, increases memory usage, and degrades performance.
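The standard way to follow that advice is the controller pattern client-go itself ships: keep the ResourceEventHandler callbacks down to computing a key and enqueueing it, and do all heavy reconciliation in worker goroutines that drain a workqueue. A minimal sketch (standard client-go packages; error handling trimmed):

```go
package nodecontroller

import (
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

func setupNodeController(cs kubernetes.Interface) (cache.SharedIndexInformer, workqueue.RateLimitingInterface) {
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	factory := informers.NewSharedInformerFactory(cs, 0)
	nodeInformer := factory.Core().V1().Nodes().Informer()

	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		// The callback only computes a key and enqueues it; it never
		// blocks, so pendingNotifications cannot back up behind it.
		UpdateFunc: func(old, cur interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(cur); err == nil {
				queue.Add(key)
			}
		},
	})
	return nodeInformer, queue
}

// runWorker drains the queue; the expensive reconciliation happens here,
// off the informer's event-delivery goroutine.
func runWorker(queue workqueue.RateLimitingInterface) {
	for {
		key, shutdown := queue.Get()
		if shutdown {
			return
		}
		// ... reconcile the object identified by key ...
		queue.Done(key)
	}
}
```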
I think that optimizing shouldIgnoreNodeUpdate to do a more targeted diff on the Node, instead of comparing the whole node object except Conditions and ResourceVersion, would be a quick win. If my reading of the code is correct, just adding or removing an annotation will make the handler process that node.
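For illustration, a "more targeted diff" could look something like the sketch below (hypothetical, not the merged fix; it assumes it lives in pkg/controller/daemon next to nodeInSameCondition): report an update as relevant only when a field that can change the outcome of nodeShouldRunDaemonPod differs.

```go
package daemon

import (
	v1 "k8s.io/api/core/v1"

	apiequality "k8s.io/apimachinery/pkg/api/equality"
)

// nodeUpdateRelevant is a hypothetical targeted predicate: instead of
// DeepEqual over the whole Node minus a few exclusions, compare only the
// fields that can affect whether a daemon pod should run on the node.
func nodeUpdateRelevant(oldNode, curNode *v1.Node) bool {
	return !apiequality.Semantic.DeepEqual(oldNode.Labels, curNode.Labels) ||
		!apiequality.Semantic.DeepEqual(oldNode.Spec.Taints, curNode.Spec.Taints) ||
		oldNode.Spec.Unschedulable != curNode.Spec.Unschedulable ||
		!apiequality.Semantic.DeepEqual(oldNode.Status.Allocatable, curNode.Status.Allocatable) ||
		!nodeInSameCondition(oldNode.Status.Conditions, curNode.Status.Conditions)
}
```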
/cc @wojtek-t for scalability
The cache state is not changed, because this code passes the Node by value (a shallow copy):
https://github.com/kubernetes/kubernetes/blob/94ec99d4c25e16b6f3c9239d9e124be9d45c161b/pkg/controller/daemon/daemon_controller.go#L698-L700
@liggitt