kubernetes: DaemonSet pods cannot be created quickly when nodes are added to the cluster, and kube-controller-manager OOM is triggered
What happened?
ref: https://github.com/kubernetes/kubernetes/issues/112319
[Question] Question 1: Our k8s cluster has 5k nodes. When we add nodes to the cluster, the DaemonSet pods for those nodes cannot be created quickly; it takes about 2 hours for a DaemonSet pod to be created.
[screenshot: DaemonSet pod startup time after adding a new node]
Question 2: Looking at the kube-controller-manager monitoring, memory usage reaches 900GB and triggers an OOM kill.
[screenshot: kube-controller-manager memory usage monitoring]
[Solution]
In a cluster with 5k nodes, we found that when the daemonset controller watches Node update events, the large volume of Node events leaves many entries queued in the NodeInformer's pendingNotifications ring buffer, so the Node EventHandler's Add/Update callbacks are processed slowly. This delays DaemonSet pod creation and causes the kube-controller-manager OOM issues.
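To make the failure mode concrete, here is a toy model (plain Go, not client-go code) of a producer outrunning a slow event handler: the pending buffer, like pendingNotifications, grows without bound, and memory grows with it.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var (
		mu      sync.Mutex
		pending []int // stand-in for the informer's pendingNotifications buffer
	)

	// Producer: Node update events arriving roughly every 1ms.
	go func() {
		for i := 0; ; i++ {
			mu.Lock()
			pending = append(pending, i)
			mu.Unlock()
			time.Sleep(time.Millisecond)
		}
	}()

	// Consumer: a "slow" event handler taking ~10ms per event.
	go func() {
		for {
			mu.Lock()
			if len(pending) > 0 {
				pending = pending[1:]
			}
			mu.Unlock()
			time.Sleep(10 * time.Millisecond)
		}
	}()

	// The backlog (and therefore memory) grows by roughly 900 events/second.
	for t := 1; t <= 5; t++ {
		time.Sleep(time.Second)
		mu.Lock()
		fmt.Printf("backlog after %ds: %d events\n", t, len(pending))
		mu.Unlock()
	}
}
```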
We optimized the shouldIgnoreNodeUpdate logic called from the updateNode handler that the daemonset controller registers on the NodeInformer, to reduce unnecessary Node event processing:
Original code:

```go
func shouldIgnoreNodeUpdate(oldNode, curNode v1.Node) bool {
	if !nodeInSameCondition(oldNode.Status.Conditions, curNode.Status.Conditions) {
		return false
	}
	oldNode.ResourceVersion = curNode.ResourceVersion
	oldNode.Status.Conditions = curNode.Status.Conditions
	return apiequality.Semantic.DeepEqual(oldNode, curNode)
}
```
Optimized code:

```diff
 func shouldIgnoreNodeUpdate(oldNode, curNode v1.Node) bool {
 	if !nodeInSameCondition(oldNode.Status.Conditions, curNode.Status.Conditions) {
 		return false
 	}
 	oldNode.ResourceVersion = curNode.ResourceVersion
 	oldNode.Status.Conditions = curNode.Status.Conditions
+	oldNode.Status.Images = curNode.Status.Images
+	oldNode.Status.NodeInfo = curNode.Status.NodeInfo
+	oldNode.Status.Capacity = curNode.Status.Capacity
+	oldNode.Status.Allocatable = curNode.Status.Allocatable
 	return apiequality.Semantic.DeepEqual(oldNode, curNode)
 }
```
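As a sanity check for the new exclusions, a test along these lines (hypothetical, written against the unexported function in pkg/controller/daemon) shows that an update touching only Status.Images is now ignored:

```go
package daemon

import (
	"testing"

	v1 "k8s.io/api/core/v1"
)

func TestShouldIgnoreImageOnlyNodeUpdate(t *testing.T) {
	oldNode := v1.Node{}
	oldNode.Name = "node-1"

	// Simulate a kubelet status update that only changes the image list.
	curNode := *oldNode.DeepCopy()
	curNode.ResourceVersion = "2"
	curNode.Status.Images = []v1.ContainerImage{{Names: []string{"nginx:latest"}}}

	if !shouldIgnoreNodeUpdate(oldNode, curNode) {
		t.Fatalf("expected an image-only Node update to be ignored")
	}
}
```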
```diff
 func (dsc *DaemonSetsController) updateNode(old, cur interface{}) {
 	oldNode := old.(*v1.Node)
 	curNode := cur.(*v1.Node)
+	oldNodeCopy := oldNode.DeepCopy()
+	curNodeCopy := curNode.DeepCopy()
-	if shouldIgnoreNodeUpdate(*oldNode, *curNode) {
+	if shouldIgnoreNodeUpdate(*oldNodeCopy, *curNodeCopy) {
 		return
 	}
 	dsList, err := dsc.dsLister.List(labels.Everything())
 	if err != nil {
 		klog.V(4).Infof("Error listing daemon sets: %v", err)
 		return
 	}
 	// TODO: it'd be nice to pass a hint with these enqueues, so that each ds would only examine the added node (unless it has other work to do, too).
 	for _, ds := range dsList {
 		_, oldShouldSchedule, oldShouldContinueRunning, err := dsc.nodeShouldRunDaemonPod(oldNode, ds)
 		if err != nil {
 			klog.Errorf("Error old node %s should run daemon %s/%s pod: %v", oldNode.Name, ds.Namespace, ds.Name, err)
 			continue
 		}
 		_, currentShouldSchedule, currentShouldContinueRunning, err := dsc.nodeShouldRunDaemonPod(curNode, ds)
 		if err != nil {
 			klog.Errorf("Error current node %s should run daemon %s/%s pod: %v", oldNode.Name, ds.Namespace, ds.Name, err)
 			continue
 		}
 		if (oldShouldSchedule != currentShouldSchedule) || (oldShouldContinueRunning != currentShouldContinueRunning) {
 			klog.V(4).Infof("enqueueing daemon set %s/%s for node %v.", ds.Namespace, ds.Name, curNode.Name)
 			dsc.enqueueDaemonSet(ds)
 		}
 	}
 }
```
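A note on the DeepCopy calls in the diff above: because shouldIgnoreNodeUpdate takes its parameters by value, Go already hands it a shallow copy, so reassigning top-level fields inside the function cannot corrupt the objects in the informer cache (a maintainer makes the same point near the end of this thread). The extra DeepCopy is defensive rather than strictly required. A toy program (not kubernetes code) demonstrating the pass-by-value behavior:

```go
package main

import "fmt"

type Status struct{ Conditions []string }

type Node struct {
	ResourceVersion string
	Status          Status
}

// mutate receives a shallow copy of the caller's struct, so reassigning
// its fields (including slice headers) does not affect the original.
func mutate(n Node) {
	n.ResourceVersion = "2"
	n.Status.Conditions = []string{"Ready"}
}

func main() {
	orig := Node{ResourceVersion: "1", Status: Status{Conditions: []string{"NotReady"}}}
	mutate(orig)
	fmt.Println(orig.ResourceVersion, orig.Status.Conditions) // 1 [NotReady] — unchanged
}
```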
[Effect]
After adding nodes to the cluster, DaemonSet pods are created quickly, and kube-controller-manager memory stays at around 20GB.
What did you expect to happen?
- When nodes are added to the k8s cluster, DaemonSet pods should be created quickly, because each new node needs its CNI daemonset pod.
- kube-controller-manager should use less memory, to prevent it from being frequently OOM-killed.
- We have optimized the shouldIgnoreNodeUpdate logic in the updateNode method of the daemonset controller's NodeInformer handler to reduce unnecessary Node event processing. Is this optimization logic acceptable?
How can we reproduce it (as minimally and precisely as possible)?
When the k8s cluster size reaches 5k nodes, add new nodes to the cluster.
Anything else we need to know?
No response
Kubernetes version
```
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"17+", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17+", Compiler:"gc", Platform:"linux/amd64"}
```
Cloud provider
OS version
```
# On Linux:
$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

$ uname -a
4.18.0-2.4.3.x86_64 #1 SMP Wed Apr 6 06:31:51 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
```
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, …) and versions (if applicable)
About this issue
- Original URL: https://github.com/kubernetes/kubernetes/issues/112319
- State: open
- Created 8 months ago
- Comments: 27 (27 by maintainers)
Definitely not, that would break most users of this library. If listeners want concurrency, they can make their add/update/delete handler add the item to a queue and process it asynchronously.
So this is one of the reasons people should not do heavy processing in the informer's EventHandlers: it blocks the DeltaFIFO queue, increases memory usage, and degrades performance.
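The standard way to follow that advice is the controller pattern client-go itself ships: keep the ResourceEventHandler callbacks down to computing a key and enqueueing it, and do all heavy reconciliation in worker goroutines that drain a workqueue. A minimal sketch (standard client-go packages; error handling trimmed):

```go
package nodecontroller

import (
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

func setupNodeController(cs kubernetes.Interface) (cache.SharedIndexInformer, workqueue.RateLimitingInterface) {
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	factory := informers.NewSharedInformerFactory(cs, 0)
	nodeInformer := factory.Core().V1().Nodes().Informer()

	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		// The callback only computes a key and enqueues it; it never
		// blocks, so pendingNotifications cannot back up behind it.
		UpdateFunc: func(old, cur interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(cur); err == nil {
				queue.Add(key)
			}
		},
	})
	return nodeInformer, queue
}

// runWorker drains the queue; the expensive reconciliation happens here,
// off the informer's event-delivery goroutine.
func runWorker(queue workqueue.RateLimitingInterface) {
	for {
		key, shutdown := queue.Get()
		if shutdown {
			return
		}
		// ... reconcile the object identified by key ...
		queue.Done(key)
	}
}
```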
I think that optimizing shouldIgnoreNodeUpdate to do a more targeted diff on the Node, instead of comparing the whole node object except Conditions and ResourceVersion, would be a quick win. If my reading of the code is correct, just adding or removing an annotation will make the handler process that node.
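For illustration, a "more targeted diff" could look something like the sketch below (hypothetical, not the merged fix; it assumes it lives in pkg/controller/daemon next to nodeInSameCondition): report an update as relevant only when a field that can change the outcome of nodeShouldRunDaemonPod differs.

```go
package daemon

import (
	v1 "k8s.io/api/core/v1"

	apiequality "k8s.io/apimachinery/pkg/api/equality"
)

// nodeUpdateRelevant is a hypothetical targeted predicate: instead of
// DeepEqual over the whole Node minus a few exclusions, compare only the
// fields that can affect whether a daemon pod should run on the node.
func nodeUpdateRelevant(oldNode, curNode *v1.Node) bool {
	return !apiequality.Semantic.DeepEqual(oldNode.Labels, curNode.Labels) ||
		!apiequality.Semantic.DeepEqual(oldNode.Spec.Taints, curNode.Spec.Taints) ||
		oldNode.Spec.Unschedulable != curNode.Spec.Unschedulable ||
		!apiequality.Semantic.DeepEqual(oldNode.Status.Allocatable, curNode.Status.Allocatable) ||
		!nodeInSameCondition(oldNode.Status.Conditions, curNode.Status.Conditions)
}
```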
/cc @wojtek-t for scalability
The cache state is not changed, because this code passes the Node by value (a shallow copy):
https://github.com/kubernetes/kubernetes/blob/94ec99d4c25e16b6f3c9239d9e124be9d45c161b/pkg/controller/daemon/daemon_controller.go#L698-L700
@liggitt