kubernetes: Kubelet sysctl settings cannot be overridden
What happened: The kubelet silently overrides user-defined sysctl parameters.
What you expected to happen:
Values defined by the administrator in `/etc/sysctl.d/*.conf` files should take priority, or at least not be silently overridden.
How to reproduce it (as minimally and precisely as possible):
- Create a file in `/etc/sysctl.d` that conflicts with one that the kubelet sets on startup. `echo "vm.panic_on_oom = 1" > /etc/sysctl.d/30-panic-on-oom.conf` is a good example.
- Reboot.
- After boot, the value specified through sysctl is overridden by the kubelet.
Anything else we need to know?:
This behavior is particularly tricky to debug because the kubelet almost always starts after the systemd-sysctl service. The end result is that the system boots and claims to have successfully applied all user-defined kernel configuration, but inspecting manually with the `sysctl` utility does not show the expected values.
I can understand why the kubelet would want to discover the state of the OOM panic toggle on startup, but the changes made for #15091 appear to have completely hard-coded this behavior, and they actively fight against configuration managed through traditional Linux mechanisms (these parameters would all be valid on the GRUB command line or applied by systemd at startup).
I can see a few options that would be reasonable ways to handle this:
- Offer a configuration option on the kubelet to override the behavior.
- Stop managing sysctl parameters that are not fatal to the kubelet running, and instead ship recommended parameters as sysctl.d config files: either through post-install tooling for the kubelet package in a package manager (deb/rpm), or by letting provisioning tools create those files. Parameters that are considered best practice but are not required for the kubelet to start could emit a warning instead.
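For context, the configuration knob that already exists in this area is `protectKernelDefaults` in the kubelet configuration. As I understand it, it only chooses between "silently modify the sysctls" and "refuse to start"; there is no warn-only or leave-alone mode, which is what the options above ask for. A minimal KubeletConfiguration fragment:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# true: the kubelet errors out if kernel tunables differ from its expectations;
# false (the default): the kubelet silently modifies them, which is the
# behavior this issue complains about.
protectKernelDefaults: true
```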
Environment:
- Kubernetes version (use `kubectl version`): 1.12.4
- Cloud provider or hardware configuration: Azure
- OS (e.g. from /etc/os-release): Ubuntu 16.04
- Kernel (e.g. `uname -a`): 4.15.0-1037-azure
- Install tools: AKS Engine
About this issue
- State: open
- Created 5 years ago
- Reactions: 2
- Comments: 27 (9 by maintainers)
We are attempting to move to v1.19.7 and ran into the issue where the kubelet overwrites the `vm.panic_on_oom = 1` sysctl setting, which is a setting we need enabled. While there is some crude logic behind how the kernel kills PIDs during an OOM situation, in our opinion it still leaves the system in an unpredictable state that may not have resolved the real reason the node ran out of memory in the first place. We believe it would be better to kernel panic, take a core dump for later analysis, and get the node back into a known good state.
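For reference, the panic-and-recover strategy described above maps to a small sysctl fragment like the one below. The file name and the `kernel.panic` timeout are illustrative choices, not values taken from this thread; capturing the actual dump additionally requires a crash-dump mechanism such as kdump to be configured:

```
# /etc/sysctl.d/30-panic-on-oom.conf (illustrative)
vm.panic_on_oom = 1    # panic instead of letting the OOM killer pick victims
kernel.panic = 30      # automatically reboot 30 seconds after a panic
```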
Based on https://github.com/kubernetes/kubernetes/issues/74151#issuecomment-481750191 by @bazzargh, it seems the functionality of changing the `vm.panic_on_oom` setting may no longer be required. Moving the behavior to fail by default and warn, as mentioned in https://github.com/kubernetes/kubernetes/issues/74151#issuecomment-553529178, makes a lot of sense to me. We could keep the feature where the kubelet checks `vm.panic_on_oom` and by default fails to start. When `protectKernelDefaults: true`, a warning would be logged.

Examining the code, it seems we could make the kubelet just drop a warning in the log by changing the case when `protectKernelDefaults: false` to set `b := KernelTunableWarn` at the line below:

https://github.com/kubernetes/kubernetes/blob/b11d0fbdd58394a62622787b38e98a620df82750/pkg/kubelet/cm/container_manager_linux.go#L473

@cchildress thoughts?
Yes, I think that’s their point.
My larger point, and a big part of why I wanted to file this issue, is that I think it is very bad behavior for the kubelet to silently override node configuration that is being explicitly declared by an administrator on the node. I'm perfectly happy to see the kubelet check node conditions and refuse to start and/or print a nice warning if you have something configured on the node that conflicts with the kubelet running correctly. Having it revert configuration changes like this makes debugging more challenging and prevents administrators from tailoring their systems to their needs.
Dup issues: #66693 #90829 #50110
Why is `KernelTunableWarn` not configurable? There is some context in https://github.com/kubernetes/kubernetes/pull/27874#discussion_r68314751

It's time to consider whether "expose `KernelTunableWarn` in kubelet config" (#50110) is needed. 😄

The workaround is to change the sysctls after the kubelet starts, every time.

/kind feature
/remove-lifecycle stale