kops: Kubelet 'failed to get cgroup stats for "/system.slice/kubelet.service"' error messages

  1. What kops version are you running? The command kops version will display this information. Version 1.8.0 (git-5099bc5)

  2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

Client Version: version.Info{Major:"1", Minor:"7+", GitVersion:"v1.7.9-dirty", GitCommit:"7f63532e4ff4fbc7cacd96f6a95b50a49a2dc41b", GitTreeState:"dirty", BuildDate:"2017-10-26T22:33:15Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.4", GitCommit:"9befc2b8928a9426501d3bf62f72849d5cbcd5a3", GitTreeState:"clean", BuildDate:"2017-11-20T05:17:43Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
  3. What cloud provider are you using? AWS

  4. What commands did you run? What is the simplest way to reproduce this issue?

Provisioned a 1.8.4 cluster by:

  • Creating a basic cluster with kops
  • Exporting that cluster to a manifest file
  • Adding my changes to the manifest
  • Generating Terraform configs (a rough sketch of these commands follows the answers below)
  5. What happened after the commands executed? I get the following error message in the daemon logs unless I add the following to the manifest:
  kubelet:
    kubeletCgroups: "/systemd/system.slice"
    runtimeCgroups: "/systemd/system.slice"
  masterKubelet:
    kubeletCgroups: "/systemd/system.slice"
    runtimeCgroups: "/systemd/system.slice"
Dec 12 22:12:14 ip-172-20-64-61 kubelet[30742]: E1212 22:12:14.322073   30742 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get container info for "/system.slice/kubelet.service": unknown container "/system.slice/kubelet.service"
  6. What did you expect to happen? No error messages.
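
A rough sketch of the reproduction steps above with the kops CLI (the cluster name, state store, and zone are placeholders, not values from the report):

export KOPS_STATE_STORE=s3://your-kops-state-store      # placeholder
export NAME=your.cluster.example.com                    # placeholder
kops create cluster --zones=us-east-1a $NAME            # create a basic cluster
kops get cluster $NAME -o yaml > cluster.yaml           # export it to a manifest file
# add your changes to cluster.yaml, then push the manifest back
kops replace -f cluster.yaml
kops update cluster $NAME --target=terraform --out=.    # generate the Terraform configs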

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 6
  • Comments: 32 (13 by maintainers)

Most upvoted comments

See https://github.com/kontena/pharos-cluster/issues/440#issuecomment-399014418 for why the --runtime-cgroups=/systemd/system.slice --kubelet-cgroups=/systemd/system.slice workaround is a bad idea on CentOS: the extra /systemd prefix causes the kubelet and dockerd processes to escape from their correct systemd cgroups into a new /systemd/system.slice cgroup created next to the real /system.slice cgroup.

I’m not entirely sure how much the systemd cgroup names differ across OSes, but I assume the workaround was actually meant to be --runtime-cgroups=/system.slice --kubelet-cgroups=/system.slice… that’s only slightly better (https://github.com/kontena/pharos-cluster/issues/440#issuecomment-399017863): the processes still escape from the systemd kubelet.service / docker.service cgroups, and the kubelet /stats/summary API still reports the wrong numbers.

The correct fix is to enable systemd CPUAccounting and MemoryAccounting for the kubelet.service… this causes systemd to create the missing /system.slice/*.service cgroups for all services, and matches what happens by default on e.g. Ubuntu xenial. This allows the kubelet /stats/summary API to report the correct systemContainer metrics for the runtime and kubelet: https://github.com/kontena/pharos-cluster/issues/440#issuecomment-399022473

I think these systemd settings should be shipped as part of the upstream kubelet package’s kubelet.service, if the kubelet assumes that systemd creates those cgroups?

/etc/systemd/system/kubelet.service.d/11-cgroups.conf

[Service]
CPUAccounting=true
MemoryAccounting=true
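
A minimal sketch of applying the drop-in above and checking that the missing cgroups appear (assumes a systemd-managed kubelet and cgroup v1, as on the distributions discussed here):

sudo systemctl daemon-reload
sudo systemctl restart kubelet
# the kubelet's own cgroup should now exist under /system.slice
systemd-cgls /system.slice/kubelet.service
# and the cpu/memory accounting hierarchies should be populated for it
ls /sys/fs/cgroup/cpu/system.slice/kubelet.service /sys/fs/cgroup/memory/system.slice/kubelet.service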

@mercantiandrea @oliverseal Seems you’re doing manually what’s done automatically for you, i.e. https://github.com/kubernetes/kops/issues/4049#issuecomment-352152838. You only need to add these when you edit your cluster:

spec:
  kubelet:
    kubeletCgroups: "/systemd/system.slice"
    runtimeCgroups: "/systemd/system.slice"
  masterKubelet:
    kubeletCgroups: "/systemd/system.slice"
    runtimeCgroups: "/systemd/system.slice"
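
If it helps, a sketch of wiring that in via the kops CLI (the cluster name and state store are placeholders):

export KOPS_STATE_STORE=s3://your-kops-state-store           # placeholder
kops edit cluster your.cluster.example.com                   # add the spec snippet above
kops update cluster your.cluster.example.com --yes
kops rolling-update cluster your.cluster.example.com --yes   # roll the nodes so the kubelets pick up the new flags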

This workaround also works on the default AWS image for kops (k8s-1.8-debian-jessie-amd64-hvm-ebs-2017-12-02, ami-bd229ec4):

sudo vim /etc/sysconfig/kubelet

Add the following at the end of the DAEMON_ARGS string:

--runtime-cgroups=/systemd/system.slice --kubelet-cgroups=/systemd/system.slice

Finally, restart the kubelet:

sudo systemctl restart kubelet
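
For illustration only, the resulting line in /etc/sysconfig/kubelet would then look something like this (the existing flags are elided and are not part of this workaround):

DAEMON_ARGS="<existing kubelet flags> --runtime-cgroups=/systemd/system.slice --kubelet-cgroups=/systemd/system.slice"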

But I think this problem should be fixed in kops itself.

Hi all, this is not fixed in 1.11 😦. I just set up release 1.11 and still have the bug. The config file has changed; it’s now a YAML file. Where can we add the workaround? Thanks.

We should fix this in 1.11

I think his solution of adding:

  kubelet:
    kubeletCgroups: "/systemd/system.slice"
    runtimeCgroups: "/systemd/system.slice"
  masterKubelet:
    kubeletCgroups: "/systemd/system.slice"
    runtimeCgroups: "/systemd/system.slice"

as a default seems sane; this slice does exist and may help provide some metrics (I’m not sure where they’re exposed, or what we’re missing without them yet).
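
One place these numbers surface is the kubelet summary API mentioned earlier in the thread; a quick way to inspect the systemContainers section through the apiserver proxy (the node name is a placeholder, and jq is assumed to be installed):

kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary" | jq '.node.systemContainers'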

@itskingori The point is that this should be the default setting, instead of us going into the config and making changes, since we know the default values are wrong. One by one, things like this add up, and you end up with a long list of custom settings you need to apply every time you create a cluster.

We can close this, but I wouldn’t call it a duplicate of #3762 unless you change the title: that issue is a recommendation to follow best practices, while this one is about an error message. I’m trying to prepare new 1.8 clusters for critical production workloads. Can someone clarify whether:

  1. This error message is benign and should be ignored.
  2. The listed workaround using kubeletCgroups and runtimeCgroups should be used until #3762 is addressed.
  3. We aren’t sure what the impact of this error message is and therefore clusters experiencing this error message should probably not be used in a production environment.

@oliverseal You can modify the user-data bash script in the Terraform template. It is base64-encoded, but you can easily decode it, modify it, and encode it again.

Added this to the userdata after “download-release”:

sed -i 's@--network-plugin-dir=/opt/cni/bin/@--network-plugin-dir=/opt/cni/bin/ --runtime-cgroups=/systemd/system.slice --kubelet-cgroups=/systemd/system.slice@' /etc/sysconfig/kubelet
systemctl restart kubelet

and it seems to be working fine.
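
For reference, the decode/patch/re-encode round-trip described above might look roughly like this (the file name is a placeholder; where kops' Terraform output keeps the base64 user data varies, so treat this as a sketch of the approach rather than exact commands):

base64 -d user_data.b64 > user_data.sh
# edit user_data.sh and add the two lines shown above after the "download-release" step
vi user_data.sh
base64 -w0 user_data.sh > user_data.b64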