kops: Kubelet 'failed to get cgroup stats for "/system.slice/kubelet.service"' error messages

  1. What kops version are you running? The command kops version will display this information. Version 1.8.0 (git-5099bc5)

  2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

Client Version: version.Info{Major:"1", Minor:"7+", GitVersion:"v1.7.9-dirty", GitCommit:"7f63532e4ff4fbc7cacd96f6a95b50a49a2dc41b", GitTreeState:"dirty", BuildDate:"2017-10-26T22:33:15Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.4", GitCommit:"9befc2b8928a9426501d3bf62f72849d5cbcd5a3", GitTreeState:"clean", BuildDate:"2017-11-20T05:17:43Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
  3. What cloud provider are you using? AWS

  4. What commands did you run? What is the simplest way to reproduce this issue?

Provisioned a 1.8.4 cluster by:

  • Creating a basic cluster with kops
  • Exporting that cluster to a manifest file
  • Adding my changes to the manifest
  • Generating Terraform configs (a rough sketch of these commands follows the answers below)
  5. What happened after the commands executed? I get the following error message in the daemon logs unless I add the following to the manifest:
  kubelet:
    kubeletCgroups: "/systemd/system.slice"
    runtimeCgroups: "/systemd/system.slice"
  masterKubelet:
    kubeletCgroups: "/systemd/system.slice"
    runtimeCgroups: "/systemd/system.slice"
Dec 12 22:12:14 ip-172-20-64-61 kubelet[30742]: E1212 22:12:14.322073   30742 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get container info for "/system.slice/kubelet.service": unknown container "/system.slice/kubelet.service"
  6. What did you expect to happen? No error messages.
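
A rough sketch of the reproduction steps above with the kops CLI (the cluster name, state store, and zone are placeholders, not values from the report):

export KOPS_STATE_STORE=s3://your-kops-state-store      # placeholder
export NAME=your.cluster.example.com                    # placeholder
kops create cluster --zones=us-east-1a $NAME            # create a basic cluster
kops get cluster $NAME -o yaml > cluster.yaml           # export it to a manifest file
# add your changes to cluster.yaml, then push the manifest back
kops replace -f cluster.yaml
kops update cluster $NAME --target=terraform --out=.    # generate the Terraform configs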

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 6
  • Comments: 32 (13 by maintainers)

Most upvoted comments

See https://github.com/kontena/pharos-cluster/issues/440#issuecomment-399014418 for why the --runtime-cgroups=/systemd/system.slice --kubelet-cgroups=/systemd/system.slice workaround is a bad idea on CentOS: the extra /systemd prefix causes the kubelet and dockerd processes to escape from their correct systemd cgroups into a new /systemd/system.slice cgroup created next to the real /system.slice cgroup.

I’m not entirely sure how much the systemd cgroup names differ across OSes, but I assume the workaround was actually meant to be --runtime-cgroups=/system.slice --kubelet-cgroups=/system.slice… that’s only slightly better (https://github.com/kontena/pharos-cluster/issues/440#issuecomment-399017863): the processes still escape from the systemd kubelet.service / docker.service cgroups, and the kubelet /stats/summary API still reports the wrong numbers.

The correct fix is to enable systemd CPUAccounting and MemoryAccounting for the kubelet.service… this causes systemd to create the missing /system.slice/*.service cgroups for all services, and matches what happens by default on e.g. Ubuntu xenial. This allows the kubelet /stats/summary API to report the correct systemContainer metrics for the runtime and kubelet: https://github.com/kontena/pharos-cluster/issues/440#issuecomment-399022473

I think these systemd settings should be shipped as part of the upstream kubelet package’s kubelet.service, if the kubelet assumes that systemd creates those cgroups?

/etc/systemd/system/kubelet.service.d/11-cgroups.conf

[Service]
CPUAccounting=true
MemoryAccounting=true
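
A minimal sketch of applying the drop-in above and checking that the missing cgroups appear (assumes a systemd-managed kubelet and cgroup v1, as on the distributions discussed here):

sudo systemctl daemon-reload
sudo systemctl restart kubelet
# the kubelet's own cgroup should now exist under /system.slice
systemd-cgls /system.slice/kubelet.service
# and the cpu/memory accounting hierarchies should be populated for it
ls /sys/fs/cgroup/cpu/system.slice/kubelet.service /sys/fs/cgroup/memory/system.slice/kubelet.service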

@mercantiandrea @oliverseal Seems you’re doing manually what’s done automatically for you, i.e. https://github.com/kubernetes/kops/issues/4049#issuecomment-352152838. You only need to add these when you edit your cluster:

spec:
  kubelet:
    kubeletCgroups: "/systemd/system.slice"
    runtimeCgroups: "/systemd/system.slice"
  masterKubelet:
    kubeletCgroups: "/systemd/system.slice"
    runtimeCgroups: "/systemd/system.slice"
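
If it helps, a sketch of wiring that in via the kops CLI (the cluster name and state store are placeholders):

export KOPS_STATE_STORE=s3://your-kops-state-store           # placeholder
kops edit cluster your.cluster.example.com                   # add the spec snippet above
kops update cluster your.cluster.example.com --yes
kops rolling-update cluster your.cluster.example.com --yes   # roll the nodes so the kubelets pick up the new flags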

This workaround also works on the default AWS image for kops (k8s-1.8-debian-jessie-amd64-hvm-ebs-2017-12-02, ami-bd229ec4):

sudo vim /etc/sysconfig/kubelet

Add the following at the end of the DAEMON_ARGS string:

--runtime-cgroups=/systemd/system.slice --kubelet-cgroups=/systemd/system.slice

Finally, restart the kubelet:

sudo systemctl restart kubelet
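
For illustration only, the resulting line in /etc/sysconfig/kubelet would then look something like this (the existing flags are elided and are not part of this workaround):

DAEMON_ARGS="<existing kubelet flags> --runtime-cgroups=/systemd/system.slice --kubelet-cgroups=/systemd/system.slice"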

But I think this problem should be fixed in kops itself.

Hi all, this is not fixed in 1.11 😦. I just set up release 1.11 and still have the bug. The config file has changed; it’s now a YAML file. Where can we add the workaround? Thanks.

We should fix this in 1.11

I think his solution of adding:

  kubelet:
    kubeletCgroups: "/systemd/system.slice"
    runtimeCgroups: "/systemd/system.slice"
  masterKubelet:
    kubeletCgroups: "/systemd/system.slice"
    runtimeCgroups: "/systemd/system.slice"

as a default seems sane; this slice does exist and may help provide some metrics (I’m not sure where they’re exposed, or what we’re missing without them yet).
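
One place these numbers surface is the kubelet summary API mentioned earlier in the thread; a quick way to inspect the systemContainers section through the apiserver proxy (the node name is a placeholder, and jq is assumed to be installed):

kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary" | jq '.node.systemContainers'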

@itskingori The point is that this should be the default setting, instead of us going into the config and making changes, since we know the default values are wrong. One by one, things like this add up, and you end up with a long list of custom settings you need to apply every time you create a cluster.

We can close this, but I wouldn’t call it a duplicate of #3762 unless you change the title: that issue is a recommendation to follow best practices, while this one is about an error message. I’m trying to prepare new 1.8 clusters for critical production workloads. Can someone clarify whether:

  1. This error message is benign and should be ignored.
  2. The listed workaround using kubeletCgroups and runtimeCgroups should be used until #3762 is addressed.
  3. We aren’t sure what the impact of this error message is and therefore clusters experiencing this error message should probably not be used in a production environment.

@oliverseal You can modify the user-data bash script in the Terraform template. It is base64-encoded, but you can easily decode it, modify it, and encode it again.

Added this to the userdata after “download-release”:

sed -i 's@--network-plugin-dir=/opt/cni/bin/@--network-plugin-dir=/opt/cni/bin/ --runtime-cgroups=/systemd/system.slice --kubelet-cgroups=/systemd/system.slice@' /etc/sysconfig/kubelet
systemctl restart kubelet

and it seems to be working fine.
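
For reference, the decode/patch/re-encode round-trip described above might look roughly like this (the file name is a placeholder; where kops' Terraform output keeps the base64 user data varies, so treat this as a sketch of the approach rather than exact commands):

base64 -d user_data.b64 > user_data.sh
# edit user_data.sh and add the two lines shown above after the "download-release" step
vi user_data.sh
base64 -w0 user_data.sh > user_data.b64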