prometheus-operator: Unable to run rules-configmap-reloader on Docker 18.09.2 due to 10Mi limit

What did you do? Upgraded Docker to 18.09.2

What did you expect to see? Pods continue running

What did you see instead? Under which circumstances? Pods in a crash loop with the following error:

Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused "process_linux.go:390: setting cgroup config for procHooks process caused \"failed to write 10485760 to memory.limit_in_bytes: write /sys/fs/cgroup/memory/kubepods/podcb7754b6-3117-11e9-853f-da195751b071/rules-configmap-reloader/memory.limit_in_bytes: device or resource busy\""": unknown

Environment

  • Prometheus Operator version: v0.27.0
  • Kubernetes version information:
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.2", GitCommit:"cff46ab41ff0bb44d8584413b598ad8360ec1def", GitTreeState:"clean", BuildDate:"2019-01-10T23:35:51Z", GoVersion:"go1.11.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.0", GitCommit:"ddf47ac13c1a9483ea035a79cd7c10005ff21a6d", GitTreeState:"clean", BuildDate:"2018-12-03T20:56:12Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes cluster kind: kubeadm
  • Docker version: 18.09.2

This seems to be caused by the runc fix for CVE-2019-5736, which copies the runc binary into memory before starting the container; that copy is charged against the container's memory cgroup and pushes the reloader sidecar over its 10Mi limit.
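
For reference, the 10485760 bytes in the error is exactly the 10Mi limit the operator hard-codes for the reloader sidecar. In the generated StatefulSet it shows up as a resources block roughly like the sketch below (container name taken from the error above; other fields are omitted and version-dependent):

    # Sketch of the reloader sidecar's resources in the operator-generated
    # StatefulSet. 10Mi == 10485760 bytes, the value runc fails to write
    # to memory.limit_in_bytes.
    - name: rules-configmap-reloader
      resources:
        limits:
          memory: 10Mi
        requests:
          memory: 10Mi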

Since there are existing PRs open for increasing the memory limit, is it possible to fast-track these into a new release to address the issue for folks mitigating CVE-2019-5736?

PR for increasing the limits directly: https://github.com/coreos/prometheus-operator/pull/2371

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 14
  • Comments: 23 (11 by maintainers)

Most upvoted comments

I updated the Deployment for the Prometheus Operator to use 0.29, and that seems to fix the issue with the Prometheus pods. My Alertmanager pod is still in a CrashLoopBackOff state, but I’ll dig into that separately (not sure whether it is related).

@scottslowe we ran into the same issue here: after upgrading to prometheus-operator 0.29, Alertmanager was still in CrashLoopBackOff because the sidecar was not getting its memory bump. I assume there was some sort of race condition in the upgrade, as all it took to eventually fix it was deleting the Alertmanager pod; it came back with 25MB of memory for the sidecar.
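
For anyone hitting the same thing, a minimal sketch of that workaround, assuming kube-prometheus-style names and a monitoring namespace (adjust to your cluster):

    # Delete the Alertmanager pod so the StatefulSet controller recreates it
    # with the sidecar resources generated by the upgraded operator.
    kubectl -n monitoring delete pod alertmanager-main-0

    # Check that the recreated pod's containers picked up the new memory limits.
    kubectl -n monitoring get pod alertmanager-main-0 \
      -o jsonpath='{.spec.containers[*].resources.limits.memory}'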

I tried to deploy the operator using the stable Helm chart (which uses v0.29), but the configmap-reloader keeps crashing with the same error:

Error: failed to start container "rules-configmap-reloader": Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"process_linux.go:367: setting cgroup config for procHooks process caused \\\"failed to write 10485760 to memory.limit_in_bytes: write /sys/fs/cgroup/memory/kubepods/burstable/podf0efd209-3757-11e9-a0c5-000d3ab695db/rules-configmap-reloader/memory.limit_in_bytes: device or resource busy\\\"\"": unknown

I noticed that #2403 increased the memory and CPU requests, but when I described the pods, the memory requests and limits for that container in both the Alertmanager and Prometheus StatefulSets were still 10Mi. Should I create a separate issue for this? Are there any changes that need to be propagated to the charts repo?
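
In case it helps with triage, a quick way to see what the generated StatefulSets actually specify, again assuming kube-prometheus-style names and namespace (the Helm chart's release names will differ):

    # Print the memory limits of every container in the generated StatefulSets;
    # the reloader sidecars should report more than 10Mi once the bump applies.
    kubectl -n monitoring get statefulset prometheus-k8s \
      -o jsonpath='{.spec.template.spec.containers[*].resources.limits.memory}'
    kubectl -n monitoring get statefulset alertmanager-main \
      -o jsonpath='{.spec.template.spec.containers[*].resources.limits.memory}'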