prometheus-operator: Unrecoverable OOM

I tested prometheus-operator with avalanche (https://github.com/open-fresh/avalanche), a tool that creates mock scrape targets for Prometheus to scrape. The Prometheus instance was created with CPU and memory requests and limits set so that Kubernetes assigned it the Guaranteed QoS class. The instance was also configured with persistent storage.
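For reference, a minimal sketch of how such an instance can be declared through the operator's Prometheus custom resource. The namespace, selector label, storage class, and volume size below are assumptions for illustration, not the exact values used:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring              # assumed namespace
spec:
  serviceMonitorSelector:
    matchLabels:
      app: avalanche                 # assumed label, must match the ServiceMonitor
  resources:                         # requests == limits gives the Guaranteed QoS class
    requests:
      cpu: "1"
      memory: 4Gi                    # the report's 4G, written as a Kubernetes quantity
    limits:
      cpu: "1"
      memory: 4Gi
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: gp2        # assumed AWS EBS storage class
        resources:
          requests:
            storage: 50Gi            # assumed volume size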

Then I increased the load in steps by scaling up the avalanche deployment until, at one point, Prometheus crashed due to OOM. After the OOM crash it never recovered.

Did you expect to see something different? I expected the Prometheus instance to recover successfully after the OOM.

How to reproduce it (as minimally and precisely as possible): Configure a Prometheus instance with 1 CPU and 4 GB of memory. Configure a ServiceMonitor to scrape from avalanche. Start an avalanche deployment with 8 replicas.
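A rough sketch of the avalanche side of that setup; the image path, metrics port, and labels are assumptions and may need adjusting for the actual avalanche defaults:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: avalanche
  labels:
    app: avalanche
spec:
  replicas: 8                        # the deployment size referred to above
  selector:
    matchLabels:
      app: avalanche
  template:
    metadata:
      labels:
        app: avalanche
    spec:
      containers:
        - name: avalanche
          image: quay.io/freshtracks.io/avalanche:latest   # assumed image path
          ports:
            - name: metrics
              containerPort: 9001                          # assumed default port
---
apiVersion: v1
kind: Service
metadata:
  name: avalanche
  labels:
    app: avalanche
spec:
  selector:
    app: avalanche
  ports:
    - name: metrics
      port: 9001
      targetPort: metrics
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: avalanche
  labels:
    app: avalanche                   # must match the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: avalanche
  endpoints:
    - port: metrics
      interval: 15s                  # assumed scrape interval

Load can then be increased step by step with kubectl scale deployment avalanche --replicas=<n> until Prometheus is OOM-killed.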

Environment: Kubernetes running on AWS (non-EKS).

  • Prometheus Operator version: quay.io/coreos/prometheus-operator:v0.29.0

  • Kubernetes version information (kubectl version):

    Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-20T04:49:16Z", GoVersion:"go1.12.6", Compiler:"gc", Platform:"darwin/amd64"}
    Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.9", GitCommit:"e09f5c40b55c91f681a46ee17f9bc447eeacee57", GitTreeState:"clean", BuildDate:"2019-05-27T15:58:45Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}

  • Kubernetes cluster kind: Custom cluster on AWS.

  • Manifests: not provided

  • Prometheus Operator Logs: not provided

Anything else we need to know?: On restart after a crash, Prometheus replays the WAL and reconciles its data chunks, which can take a while; only after this initial processing does it start accepting connections. The readiness and liveness probes the operator configures for the Prometheus pod only work on clean starts; on a restart after an OOM they are too aggressive. I suggest adding an init container that recovers all data chunks before Prometheus is started normally.
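For illustration, a rough sketch of the shape such a change could take on the Prometheus resource. The initContainers field shown here, the wal-recovery image, and its command are hypothetical: they describe the proposal, not a feature that exists in operator v0.29.0:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  # Hypothetical: run a WAL/chunk recovery step against the same data volume
  # before the prometheus container starts and its probes begin firing.
  initContainers:
    - name: wal-recovery
      image: example.org/wal-recovery:latest         # hypothetical image
      command: ["/bin/recover-wal", "/prometheus"]   # hypothetical command
      volumeMounts:
        - name: prometheus-prometheus-db             # volume name depends on the generated StatefulSet
          mountPath: /prometheus

Because init containers are not subject to liveness or readiness probes, a slow recovery would not be killed by them; the main prometheus container would then start against already-reconciled data.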

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 18 (6 by maintainers)

Most upvoted comments

I came across this issue recently. My Prometheus deployment, with everything default, was working for about a week and then started OOM-crashing. I have tried increasing the initial delay of the liveness and readiness probes (a sketch of such an override follows the pod status below), but the OOM happens during WAL replay, like this:

The prometheus container log

level=info ts=2020-04-10T07:18:06.802Z caller=main.go:331 msg="Starting Prometheus" version="(version=2.16.0, branch=HEAD, revision=b90be6f32a33c03163d700e1452b54454ddce0ec)"
level=info ts=2020-04-10T07:18:06.802Z caller=main.go:332 build_context="(go=go1.13.8, user=root@7ea0ae865f12, date=20200213-23:50:02)"
level=info ts=2020-04-10T07:18:06.803Z caller=main.go:333 host_details="(Linux 4.19.0-0.bpo.6-amd64 #1 SMP Debian 4.19.67-2+deb10u2~bpo9+1 (2019-11-12) x86_64 prometheus-prometheus-0 (none))"
level=info ts=2020-04-10T07:18:06.803Z caller=main.go:334 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2020-04-10T07:18:06.803Z caller=main.go:335 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2020-04-10T07:18:06.809Z caller=main.go:661 msg="Starting TSDB ..."
level=info ts=2020-04-10T07:18:06.810Z caller=web.go:508 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2020-04-10T07:18:06.822Z caller=head.go:577 component=tsdb msg="replaying WAL, this may take awhile"
level=info ts=2020-04-10T07:18:06.823Z caller=head.go:625 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=132
level=info ts=2020-04-10T07:18:17.191Z caller=head.go:625 component=tsdb msg="WAL segment loaded" segment=1 maxSegment=132
level=info ts=2020-04-10T07:18:25.896Z caller=head.go:625 component=tsdb msg="WAL segment loaded" segment=2 maxSegment=132
level=info ts=2020-04-10T07:18:35.005Z caller=head.go:625 component=tsdb msg="WAL segment loaded" segment=3 maxSegment=132
level=info ts=2020-04-10T07:18:48.290Z caller=head.go:625 component=tsdb msg="WAL segment loaded" segment=4 maxSegment=132
level=info ts=2020-04-10T07:18:55.883Z caller=head.go:625 component=tsdb msg="WAL segment loaded" segment=5 maxSegment=132

The pod status

prometheus-prometheus-0                          2/3     CrashLoopBackOff    1          117s
prometheus-prometheus-0                          2/3     Running             2          2m13s
prometheus-prometheus-0                          2/3     OOMKilled           2          3m10s
prometheus-prometheus-0                          2/3     CrashLoopBackOff    2          3m12s
prometheus-prometheus-0                          2/3     Running             3          3m36s
prometheus-prometheus-0                          2/3     OOMKilled           3          4m33s
prometheus-prometheus-0                          2/3     CrashLoopBackOff    3          4m37s
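The probe tweak mentioned above can be attempted by overriding the operator-generated prometheus container through the Prometheus CRD's containers field, where entries are merged by name into the generated pod spec. The sketch below assumes that field is available in the deployed operator version, that the container and its port are named prometheus and web, and that these delay values are reasonable; none of that is taken from the original report:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  containers:
    - name: prometheus               # merged with the operator-generated container of the same name
      readinessProbe:
        httpGet:
          path: /-/ready
          port: web
        initialDelaySeconds: 300     # give WAL replay more time before probes start failing
      livenessProbe:
        httpGet:
          path: /-/healthy
          port: web
        initialDelaySeconds: 300

As the log above shows, though, this does not help when the replay itself exhausts the memory limit: the container is OOM-killed regardless of probe timing.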

This issue has been automatically marked as stale because it has not had any activity in the last 60 days. Thank you for your contributions.