rancher: Monitoring crashes on v2.2.0-rc13
What kind of request is this (question/bug/enhancement/feature request): Bug
Steps to reproduce (least amount of steps as possible): I don't know the exact steps, but this has happened a few times. It looks like Prometheus crashes when there is a bit of load from the Grafana side and then struggles to recover. We generate some network traffic with iperf to test something else and monitor it on the node and cluster Grafana pages with an update rate of 10s.
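For reference, the traffic was generated roughly like this (a minimal sketch, assuming iperf3; the exact flags and duration we use may differ):

```bash
# On one node: start an iperf3 server
iperf3 -s

# On another node: push traffic at it for a few minutes
# (-t = duration in seconds, -P = parallel streams)
iperf3 -c <server-node-ip> -t 300 -P 4
```

While this runs, the cluster and node Grafana dashboards stay open with a 10s refresh interval.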
Result: Prometheus and prometheus-agent crash.
level=warn ts=2019-03-25T14:48:52.452784826Z caller=main.go:295 deprecation_notice="\"storage.tsdb.retention\" flag is deprecated use \"storage.tsdb.retention.time\" instead."
level=info ts=2019-03-25T14:48:52.453046068Z caller=main.go:302 msg="Starting Prometheus" version="(version=2.7.1, branch=HEAD, revision=62e591f928ddf6b3468308b7ac1de1c63aa7fcf3)"
level=info ts=2019-03-25T14:48:52.453152698Z caller=main.go:303 build_context="(go=go1.11.5, user=root@f9f82868fc43, date=20190131-11:16:59)"
level=info ts=2019-03-25T14:48:52.453254273Z caller=main.go:304 host_details="(Linux 4.14.85-rancher #1 SMP Sat Dec 1 12:40:08 UTC 2018 x86_64 prometheus-cluster-monitoring-0 (none))"
level=info ts=2019-03-25T14:48:52.453351163Z caller=main.go:305 fd_limits="(soft=1000000, hard=1000000)"
level=info ts=2019-03-25T14:48:52.453444294Z caller=main.go:306 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2019-03-25T14:48:52.457097107Z caller=web.go:416 component=web msg="Start listening for connections" address=127.0.0.1:9090
level=info ts=2019-03-25T14:48:52.456881773Z caller=main.go:620 msg="Starting TSDB ..."
level=info ts=2019-03-25T14:48:52.462002976Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1553513977668 maxt=1553515200000 ulid=01D6TMPSNVPVZ73PPX7R00MBG9
level=warn ts=2019-03-25T14:48:52.463578724Z caller=wal.go:116 component=tsdb msg="last page of the wal is torn, filling it with zeros" segment=/prometheus/wal/00000003
level=warn ts=2019-03-25T14:49:23.371099985Z caller=head.go:440 component=tsdb msg="unknown series references" count=88527
level=info ts=2019-03-25T14:49:23.885054937Z caller=main.go:635 msg="TSDB started"
level=info ts=2019-03-25T14:49:23.885704301Z caller=main.go:695 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2019-03-25T14:49:24.061997086Z caller=kubernetes.go:201 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-03-25T14:49:24.08395222Z caller=kubernetes.go:201 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-03-25T14:49:24.09450321Z caller=kubernetes.go:201 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-03-25T14:49:24.10044777Z caller=kubernetes.go:201 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-03-25T14:49:24.118638441Z caller=main.go:722 msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2019-03-25T14:49:24.119196738Z caller=main.go:589 msg="Server is ready to receive web requests."
level=error ts=2019-03-25T14:49:27.483144369Z caller=notifier.go:481 component=notifier alertmanager=http://alertmanager-operated.cattle-prometheus:9093/api/v1/alerts count=0 msg="Error sending alert" err="Post http://alertmanager-operated.cattle-prometheus:9093/api/v1/alerts: dial tcp: lookup alertmanager-operated.cattle-prometheus on 10.43.0.10:53: no such host"
INFO[2019-03-25T14:53:47Z] listening on 10.42.224.7:9090, proxying to http://localhost:9090 with ignoring 'remote reader' labels [prometheus,prometheus_replica], only allow maximum 512 connections with 5m0s read timeout .
INFO[2019-03-25T14:53:47Z] Start listening for connections on 10.42.224.7:9090
2019/03/25 14:53:55 http: proxy error: dial tcp 127.0.0.1:9090: connect: connection refused
2019/03/25 14:53:56 http: proxy error: dial tcp 127.0.0.1:9090: connect: connection refused
2019/03/25 14:54:05 http: proxy error: dial tcp 127.0.0.1:9090: connect: connection refused
2019/03/25 14:54:06 http: proxy error: dial tcp 127.0.0.1:9090: connect: connection refused
2019/03/25 14:54:15 http: proxy error: dial tcp 127.0.0.1:9090: connect: connection refused
2019/03/25 14:54:16 http: proxy error: dial tcp 127.0.0.1:9090: connect: connection refused
2019/03/25 14:54:25 http: proxy error: dial tcp 127.0.0.1:9090: connect: connection refused
2019/03/25 14:54:26 http: proxy error: dial tcp 127.0.0.1:9090: connect: connection refused
2019/03/25 14:54:35 http: proxy error: dial tcp 127.0.0.1:9090: connect: connection refused
2019/03/25 14:54:36 http: proxy error: dial tcp 127.0.0.1:9090: connect: connection refused
2019/03/25 14:54:45 http: proxy error: dial tcp 127.0.0.1:9090: connect: connection refused
2019/03/25 14:54:46 http: proxy error: dial tcp 127.0.0.1:9090: connect: connection refused
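The "connection refused" lines above are the agent/proxy failing to reach the Prometheus container on 127.0.0.1:9090 while it is down. A quick way to confirm the container is being killed and restarted (a sketch; the namespace and pod name are taken from the logs above, and the container name "prometheus" is an assumption):

```bash
# Check restart counts in the monitoring namespace
kubectl -n cattle-prometheus get pods

# Inspect the last state of the Prometheus container;
# a lastState of OOMKilled points at the memory limit
kubectl -n cattle-prometheus describe pod prometheus-cluster-monitoring-0

# Logs from the previous (crashed) container instance
kubectl -n cattle-prometheus logs prometheus-cluster-monitoring-0 -c prometheus --previous
```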
Environment information
- Used latest Rancher RC (rancher/rancher:v2.2.0-rc13)
- 3-node HA install via RKE
Cluster information
- Machine type (cloud/VM/metal) and specifications (CPU/memory): VMware, 4 vCPU, 8 GB mem
- Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.4", GitCommit:"c27b913fddd1a6c480c229191a087698aa92f0b1", GitTreeState:"clean", BuildDate:"2019-02-28T13:37:52Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.4", GitCommit:"c27b913fddd1a6c480c229191a087698aa92f0b1", GitTreeState:"clean", BuildDate:"2019-02-28T13:30:26Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
- Docker version (use docker version):
Client:
  Version:       18.06.1-ce
  API version:   1.38
  Go version:    go1.10.3
  Git commit:    e68fc7a
  Built:         Tue Aug 21 17:20:43 2018
  OS/Arch:       linux/amd64
  Experimental:  false

Server:
  Engine:
    Version:       18.06.1-ce
    API version:   1.38 (minimum version 1.12)
    Go version:    go1.10.3
    Git commit:    e68fc7a
    Built:         Tue Aug 21 17:28:38 2018
    OS/Arch:       linux/amd64
    Experimental:  false
Version: Master (4/4/19) ImageID: 10a1ef90fcbd
The UI now has new defaults for cluster monitoring, as pictured below. As you can see, we can now set the Node Exporter CPU Limit and Memory Limit. Additionally, the Prometheus default memory limit is now 1000 MiB, which is a more reasonable value to start with.
I verified that cluster monitoring can be enabled and that these defaults and custom settings take effect. Further, we can make changes here, save, and the node exporter workload will restart the pods so the containers pick up the updated CPU and memory limits (see the sketch below).
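A quick way to confirm the rollout after saving (a sketch; the DaemonSet name and label used here are assumptions and may differ per cluster):

```bash
# Watch the node exporter DaemonSet roll out after changing the limits
kubectl -n cattle-prometheus rollout status daemonset/exporter-node-cluster-monitoring

# Confirm the exporter pods were recreated (young AGE values)
kubectl -n cattle-prometheus get pods -l app=exporter-node
```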
HTTP / UI looks good. There is also now a minimum requirement that the cluster have at least 1200 milliCPU available, and this is working; see #19078.
I smoke tested this area, tried various values for the node exporter CPU/memory limits, and did not encounter any issues. Each time, the pods restarted with the new limits I specified.
The defaults are:
- 30Mi memory reservation (hard-coded default)
- 200Mi memory limit (configurable under Tools > Monitoring)
- 100m CPU reservation (hard-coded default)
- 200m CPU limit (configurable under Tools > Monitoring)
Upon submitting enable/edit monitoring, the default answers are (HTTP):
exporter-node.resources.limits.cpu: "200m"
exporter-node.resources.limits.memory: "200Mi"
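To double-check that these answers actually land on the containers, something like the following works (a sketch; the DaemonSet name is an assumption based on the default cattle-prometheus deployment):

```bash
# Print the resource limits currently set on the node exporter containers
kubectl -n cattle-prometheus get daemonset exporter-node-cluster-monitoring \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{": "}{.resources.limits}{"\n"}{end}'
```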
@rbq Thanks for your update. The current 50Mi limit for the node exporter is too small. We will increase the default value and make it configurable via the Rancher UI. It will be out in 2.2.2.
For anyone else running into the same issue, such as an OOMKill, please increase the Prometheus or node exporter CPU/memory limits on the enable-monitoring page: https://rancher.com/docs/rancher/v2.x/en/cluster-admin/tools/monitoring/#resource-consumption
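If you are unsure how much headroom to give, it helps to watch actual consumption before raising the limits (a sketch; kubectl top requires the cluster metrics API / metrics-server to be available):

```bash
# Current CPU/memory usage of the monitoring pods;
# compare against the limits configured under Tools > Monitoring
kubectl -n cattle-prometheus top pods

# Check whether any container was recently OOMKilled
kubectl -n cattle-prometheus get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
```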