prometheus: Grafana 4.0.0 causes Prometheus denial of service
What did you do?
Upgraded to Grafana 4.0.0
What did you expect to see?
Snazzy charts as always.
What did you see instead? Under which circumstances?
Prometheus ran out of file handles; it didn’t crash, but it stopped ingesting new data.
Grafana 4.0.0 introduces a bug in which each chart’s data is fetched over a new HTTP connection, and these connections appear to be kept alive by Grafana even though they are never used again. The issue is described here: https://github.com/grafana/grafana/issues/6759
Clearly Grafana needs to fix this on their end, but it exposes a problem: Prometheus can be DoSed simply by opening too many connections. Prometheus should be able to fend off a misbehaving frontend without stopping backend data ingestion.
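For context, a Go HTTP server can cap the number of simultaneously accepted connections so that a misbehaving client exhausts its connection budget rather than the process’s file descriptors. This is an illustrative sketch only, not Prometheus code; the port and the limit of 500 are arbitrary assumptions, and it relies on golang.org/x/net/netutil.LimitListener:

```go
// Illustrative sketch (not Prometheus source): cap concurrent HTTP
// connections so a misbehaving client cannot exhaust file descriptors.
package main

import (
	"fmt"
	"log"
	"net"
	"net/http"

	"golang.org/x/net/netutil"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})

	// Listen on an arbitrary example port.
	ln, err := net.Listen("tcp", ":9090")
	if err != nil {
		log.Fatal(err)
	}

	// Accept at most 500 concurrent connections (arbitrary budget);
	// further connection attempts wait in Accept instead of consuming FDs.
	limited := netutil.LimitListener(ln, 500)

	log.Fatal(http.Serve(limited, mux))
}
```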
Environment
- System information:
Linux 3.13.0-76-generic x86_64
- Prometheus version:
prometheus, version 1.4.1 (branch: master, revision: 2a89e8733f240d3cd57a6520b52c36ac4744ce12)
build user: root@e685d23d8809
build date: 20161128-09:59:22
go version: go1.7.3
- Alertmanager version: N/A
- Prometheus configuration file: N/A
- Alertmanager configuration file: N/A
- Logs:
Thousands of lines like this:
ERRO[2900] http: Accept error: accept tcp [::]:9090: accept4: too many open files; retrying in 1s
ERRO[2898] http: Accept error: accept tcp [::]:9090: accept4: too many open files; retrying in 1s
ERRO[2898] Error refreshing service xxx_exporter: Get http://localhost:8500/v1/catalog/service/xxx_exporter?index=47790194&wait=30000ms: dial tcp 127.0.0.1:8500: socket: too many open files source=consul.go:252
ERRO[2900] Error dropping persisted chunks: open /prometheus/data/1b/949dcb022d6cfb.db: too many open files source=storage.go:1495
WARN[2900] Series quarantined. fingerprint=7795383ca895e3d7 metric=node_netstat_TcpExt_TCPHPAcks{environment="production", host="xxx", instance="xxx", job="node_exporter", role="xxx", zone="xxx"} reason=open /prometheus/data/77/95383ca895e3d7.db: too many open files source=storage.go:1646
ERRO[2900] Error while checkpointing: open /prometheus/data/heads.db.tmp: too many open files source=storage.go:1252
ERRO[2900] Error dropping persisted chunks: open /prometheus/data/1b/94b35810940463.db: too many open files source=storage.go:1495
ERRO[2900] Error while checkpointing: open /prometheus/data/heads.db.tmp: too many open files source=storage.go:1252
WARN[2900] Series quarantined. fingerprint=baf8faede9e50c85 metric=node_vmstat_nr_tlb_local_flush_all{environment="production", host="xxx", instance="xxx", job="node_exporter", role="xxx", zone="xxx"} reason=open /prometheus/data/ba/f8faede9e50c85.db: too many open files source=storage.go:1646
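As a quick diagnostic when errors like these appear, a process on Linux can compare its own open file descriptor count against its soft limit. This is a hypothetical standalone helper, not part of Prometheus, and it assumes Linux (/proc and syscall.Getrlimit):

```go
// Hypothetical Linux-only diagnostic: compare open FDs to the rlimit.
package main

import (
	"fmt"
	"log"
	"os"
	"syscall"
)

func main() {
	// Each entry in /proc/self/fd is one open file descriptor of this process.
	fds, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		log.Fatal(err)
	}

	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		log.Fatal(err)
	}

	fmt.Printf("open fds: %d, soft limit: %d, hard limit: %d\n",
		len(fds), rl.Cur, rl.Max)
}
```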
About this issue
- State: closed
- Created 8 years ago
- Comments: 19 (13 by maintainers)
The reporter stated clearly that although there is an obvious problem with Grafana, it shouldn’t be possible to DoS Prometheus this way. I was hit by this bug as well: all other datasources remained usable for Grafana, but Prometheus did not, and it wasn’t able to collect data either. THAT’s the problem that needs to be addressed in Prometheus.
That’s unfortunately a Grafana problem. Can you file a bug report in their repository?
@stuartnelson3 isn’t that exactly the bug we fixed for them in a past Grafana version?
Thanks everyone for re-reading my suggestion that Prometheus should not stop collecting and writing data when it has a misbehaving client. Increasing the FD limit is basically a whack-a-mole non-solution: for any limit, someone will come along with an even-more-misbehaving client (i.e. build a wall, someone brings a taller ladder). Some ideas I had while thinking about this:
- Determine the process’s file handle limit (via syscall.Getrlimit, for example), figure out how many file handles are needed for the known set of data points, and allow only as many active connections as remain. I don’t necessarily like this idea because it’s magic and possibly platform-dependent.

Ideas 1 & 2 together would be a pretty robust solution.
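A minimal sketch of how that combination could look, assuming a hypothetical reservedForStorage figure and the Linux-specific syscall package; this is not how Prometheus actually implements it:

```go
// Sketch: derive a connection budget from the process's FD limit,
// reserving descriptors for chunk files, checkpoints, and scrapes.
package main

import (
	"log"
	"net"
	"net/http"
	"syscall"

	"golang.org/x/net/netutil"
)

func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		log.Fatal(err)
	}

	// Hypothetical reserve for storage files, heads.db checkpoints, scrapes, etc.
	const reservedForStorage = 2048

	budget := int(rl.Cur) - reservedForStorage
	if budget < 64 {
		budget = 64 // always leave some room for the API/UI
	}

	ln, err := net.Listen("tcp", ":9090")
	if err != nil {
		log.Fatal(err)
	}

	// Only `budget` connections can be accepted at once; the rest wait.
	log.Fatal(http.Serve(netutil.LimitListener(ln, budget), http.DefaultServeMux))
}
```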
Apologies, you are right – I missed that part in my morning dizziness.