prometheus: Grafana 4.0.0 causes Prometheus denial of service
What did you do?
Upgraded to Grafana 4.0.0
What did you expect to see?
Snazzy charts as always.
What did you see instead? Under which circumstances?
Prometheus ran out of file handles; it didn’t crash, but it stopped ingesting new data.
Grafana 4.0.0 introduces a bug in which each chart’s data is fetched over a new HTTP connection, and these connections appear to be kept alive by Grafana even though they are never used again. The issue is described here: https://github.com/grafana/grafana/issues/6759
Clearly Grafana needs to fix this on their end, but it exposes a problem: Prometheus can be DoSed simply by opening too many connections. Prometheus should be able to fend off a misbehaving frontend without stopping backend data ingestion.
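For context, a Go HTTP server can cap the number of simultaneously accepted connections so that a misbehaving client exhausts its connection budget rather than the process’s file descriptors. This is an illustrative sketch only, not Prometheus code; the port and the limit of 500 are arbitrary assumptions, and it relies on golang.org/x/net/netutil.LimitListener:

```go
// Illustrative sketch (not Prometheus source): cap concurrent HTTP
// connections so a misbehaving client cannot exhaust file descriptors.
package main

import (
	"fmt"
	"log"
	"net"
	"net/http"

	"golang.org/x/net/netutil"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})

	// Listen on an arbitrary example port.
	ln, err := net.Listen("tcp", ":9090")
	if err != nil {
		log.Fatal(err)
	}

	// Accept at most 500 concurrent connections (arbitrary budget);
	// further connection attempts wait in Accept instead of consuming FDs.
	limited := netutil.LimitListener(ln, 500)

	log.Fatal(http.Serve(limited, mux))
}
```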
Environment
- System information:
Linux 3.13.0-76-generic x86_64
- Prometheus version:
prometheus, version 1.4.1 (branch: master, revision: 2a89e8733f240d3cd57a6520b52c36ac4744ce12)
build user: root@e685d23d8809
build date: 20161128-09:59:22
go version: go1.7.3
- Alertmanager version: N/A
- Prometheus configuration file: N/A
- Alertmanager configuration file: N/A
- Logs:
Thousands of lines like this:
ERRO[2900] http: Accept error: accept tcp [::]:9090: accept4: too many open files; retrying in 1s
ERRO[2898] http: Accept error: accept tcp [::]:9090: accept4: too many open files; retrying in 1s
ERRO[2898] Error refreshing service xxx_exporter: Get http://localhost:8500/v1/catalog/service/xxx_exporter?index=47790194&wait=30000ms: dial tcp 127.0.0.1:8500: socket: too many open files source=consul.go:252
ERRO[2900] Error dropping persisted chunks: open /prometheus/data/1b/949dcb022d6cfb.db: too many open files source=storage.go:1495
WARN[2900] Series quarantined. fingerprint=7795383ca895e3d7 metric=node_netstat_TcpExt_TCPHPAcks{environment="production", host="xxx", instance="xxx", job="node_exporter", role="xxx", zone="xxx"} reason=open /prometheus/data/77/95383ca895e3d7.db: too many open files source=storage.go:1646
ERRO[2900] Error while checkpointing: open /prometheus/data/heads.db.tmp: too many open files source=storage.go:1252
ERRO[2900] Error dropping persisted chunks: open /prometheus/data/1b/94b35810940463.db: too many open files source=storage.go:1495
ERRO[2900] Error while checkpointing: open /prometheus/data/heads.db.tmp: too many open files source=storage.go:1252
WARN[2900] Series quarantined. fingerprint=baf8faede9e50c85 metric=node_vmstat_nr_tlb_local_flush_all{environment="production", host="xxx", instance="xxx", job="node_exporter", role="xxx", zone="xxx"} reason=open /prometheus/data/ba/f8faede9e50c85.db: too many open files source=storage.go:1646
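As a quick diagnostic when errors like these appear, a process on Linux can compare its own open file descriptor count against its soft limit. This is a hypothetical standalone helper, not part of Prometheus, and it assumes Linux (/proc and syscall.Getrlimit):

```go
// Hypothetical Linux-only diagnostic: compare open FDs to the rlimit.
package main

import (
	"fmt"
	"log"
	"os"
	"syscall"
)

func main() {
	// Each entry in /proc/self/fd is one open file descriptor of this process.
	fds, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		log.Fatal(err)
	}

	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		log.Fatal(err)
	}

	fmt.Printf("open fds: %d, soft limit: %d, hard limit: %d\n",
		len(fds), rl.Cur, rl.Max)
}
```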
About this issue
- State: closed
- Created 8 years ago
- Comments: 19 (13 by maintainers)
The reporter stated clearly that although there is an obvious problem with Grafana, it shouldn’t be possible to DoS Prometheus this way. I was hit by this bug as well: all other datasources remained usable for Grafana, but Prometheus did not, and it wasn’t able to collect data either. THAT’s the problem that needs to be addressed in Prometheus.
That’s unfortunately a Grafana problem. Can you file a bug report in their repository?
@stuartnelson3 isn’t that exactly the bug we fixed for them in a past Grafana version?
Thanks everyone for re-reading my suggestion that Prometheus should not stop collecting and writing data when it has a misbehaving client. Increasing the FD limit is basically a whack-a-mole non-solution: for any limit, someone will come along with an even-more-misbehaving client (i.e. build a wall, someone brings a taller ladder). Some ideas I had while thinking about this:
- Determine the process’s file handle limit (via syscall.Getrlimit, for example), figure out how many file handles are needed for the known set of data points, and allow only as many active connections as remain. I don’t necessarily like this idea because it’s magic and possibly platform-dependent.

Ideas 1 & 2 together would be a pretty robust solution.
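A minimal sketch of how that combination could look, assuming a hypothetical reservedForStorage figure and the Linux-specific syscall package; this is not how Prometheus actually implements it:

```go
// Sketch: derive a connection budget from the process's FD limit,
// reserving descriptors for chunk files, checkpoints, and scrapes.
package main

import (
	"log"
	"net"
	"net/http"
	"syscall"

	"golang.org/x/net/netutil"
)

func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		log.Fatal(err)
	}

	// Hypothetical reserve for storage files, heads.db checkpoints, scrapes, etc.
	const reservedForStorage = 2048

	budget := int(rl.Cur) - reservedForStorage
	if budget < 64 {
		budget = 64 // always leave some room for the API/UI
	}

	ln, err := net.Listen("tcp", ":9090")
	if err != nil {
		log.Fatal(err)
	}

	// Only `budget` connections can be accepted at once; the rest wait.
	log.Fatal(http.Serve(netutil.LimitListener(ln, budget), http.DefaultServeMux))
}
```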
Apologies, you are right – I missed that part in my morning dizziness.