prometheus: Leaking fds until rlimits reached
Bug Report
What did you do?
Ran Prometheus with Consul service discovery.
What did you expect to see?
Prometheus should not become unresponsive or leak file descriptors.
What did you see instead? Under which circumstances?
Prometheus leaked file descriptors until the process rlimit was reached; TSDB compaction and Consul service discovery then started failing with "too many open files" (see logs below) and Prometheus became unresponsive.
Environment
- System information:
  Linux 4.15.0-30-generic x86_64
- Prometheus version:
  2.3.2
- Prometheus configuration file:
global:
  scrape_interval: 30s     # Set the scrape interval to every 30 seconds. Default is every 1 minute.
  evaluation_interval: 30s # Evaluate rules every 30 seconds. The default is every 1 minute.

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'overwritten-default'
    consul_sd_configs:
      - server: 'yy:8500'
        services: []
        tag: 'metrics'
    relabel_configs:
      - source_labels: ['__meta_consul_service']
        regex: '(.*)'
        target_label: 'job'
        replacement: '$1'
      - source_labels: ['__meta_consul_address']
        regex: '(.*)'
        target_label: '__meta_consul_service_address'
        replacement: '$1'
        action: 'replace'
      - source_labels: ['__meta_consul_service_id']
        regex: '(.*)'
        target_label: 'instance'
        replacement: '$1'
      - source_labels: ['__meta_consul_service']
        regex: 'containerd'
        target_label: '__metrics_path__'
        replacement: '/v1/metrics'
- Logs:
Aug 14 05:30:59 aegis-03 boss[2388]: level=error ts=2018-08-14T09:30:59.703921368Z caller=compact.go:432 component=tsdb msg="removed tmp folder after failed compaction" err="open /var/lib/prometheus/01CMVWCBQQS6RTQMR7PVRSD1BR.tmp: too many open files"
Aug 14 05:30:59 aegis-03 boss[2388]: level=error ts=2018-08-14T09:30:59.703982733Z caller=db.go:272 component=tsdb msg="compaction failed" err="persist head block: open chunk writer: open /var/lib/prometheus/01CMVWCBQQS6RTQMR7PVRSD1BR.tmp/chunks: too many open files"
Aug 14 05:31:06 aegis-03 boss[2388]: level=error ts=2018-08-14T09:31:06.309900608Z caller=consul.go:460 component="discovery manager scrape" discovery=consul msg="Error refreshing service" service=metrics tag=metrics err="Get http://j:8500/v1/catalog/service/metrics?stale=&tag=metrics&wait=30000ms: dial tcp: lookup f on 192.168.1.42:53: dial udp 192.168.1.42:53: socket: too many open files"
Aug 14 05:31:36 aegis-03 boss[2388]: level=error ts=2018-08-14T09:31:36.313416194Z caller=consul.go:460 component="discovery manager scrape" discovery=consul msg="Error refreshing service" service=metrics tag=metrics err="Get http://aegis-03.node.aegis:8500/v1/catalog/service/metrics?stale=&tag=metrics&wait=30000ms: dial tcp: lookup v on 192.168.1.42:53: dial udp 192.168.1.42:53: socket: too many open files"
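A quick way to confirm the leak while reproducing it is to watch Prometheus's own process_open_fds metric on /metrics, or to poll /proc/&lt;pid&gt;/fd from outside the process. Below is a minimal external sketch of the latter; the PID and polling interval are placeholders, not values taken from this report.

```go
// Hedged diagnostic sketch: counts entries under /proc/<pid>/fd once a minute
// so fd growth is visible long before the rlimit is hit.
package main

import (
	"fmt"
	"os"
	"time"
)

// countFDs returns the number of open file descriptors for the given PID.
func countFDs(pid int) (int, error) {
	entries, err := os.ReadDir(fmt.Sprintf("/proc/%d/fd", pid))
	if err != nil {
		return 0, err
	}
	return len(entries), nil
}

func main() {
	pid := 2388 // placeholder: substitute the Prometheus PID on your host
	for {
		n, err := countFDs(pid)
		if err != nil {
			fmt.Fprintln(os.Stderr, "reading /proc failed:", err)
			os.Exit(1)
		}
		fmt.Printf("%s open_fds=%d\n", time.Now().Format(time.RFC3339), n)
		time.Sleep(time.Minute)
	}
}
```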
About this issue
- State: closed
- Created 6 years ago
- Comments: 15 (11 by maintainers)
Well I can tell you that it isn’t expected 😉
Probably just a small bug somewhere: on reload, old connections or clients aren't being closed before new ones are created.
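If that is indeed the cause, the fix amounts to tearing down the old HTTP transport when a discoverer is rebuilt on reload. A minimal sketch of that pattern follows, assuming the leak comes from transports surviving a reload; the names (discoverer, reload, newTransport) are illustrative, not the actual Prometheus code.

```go
// Illustrative sketch of closing the previous client's connections on reload.
package main

import (
	"net/http"
	"sync"
)

type discoverer struct {
	mu        sync.Mutex
	transport *http.Transport
	client    *http.Client
}

// reload swaps in a freshly configured transport and releases the
// connections held by the previous one.
func (d *discoverer) reload(newTransport *http.Transport) {
	d.mu.Lock()
	defer d.mu.Unlock()

	if d.transport != nil {
		// Without this call the old transport keeps its idle keep-alive
		// connections to the Consul agent open, which is how fds accumulate
		// across repeated reloads.
		d.transport.CloseIdleConnections()
	}
	d.transport = newTransport
	d.client = &http.Client{Transport: newTransport}
}
```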