prometheus: Data corruption using Prometheus Docker v2.0.0-alpha.2 image on NFS
What did you do?
I started a Prometheus task in a Docker Swarm cluster with the following config:
prometheus:
  image: prom/prometheus:v2.0.0-alpha.2
  ports:
    - "9090:9090"
  networks:
    - monitoring
  volumes:
    - prometheus:/prometheus
  command: -config.file=/run/secrets/prometheus.yml -web.external-url=http://prometheus2.${CLUSTER_DOMAIN}
  secrets:
    - source: "monitoring_prometheus.yml"
      target: "prometheus.yml"
      uid: "0"
      gid: "0"
      mode: 0400
    - source: "monitoring_alert.rules_nodes"
      target: "alert.rules_nodes"
      uid: "0"
      gid: "0"
      mode: 0400
    - source: "monitoring_alert.rules_service-groups"
      target: "alert.rules_service-groups"
      uid: "0"
      gid: "0"
      mode: 0400
    - source: "monitoring_alert.rules_tasks"
      target: "alert.rules_tasks"
      uid: "0"
      gid: "0"
      mode: 0400
  deploy:
    mode: replicated
    replicas: 1
    resources:
      limits:
        cpus: '0.50'
        memory: 1024M
      reservations:
        cpus: '0.50'
        memory: 128M
These “secrets” are just the configuration file and alert rules used by Prometheus.
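The top-level `secrets:` declaration of the stack file isn't shown in the issue. Assuming the secrets were created beforehand (for example with `docker secret create`), the declaration would look roughly like this sketch; the `external: true` form is an assumption, only the secret names come from the config above:

```yaml
# Hypothetical top-level declaration; the real stack file is not included in the issue.
# Assumes the secrets already exist in the Swarm, e.g. created via `docker secret create`.
secrets:
  monitoring_prometheus.yml:
    external: true
  monitoring_alert.rules_nodes:
    external: true
  monitoring_alert.rules_service-groups:
    external: true
  monitoring_alert.rules_tasks:
    external: true
```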
What did you expect to see?
I expected to see it working. 😇
What did you see instead? Under which circumstances?
I noticed that the service was not running (since it was running as a “beta” service, I didn’t have any monitoring on it), and when checking the logs I saw:
"time="2017-06-04T22:15:10Z" level=error msg="Opening storage failed: read meta information data/01BHMHDEMCSNG02KJKPC8M5YH0: open data/01BHMHDEMCSNG02KJKPC8M5YH0/meta.json: no such file or directory" source="main.go:89""
Environment
It’s running in a Swarm cluster on AWS in eu-west-1, across 3 different AZs.
It’s running as a swarm service with 1 task.
The data is stored on an EFS filesystem using the rexray/efs plugin.
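The stack file above references a `prometheus` volume but doesn't show its definition. Assuming it is declared with the rexray/efs driver as described, the top-level declaration would look roughly like this sketch (the actual definition is not included in the issue):

```yaml
# Hypothetical volume declaration, not taken from the issue.
# Assumes the rexray/efs Docker volume plugin is installed on the Swarm nodes.
volumes:
  prometheus:
    driver: rexray/efs
```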
- System information:
Linux 4.10.0-21-generic x86_64
- Prometheus version:
prometheus, version 2.0.0-alpha.2 (branch: master, revision: ab0ce4a8d9858956b37b545cfb84bb4edb5d7776)
build user: root@fda0efffe2cf
build date: 20170524-15:34:23
go version: go1.8.1
- Prometheus configuration file:
global:
  scrape_interval: 30s
  evaluation_interval: 30s

  labels:
    cluster: swarm
    replica: "1"

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'prometheus'

rule_files:
  - "alert.rules_nodes"
  - "alert.rules_tasks"
  - "alert.rules_service-groups"

scrape_configs:
  - job_name: 'prometheus'
    dns_sd_configs:
      - names:
          - 'tasks.prometheus'
        type: 'A'
        port: 9090

  - job_name: 'cadvisor'
    dns_sd_configs:
      - names:
          - 'tasks.cadvisor'
        type: 'A'
        port: 8080

  - job_name: 'node-exporter'
    dns_sd_configs:
      - names:
          - 'tasks.node-exporter'
        type: 'A'
        port: 9100

  - job_name: 'docker-exporter'
    dns_sd_configs:
      - names:
          - 'tasks.docker-exporter'
        type: 'A'
        port: 4999
This means I’m using Swarm’s built-in DNS service discovery to autodiscover the task endpoints that Prometheus scrapes. It’s something I’ve been doing with previous versions.
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Reactions: 1
- Comments: 27 (10 by maintainers)
We support working POSIX filesystems, and recommend they be local for reliability and performance.
NFS is not known for being a working POSIX filesystem.
Does the same happen without EFS?
We strongly recommend not using NFS or other networked filesystems.
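Following that recommendation, one way to move the TSDB off EFS would be to pin the single Prometheus replica to one node and back the volume with local storage instead of the rexray/efs driver. A rough sketch under those assumptions; the node hostname and the plain `local` driver are illustrative, not taken from this issue:

```yaml
# Illustrative alternative: keep the TSDB on node-local storage instead of EFS/NFS.
# The placement hostname and the "local" volume driver are assumptions for this sketch.
prometheus:
  image: prom/prometheus:v2.0.0-alpha.2
  volumes:
    - prometheus:/prometheus
  deploy:
    mode: replicated
    replicas: 1
    placement:
      constraints:
        - node.hostname == monitoring-node-1   # pin to the node that holds the local volume

volumes:
  prometheus:
    driver: local
```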