prometheus: Data corruption using Prometheus Docker v2.0.0-alpha.2 image on NFS

What did you do? I started a Prometheus task in a Docker Swarm cluster with the following config:

prometheus:
    image: prom/prometheus:v2.0.0-alpha.2
    ports:
      - "9090:9090"
    networks:
      - monitoring
    volumes:
      - prometheus:/prometheus
    command: -config.file=/run/secrets/prometheus.yml -web.external-url=http://prometheus2.${CLUSTER_DOMAIN}
    secrets:
      - source: "monitoring_prometheus.yml"
        target: "prometheus.yml"
        uid: "0"
        gid: "0"
        mode: 0400
      - source: "monitoring_alert.rules_nodes"
        target: "alert.rules_nodes"
        uid: "0"
        gid: "0"
        mode: 0400
      - source: "monitoring_alert.rules_service-groups"
        target: "alert.rules_service-groups"
        uid: "0"
        gid: "0"
        mode: 0400
      - source: "monitoring_alert.rules_tasks"
        target: "alert.rules_tasks"
        uid: "0"
        gid: "0"
        mode: 0400
    deploy:
      mode: replicated
      replicas: 1
      resources:
        limits:
          cpus: '0.50'
          memory: 1024M
        reservations:
          cpus: '0.50'
          memory: 128M

These “secrets” are just the Prometheus configuration file and the alert rule files used by Prometheus.
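For context, those secrets would either be created with docker secret create or declared at the top level of the stack file. A minimal sketch of the latter (the secret names match the service definition above; the local file: paths are assumptions for illustration only):

secrets:
  monitoring_prometheus.yml:
    file: ./prometheus.yml                      # assumed local path
  monitoring_alert.rules_nodes:
    file: ./alert.rules_nodes                   # assumed local path
  monitoring_alert.rules_service-groups:
    file: ./alert.rules_service-groups          # assumed local path
  monitoring_alert.rules_tasks:
    file: ./alert.rules_tasks                   # assumed local path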

What did you expect to see?

I expected to see it working. 😇

What did you see instead? Under which circumstances?

I saw that the service was not running (since it was running as a “beta” service, I didn’t have any monitoring on it), and when I checked the logs I found:

"time="2017-06-04T22:15:10Z" level=error msg="Opening storage failed: read meta information data/01BHMHDEMCSNG02KJKPC8M5YH0: open data/01BHMHDEMCSNG02KJKPC8M5YH0/meta.json: no such file or directory" source="main.go:89""

Environment

It’s running in a Swarm cluster in AWS (eu-west-1, across 3 different AZs), as a Swarm service with 1 task. The data is stored on an EFS filesystem using the rexray/efs plugin.
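For reference, the prometheus volume referenced in the stack file is backed by EFS roughly like this (a sketch, assuming the rexray/efs managed plugin is already installed on the nodes; driver options are omitted):

volumes:
  prometheus:
    driver: rexray/efs     # volume is created on EFS and mounted over NFS on whichever node runs the task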

  • System information:

Linux 4.10.0-21-generic x86_64

  • Prometheus version:
prometheus, version 2.0.0-alpha.2 (branch: master, revision: ab0ce4a8d9858956b37b545cfb84bb4edb5d7776)
  build user:       root@fda0efffe2cf
  build date:       20170524-15:34:23
  go version:       go1.8.1
  • Prometheus configuration file:
global:
  scrape_interval:     30s
  evaluation_interval: 30s

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'prometheus'
    cluster: swarm
    replica: "1"

rule_files:
  - "alert.rules_nodes"
  - "alert.rules_tasks"
  - "alert.rules_service-groups"

scrape_configs:
  - job_name: 'prometheus'
    dns_sd_configs:
    - names:
      - 'tasks.prometheus'
      type: 'A'
      port: 9090

  - job_name: 'cadvisor'
    dns_sd_configs:
    - names:
      - 'tasks.cadvisor'
      type: 'A'
      port: 8080

  - job_name: 'node-exporter'
    dns_sd_configs:
    - names:
      - 'tasks.node-exporter'
      type: 'A'
      port: 9100

  - job_name: 'docker-exporter'
    dns_sd_configs:
    - names:
      - 'tasks.docker-exporter'
      type: 'A'
      port: 4999

This means I’m using Swarm’s built-in DNS service discovery to autodiscover the task endpoints that Prometheus scrapes. It’s something I’ve been doing with previous Prometheus versions without problems.
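For anyone unfamiliar with the pattern: Swarm’s embedded DNS returns one A record per running task for the name tasks.&lt;service&gt;, so each task becomes its own scrape target. A generic sketch of the jobs above (the job name and port here are hypothetical):

  - job_name: 'some-exporter'          # hypothetical job name
    dns_sd_configs:
      - names:
          - 'tasks.some-exporter'      # resolved by Swarm's embedded DNS: one A record per running task
        type: 'A'                      # plain A-record lookup, no SRV records needed
        port: 9999                     # port must be fixed, since A records carry no port information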

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 1
  • Comments: 27 (10 by maintainers)

Most upvoted comments

We support working POSIX filesystems, and recommend they be local for reliability and performance.

NFS is not known for being a working POSIX filesystem.

Does the same happen without EFS?

We strongly recommend not using NFS or other networked filesystems.
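One way to follow that recommendation in a Swarm stack is to pin the single replica to a specific node and keep /prometheus on a local volume instead of EFS. A sketch only (the node hostname and volume name are hypothetical, not from this thread):

prometheus:
    image: prom/prometheus:v2.0.0-alpha.2
    volumes:
      - prometheus_local:/prometheus        # local volume instead of the EFS-backed one
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints:
          - node.hostname == monitoring-1   # hypothetical node name; keeps the TSDB on one host

volumes:
  prometheus_local:
    driver: local                           # data stays on that node's local disk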