prometheus: Incorrect very large values for time series

What did you do?

I am running a setup with two Prometheus instances in a cluster that remote write to a Thanos receiver setup. The Prometheus instances run in agent mode and use a persistent disk for the WAL. At random times, one of the Prometheus instances starts sending absurd values that are not actually possible.
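
For reference, each instance is started in agent mode roughly as follows; the flags are the standard agent-mode flags, while the config and data paths are illustrative rather than the exact ones from my deployment:

    prometheus \
      --enable-feature=agent \
      --config.file=/etc/prometheus/prometheus.yml \
      --storage.agent.path=/prometheus/data-agent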

What did you expect to see?

I expected to see normal, reasonable values that are actually possible.

What did you see instead? Under which circumstances?

It mostly seems to affect the Prometheus meta metrics, but it can also happen to other time series. At random, a Prometheus instance will enter a state where it starts sending unreasonable metric values. I have drilled down and verified that the issue is not on the receiver end but on the writer side, and it only ever affects one of the Prometheus instances.

Here is an example of what happens. The graph below was created from the query max(prometheus_remote_storage_shards) by (tenant_id).

[Graph of max(prometheus_remote_storage_shards) by (tenant_id) showing an implausibly large spike]

The example shows an issue with a meta metric, but I have observed this behavior with other scraped metrics as well; for example, the node count in a cluster is suddenly reported as 5k instead of the actual value of 10.
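
To check which writer the bad samples come from, a comparison along these lines can be used (this assumes the prometheus_replica external label from the configuration below is preserved on the receiver side):

    max(prometheus_remote_storage_shards) by (tenant_id, prometheus_replica)

Splitting by replica like this matches the observation above that only one of the two instances is affected at any given time.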

Environment

  • System information:

Linux 5.4.0-1067-azure x86_64

  • Prometheus version:

prometheus, version 2.33.5 (branch: HEAD, revision: 03831554a51946f28c7cdc6be7282c687092327b)
  build user:       root@773960a14680
  build date:       20220308-16:57:09
  go version:       go1.17.8
  platform:         linux/amd64

  • Alertmanager version:

N/A

  • Prometheus configuration file:

global:
  scrape_interval: 30s
  scrape_timeout: 10s
  evaluation_interval: 30s
  external_labels:
    cluster_name: <cluster-name>
    environment: <environment>
    prometheus: <prometheus>
    prometheus_replica: <prometheus-replica>
scrape_configs: <configs>
remote_write:
- url: <url>
  remote_timeout: 30s
  headers:
    THANOS-TENANT: <tenant>
  name: thanos
  tls_config:
    cert_file: /mnt/tls/tls.crt
    key_file: /mnt/tls/tls.key
    insecure_skip_verify: false
  follow_redirects: true
  queue_config:
    capacity: 3000
    max_shards: 100
    min_shards: 1
    max_samples_per_send: 1000
    batch_send_deadline: 5s
    min_backoff: 30ms
    max_backoff: 5s
  metadata_config:
    send: true
    send_interval: 1m
    max_samples_per_send: 500

  • Alertmanager configuration file:

N/A

  • Logs:

N/A

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 25 (6 by maintainers)

Most upvoted comments

I’ll do a comparison of Grafana Agent code vs Prometheus Agent code today to see what’s going on.

As an aside: ideally, long term, Grafana Agent will use Prometheus Agent code directly and won’t get out of sync with fixes. That’s not currently the case (we need to work on getting #10231 merged before we can do that), so the Grafana Agent team is currently a bit slow on catching issues like these.

Please reopen if 2.35 does not fix this issue.

Yes, 2.35 is being released now.

@ntimo looks like they are cutting a new release right now.

I have the same issue. After trying multiple Prometheus versions, I found that it works fine without agent mode.