prometheus: Incorrect very large values for time series

What did you do?

I am running a setup with two Prometheus instances in a cluster that remote write to a Thanos receiver setup. The Prometheus instances run in agent mode and use a persistent disk for the WAL. At random times, one of the Prometheus instances starts sending absurd values that are not actually possible.
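
For reference, each instance is started in agent mode roughly as follows; the flags are the standard agent-mode flags, while the config and data paths are illustrative rather than the exact ones from my deployment:

    prometheus \
      --enable-feature=agent \
      --config.file=/etc/prometheus/prometheus.yml \
      --storage.agent.path=/prometheus/data-agent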

What did you expect to see?

I expected to see normal, reasonable values that are actually possible.

What did you see instead? Under which circumstances?

It mostly seems to affect the Prometheus meta metrics, but it can also happen to other time series. At random, a Prometheus instance will enter a state where it starts sending unreasonable metric values. I have drilled down and verified that the issue is not on the receiver end but on the writer side, and it only ever affects one of the Prometheus instances.

Here is an example of what happens. The graph below was created from the query max(prometheus_remote_storage_shards) by (tenant_id).

[Graph of max(prometheus_remote_storage_shards) by (tenant_id) showing an implausibly large spike]

The example shows an issue with a meta metric, but I have observed this behavior with other scraped metrics as well; for example, the node count in a cluster is suddenly reported as 5k instead of the actual value of 10.
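
To check which writer the bad samples come from, a comparison along these lines can be used (this assumes the prometheus_replica external label from the configuration below is preserved on the receiver side):

    max(prometheus_remote_storage_shards) by (tenant_id, prometheus_replica)

Splitting by replica like this matches the observation above that only one of the two instances is affected at any given time.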

Environment

  • System information:

Linux 5.4.0-1067-azure x86_64

  • Prometheus version:

prometheus, version 2.33.5 (branch: HEAD, revision: 03831554a51946f28c7cdc6be7282c687092327b)
  build user:       root@773960a14680
  build date:       20220308-16:57:09
  go version:       go1.17.8
  platform:         linux/amd64

  • Alertmanager version:

N/A

  • Prometheus configuration file:

global:
  scrape_interval: 30s
  scrape_timeout: 10s
  evaluation_interval: 30s
  external_labels:
    cluster_name: <cluster-name>
    environment: <environment>
    prometheus: <prometheus>
    prometheus_replica: <prometheus-replica>
scrape_configs: <configs>
remote_write:
- url: <url>
  remote_timeout: 30s
  headers:
    THANOS-TENANT: <tenant>
  name: thanos
  tls_config:
    cert_file: /mnt/tls/tls.crt
    key_file: /mnt/tls/tls.key
    insecure_skip_verify: false
  follow_redirects: true
  queue_config:
    capacity: 3000
    max_shards: 100
    min_shards: 1
    max_samples_per_send: 1000
    batch_send_deadline: 5s
    min_backoff: 30ms
    max_backoff: 5s
  metadata_config:
    send: true
    send_interval: 1m
    max_samples_per_send: 500

  • Alertmanager configuration file:

N/A

  • Logs:

N/A

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 25 (6 by maintainers)

Most upvoted comments

I’ll do a comparison of Grafana Agent code vs Prometheus Agent code today to see what’s going on.

As an aside: ideally, long term, Grafana Agent will use Prometheus Agent code directly and won’t get out of sync with fixes. That’s not currently the case (we need to work on getting #10231 merged before we can do that), so the Grafana Agent team is currently a bit slow on catching issues like these.

Please reopen if 2.35 does not fix this issue.

Yes, 2.35 is being released now.

@ntimo looks like they are cutting a new release right now.

I have the same issue. After trying multiple Prometheus versions, I found that it works fine without agent mode.