prometheus: Incorrect very large values for time series
What did you do?
I am running a setup with two Prometheus instances in a cluster that remote write to a Thanos receiver. The Prometheus instances run in agent mode and each have a disk to persist the WAL. At random times, one of the Prometheus instances starts sending absurd values that cannot be correct.
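For context, a minimal sketch of how such an agent-mode instance is typically launched; the flags are standard Prometheus agent-mode flags, while the config and storage paths are placeholders rather than the exact paths from this setup:
# Agent mode is enabled via a feature flag; the WAL is kept on the
# mounted disk so it survives restarts (paths are placeholders).
prometheus \
  --enable-feature=agent \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.agent.path=/data/agent-wal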
What did you expect to see?
I expected to see normal, plausible values.
What did you see instead? Under which circumstances?
It mostly affects the Prometheus meta metrics, but it can also happen to other time series. At random, a Prometheus instance will enter a state where it starts sending unreasonable metric values. I have drilled down and verified that the issue is not on the receiver end but on the writer side. It only ever affects one of the Prometheus instances.
Here is an example of what happens. The graph was created from the following query:
max (prometheus_remote_storage_shards) by (tenant_id)
[graph omitted]
The example shows an issue with a meta metric, but I have observed this behavior with other scraped metrics as well: for example, all of a sudden the node count in a cluster is reported as 5k instead of 10, which is the actual value.
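To narrow this down to a single writer, the same metric can also be broken out by the prometheus_replica external label from the configuration below; this assumes the default remote-write behaviour, where external labels are attached to outgoing samples:
# Per-tenant view, as used for the graph above
max (prometheus_remote_storage_shards) by (tenant_id)

# Per-replica view, to identify which writer is sending the bad values
max (prometheus_remote_storage_shards) by (tenant_id, prometheus_replica)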
Environment
- System information:
Linux 5.4.0-1067-azure x86_64
- Prometheus version:
prometheus, version 2.33.5 (branch: HEAD, revision: 03831554a51946f28c7cdc6be7282c687092327b)
  build user:       root@773960a14680
  build date:       20220308-16:57:09
  go version:       go1.17.8
  platform:         linux/amd64
- Alertmanager version:
N/A
- Prometheus configuration file:
global:
  scrape_interval: 30s
  scrape_timeout: 10s
  evaluation_interval: 30s
  external_labels:
    cluster_name: <cluster-name>
    environment: <environment>
    prometheus: <prometheus>
    prometheus_replica: <prometheus-replica>
scrape_configs: <configs>
remote_write:
  - url: <url>
    remote_timeout: 30s
    headers:
      THANOS-TENANT: <tenant>
    name: thanos
    tls_config:
      cert_file: /mnt/tls/tls.crt
      key_file: /mnt/tls/tls.key
      insecure_skip_verify: false
    follow_redirects: true
    queue_config:
      capacity: 3000
      max_shards: 100
      min_shards: 1
      max_samples_per_send: 1000
      batch_send_deadline: 5s
      min_backoff: 30ms
      max_backoff: 5s
    metadata_config:
      send: true
      send_interval: 1m
      max_samples_per_send: 500
- Alertmanager configuration file:
N/A
- Logs:
N/A
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 25 (6 by maintainers)
I’ll do a comparison of Grafana Agent code vs Prometheus Agent code today to see what’s going on.
As an aside: ideally, long term, Grafana Agent will use the Prometheus Agent code directly and won’t get out of sync with fixes. That’s not currently the case (we need to get #10231 merged before we can do that), so the Grafana Agent team is currently a bit slower at catching issues like these.
Please reopen if 2.35 does not fix this issue.
Yes, 2.35 is being released now.
@ntimo looks like they are cutting a new release right now.
I have the same issue. After trying multiple Prometheus versions, I found that it works fine without using agent mode.