thanos: Thanos Receive: "Out of bounds metric"

Hello!

We run a handful of “edge” sites in a hub-and-spoke topology and feed metrics from remote promethei to a centralized receiver hash-ring. We quite often see issues relating to “Out of Bounds” metrics in the receiver: the receiver begins returning 500s to Prometheus remote write, and receiver latency climbs, impacting other tenants. We’re unsure whether the samples are too far in the future or in the past, but we assume the past, as it’s likely we experienced network or latency issues shipping remote metrics to the cloud. It seems this is related to TSDB and Prometheus WAL limitations? https://github.com/prometheus/prometheus/issues/9607

The shipping Prometheus itself is set up with the prometheus-operator and has been configured with disableCompaction: true to ensure it doesn’t also ship old blocks to Thanos (even though we’d LOVE (read: need) to have them…).
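For context, the relevant bits of that Prometheus custom resource look roughly like this (field names per the prometheus-operator Prometheus CRD; the name/namespace match the prometheus="monitoring/edge" label and the remote-write URL matches the Prometheus logs below, everything else is illustrative):

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: edge
  namespace: monitoring
spec:
  disableCompaction: true   # as described above: stop Prometheus from producing/shipping old, compacted blocks
  remoteWrite:
    - url: http://cloud-gateway:10908/api/v1/receive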

Worse: when Prometheus fails to remote write, it just keeps retrying, getting into a loop with the receiver until the block is discarded. This is surely not intended, and there probably needs to be a feature/flag added on one side of the system to prevent it (data caps are a real thing, and this quickly runs into 100s of GB if left unchecked).
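As far as we can tell there’s no flag in this Prometheus version to stop retrying 5xx responses outright, but the remote-write queue can at least be throttled so the retry loop doesn’t eat through a data cap; a sketch of the relevant queueConfig fields in the same custom resource (values purely illustrative):

  remoteWrite:
    - url: http://cloud-gateway:10908/api/v1/receive
      queueConfig:
        maxShards: 10            # caps parallelism, and with it the retry bandwidth
        maxSamplesPerSend: 500
        minBackoff: 30ms
        maxBackoff: 5m           # back off harder when the receiver keeps returning 500s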

Related and similar: https://github.com/thanos-io/thanos/issues/4767. However, we certainly have unique labels on our series, and we rarely get 409 conflicts; what we get constantly is 500s.

Thanos Version: 0.22.0
Prometheus Version: 2.30.3

Object Storage Provider: GCP

What happened: Thanos Receive will occasionally get into a fit over the samples a Prometheus sends up to it for storage, throwing Out of Bounds errors and causing cascading issues for Thanos metric ingestion across the entire system.

What you expected to happen: Thanos Receive accepts the samples and stores them in the bucket without issue.

How to reproduce it (as minimally and precisely as possible): We see this most with shipments of kube-state-metrics; however, you can get any Prometheus into this state with a receiver if you wait long enough.

Full logs to relevant components:

Receiver Logs:

component=receive component=receive-writer msg="Error on ingesting out-of-order samples" numDropped=47
component=receive component=receive-writer msg="Out of bounds metric" lset="{__name__=\"apiserver_request_duration_seconds_bucket\", component=\"apiserver\", endpoint=\"https-metrics\", env=\"prd\", group=\"admissionregistration.k8s.io\", instance=\"172.22.3.33:10250\", job=\"kubelet\", le=\"0.45\", metrics_path=\"/metrics\", namespace=\"kube-system\", prometheus=\"monitoring/edge\", prometheus_replica=\"prometheus-edge-0\", resource=\"validatingwebhookconfigurations\", scope=\"cluster\", service=\"kubelet\", verb=\"WATCH\", version=\"v1\"}" sample="unsupported value type"

Prometheus Logs:

component=remote level=warn remote_name=319b4e url=http://cloud-gateway:10908/api/v1/receive msg="Failed to send batch, retrying" err="Post \"http://cloud-gateway:10908/api/v1/receive\": context deadline exceeded"

Workaround: Kill all reporting promethei and their relevant StatefulSets, and then Receive is happy again. Hopefully this helps illustrate how this isn’t scalable given (N) tenants 😦

[Screenshot attached: Screen Shot 2021-11-03 at 11 28 25 AM]

About this issue

  • State: open
  • Created 3 years ago
  • Reactions: 5
  • Comments: 43 (24 by maintainers)

Most upvoted comments

Hey folks - I wanted to chime in here with some relevant observations. Although I could write at length about this topic, I will keep it brief.

Historically, when we did deployments on our receive cluster (specifically the ingesters), we would see all sorts of errors (500/503/409). The logs revealed several types of ingestion errors, from “Out of Bounds” to “Already Exists”. However, I noticed that when a deployment began we’d also see errors on the routers claiming that quorum wasn’t reached. This always perplexed me: our replication factor was 3, rollouts on the ingester StatefulSets happened one pod at a time, and appropriate probes (startup, readiness and liveness) prevented the rollout from progressing until the previous ingester was ready to receive traffic. So theoretically quorum should have been reached, but it wasn’t.

I then considered the possibility that there is a race condition and updated the ingester StatefulSets with the spec/minReadySeconds: 120 property (read more here), which adds a delay between consecutive pod replacements in a StatefulSet. After making this change, I am no longer seeing ANY errors during rollouts! Quorum is maintained (validated by enabling debug logging on the routers) and the thanos receive routers respond to all remote write requests with status code 200. I am still investigating the cause of this race condition, but ever since we added this buffer between rollouts our ingest pipeline has been operating smoothly and our on-call schedule is silent 😄
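For anyone wanting to try the same mitigation, the change is a single field on the ingester StatefulSet (minReadySeconds on StatefulSets requires a reasonably recent Kubernetes; the object name below is illustrative):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: thanos-receive-ingestor
spec:
  minReadySeconds: 120   # wait 120s after a new pod reports Ready before rolling the next one
  # ...rest of the spec unchanged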

Hit this issue quite a lot recently.

Short term, we should check whether the receiver’s response is correct. OutOfBounds is unrecoverable; do we also tell this back to Prometheus? That would at least resolve the infinite loop of retries.

Also, this happens constantly when rolling out updates to receivers: as one receiver is taken down to be replaced, the hash ring can no longer contact that member of the ring:

level=debug ts=2021-11-05T19:31:59.654455194Z caller=handler.go:351 component=receive component=receive-handler msg="failed to handle request" err="backing off forward request for endpoint thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901: target not available"
level=debug ts=2021-11-05T19:31:59.654733445Z caller=handler.go:351 component=receive component=receive-handler msg="failed to handle request" err="backing off forward request for endpoint thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901: target not available"
level=debug ts=2021-11-05T19:31:59.654990715Z caller=handler.go:351 component=receive component=receive-handler msg="failed to handle request" err="backing off forward request for endpoint thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901: target not available"
level=debug ts=2021-11-05T19:31:59.655249505Z caller=handler.go:351 component=receive component=receive-handler msg="failed to handle request" err="backing off forward request for endpoint thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901: target not available"
level=debug ts=2021-11-05T19:31:59.655500684Z caller=handler.go:351 component=receive component=receive-handler msg="failed to handle request" err="backing off forward request for endpoint thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901: target not available"
level=debug ts=2021-11-05T19:31:59.655757974Z caller=handler.go:351 component=receive component=receive-handler msg="failed to handle request" err="backing off forward request for endpoint thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901: target not available"
level=debug ts=2021-11-05T19:31:59.676009613Z caller=writer.go:88 component=receive component=receive-writer msg="Out of order sample" lset="{__name__=\"kube_deployment_status_observed_generation\", cluster_id=\"hardware-cluster\", cluster_type=\"edge\", container=\"kube-state-metrics\", deployment=\"rfs-edge-redis\", endpoint=\"http-metrics\", env=\"stg\", exported_namespace=\"default\", instance=\"10.42.0.254:8080\", job=\"kube-state-metrics\", namespace=\"monitoring\", pod=\"kube-state-metrics-67998f59d4-qfz6n\", prometheus=\"monitoring/edge\", prometheus_replica=\"prometheus-edge-1\", service=\"kube-state-metrics\", site_id=\"hardware-cluster\", version=\"v2\"}" sample="unsupported value type"
level=debug ts=2021-11-05T19:31:59.676375092Z caller=writer.go:88 component=receive component=receive-writer msg="Out of order sample" lset="{__name__=\"kube_endpoint_labels\", cluster_id=\"hardware-cluster\", cluster_type=\"edge\", container=\"kube-state-metrics\", endpoint=\"http-metrics\", env=\"stg\", exported_endpoint=\"kubernetes\", exported_namespace=\"default\", instance=\"10.42.0.254:8080\", job=\"kube-state-metrics\", namespace=\"monitoring\", pod=\"kube-state-metrics-67998f59d4-qfz6n\", prometheus=\"monitoring/edge\", prometheus_replica=\"prometheus-edge-1\", service=\"kube-state-metrics\", site_id=\"hardware-cluster\", version=\"v2\"}" sample="unsupported value type"
level=debug ts=2021-11-05T19:31:59.676560331Z caller=writer.go:88 component=receive component=receive-writer msg="Out of order sample" lset="{__name__=\"kube_pod_owner\", cluster_id=\"hardware-cluster\", cluster_type=\"edge\", container=\"kube-state-metrics\", endpoint=\"http-metrics\", env=\"stg\", exported_namespace=\"vault-secrets-operator\", exported_pod=\"vault-secrets-operator-59769bc489-m4gwf\", instance=\"10.42.0.254:8080\", job=\"kube-state-metrics\", namespace=\"monitoring\", owner_is_controller=\"true\", owner_kind=\"ReplicaSet\", owner_name=\"vault-secrets-operator-59769bc489\", pod=\"kube-state-metrics-67998f59d4-qfz6n\", prometheus=\"monitoring/edge\", prometheus_replica=\"prometheus-edge-1\", service=\"kube-state-metrics\", site_id=\"hardware-cluster\", uid=\"777a319a-a2a5-4c54-929d-fb1a10463262\", version=\"v2\"}" sample="unsupported value type"
level=debug ts=2021-11-05T19:31:59.676754415Z caller=writer.go:88 component=receive component=receive-writer msg="Out of order sample" lset="{__name__=\"kube_pod_status_scheduled\", cluster_id=\"hardware-cluster\", cluster_type=\"edge\", condition=\"true\", container=\"kube-state-metrics\", endpoint=\"http-metrics\", env=\"stg\", exported_namespace=\"default\", exported_pod=\"server-5c7dfb95b6-n8pq5\", instance=\"10.42.0.254:8080\", job=\"kube-state-metrics\", namespace=\"monitoring\", pod=\"kube-state-metrics-67998f59d4-qfz6n\", prometheus=\"monitoring/edge\", prometheus_replica=\"prometheus-edge-1\", service=\"kube-state-metrics\", site_id=\"hardware-cluster\", uid=\"dc10dc19-ed31-476b-9204-f2b55339fe68\", version=\"v2\"}" sample="unsupported value type"
level=debug ts=2021-11-05T19:31:59.67690643Z caller=writer.go:88 component=receive component=receive-writer msg="Out of order sample" lset="{__name__=\"kube_replicaset_spec_replicas\", cluster_id=\"hardware-cluster\", cluster_type=\"edge\", container=\"kube-state-metrics\", endpoint=\"http-metrics\", env=\"stg\", exported_namespace=\"default\", instance=\"10.42.0.254:8080\", job=\"kube-state-metrics\", namespace=\"monitoring\", pod=\"kube-state-metrics-67998f59d4-qfz6n\", prometheus=\"monitoring/edge\", prometheus_replica=\"prometheus-edge-1\", replicaset=\"other-server-7f65547c6\", service=\"kube-state-metrics\", site_id=\"hardware-cluster\", version=\"v2\"}" sample="unsupported value type"
level=debug ts=2021-11-05T19:31:59.677049925Z caller=writer.go:88 component=receive component=receive-writer msg="Out of order sample" lset="{__name__=\"kube_secret_annotations\", cluster_id=\"hardware-cluster\", cluster_type=\"edge\", container=\"kube-state-metrics\", endpoint=\"http-metrics\", env=\"stg\", exported_namespace=\"monitoring\", instance=\"10.42.0.254:8080\", job=\"kube-state-metrics\", namespace=\"monitoring\", pod=\"kube-state-metrics-67998f59d4-qfz6n\", prometheus=\"monitoring/edge\", prometheus_replica=\"prometheus-edge-1\", secret=\"skupper-local-ca\", service=\"kube-state-metrics\", site_id=\"hardware-cluster\", version=\"v2\"}" sample="unsupported value type"
level=warn ts=2021-11-05T19:31:59.67757962Z caller=writer.go:131 component=receive component=receive-writer msg="Error on ingesting out-of-order samples" numDropped=6
level=debug ts=2021-11-05T19:31:59.677975065Z caller=handler.go:491 component=receive component=receive-handler
level=debug ts=2021-11-05T19:31:59.680275619Z caller=handler.go:656 component=receive component=receive-handler msg="failed to handle request" err="store locally for endpoint thanos-receive-1.thanos-receive.monitoring

That’s correct, but in that PR we revamped the way we calculate quorum. I suspect it should help, but I am not 100% sure. Do you have a way to reproduce this issue somehow?

Hey folks - just chiming in here. I am a colleague of @jmichalek132 and can confirm that we are seeing this behavior in our environment even after the move to running thanos receive in router and ingestor mode. The frequency is very irregular, and the behavior seems to affect only the Prometheus instances in a particular environment, while instances in other environments writing to the same endpoint see little to no degradation in their remote write.

We typically see increased latency and timeouts on the router. However, the ingestors do not exhibit any substantial latency. This was also validated by forking Thanos and adding more instrumentation (spans) around the TSDB write.

If you have any pointers as to what is worth investigating next, we are all ears.

We only have that for things in main: https://quay.io/repository/thanos/thanos?tab=tags. I’ve pinged to get a review, so perhaps we can move this forward faster.

I could reproduce this by fiddling around with my local time, causing:

level=debug name=receive ts=2022-02-05T22:00:12.323668Z caller=writer.go:94 component=receive component=receive-writer msg="Out of bounds metric" lset="{__name__=\"prometheus_tsdb_head_gc_duration_seconds_sum\", instance=\"localhost:9090\", job=\"prometheus\", prometheus=\"prom0\", replica=\"1\"}" sample="unsupported value type"

What I could not reproduce, however, is the behaviour from the Prometheus side. Perhaps this is due to the fact that I’m using an older Prometheus version, because I’m having some issues installing Prometheus on my M1 lol.

After giving thanos receive double the resources, as well as breaking it down into a router/ingestor split, we still see this issue using the latest release, 0.24.0.

What’s interesting is that it will begin happening with a single tenant (picture below) and then cascade over time to ALL tenants reporting to receive.

If we’re not the only ones having this issue, it seems to be a massive issue that will happen to any receive user over time.

[Screenshot attached: Screen Shot 2022-01-21 at 10 27 51 AM]

Hit this issue quite a lot recently.

We used to hit it too, but since switching to the receive router & ingestor mode we haven’t hit it again. I’m not 100% sure it’s related, but it might be worth trying if you are not using that mode yet. Details on our setup here.
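For reference, the router/ingestor split boils down to roughly this in the receive container args (flag names per the Thanos receive component; endpoints, paths and values here are illustrative, and other flags are omitted):

# Router: knows the hashring and only forwards; it is not a member itself.
- receive
- --remote-write.address=0.0.0.0:19291
- --receive.hashrings-file=/etc/thanos/hashrings.json   # file listing the ingestor endpoints
- --receive.replication-factor=3

# Ingestor: listed in the hashring, owns a local TSDB and uploads blocks to object storage.
- receive
- --grpc-address=0.0.0.0:10901
- --receive.local-endpoint=thanos-receive-ingestor-0.thanos-receive-ingestor:10901
- --tsdb.path=/var/thanos/receive
- --objstore.config-file=/etc/thanos/objstore.yaml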

It can certainly be related if your receive is struggling to keep up.

For this issue, what I really would like is an easy-to-replicate setup for my dev environment. There was a related issue, which I now cannot find anymore. What I stated there is that I believe we do not handle the HTTP code correctly. The short version is that the remote-write protocol basically allows two kinds of error response: 1: “error, try again” and 2: “error, unrecoverable”. I think there is a chance we handle case 2 as case 1, causing basically a DoS until one resets Prometheus again.
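To make those two cases concrete, here is a minimal hypothetical handler (not Thanos’s actual code): Prometheus’s remote-write client retries 5xx responses with backoff but treats other 4xx responses as unrecoverable and drops the batch, so mapping out-of-bounds ingestion errors to a 4xx would break the infinite retry loop:

// Sketch only: a stand-in receive endpoint, not the real Thanos handler.
package main

import (
    "errors"
    "net/http"
)

// Hypothetical sentinel standing in for the TSDB's out-of-bounds error.
var errOutOfBounds = errors.New("out of bounds")

// writeToTSDB is a stub for decoding the remote-write request and appending samples.
func writeToTSDB(_ *http.Request) error { return errOutOfBounds }

func ingest(w http.ResponseWriter, r *http.Request) {
    switch err := writeToTSDB(r); {
    case err == nil:
        w.WriteHeader(http.StatusOK)
    case errors.Is(err, errOutOfBounds):
        // Unrecoverable for this batch: tell Prometheus not to retry it.
        http.Error(w, err.Error(), http.StatusBadRequest)
    default:
        // Transient failure (e.g. quorum not reached): 5xx, Prometheus will retry.
        http.Error(w, err.Error(), http.StatusServiceUnavailable)
    }
}

func main() {
    http.HandleFunc("/api/v1/receive", ingest)
    http.ListenAndServe(":10908", nil)
}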

Exactly: the Prometheus team claims it handles the HTTP code properly and that it’s on the Thanos Receive side: https://github.com/prometheus/prometheus/pull/5649

Going to peel apart our thanos config into the router and ingestor and report back.