thanos: receive: Stops storing data

Thanos, Prometheus and Golang version used:

thanos, version 0.14.0 (branch: master, revision: 70f89d837eebd672926663dd8876035860511f06)
  build user:       circleci@a770acd66205
  build date:       20200812-10:39:09
  go version:       go1.14.2

Object Storage Provider: MinIO

What happened: On one of our internal test systems, thanos receive stops processing incoming data at regular intervals. The following Thanos UI query shows the metric node_cpu_seconds_total for the last week. This metric comes from a Prometheus instance that is monitoring the OCP cluster:

[Screenshot: Thanos UI graph of node_cpu_seconds_total over the last week, showing recurring multi-hour gaps in the data.]

As shown in the image, regular outages of 8 hours or more are occurring. The latest outage occurred on Feb 1, and lasted for 18 hours.

What you expected to happen: Thanos receive processes incoming metric data without error.

How to reproduce it (as minimally and precisely as possible): We are not sure what is causing it. It seems to occur periodically without any user intervention.

Full logs to relevant components:

Here’s what the thanos receive log showed around the time that it resumed accepting metrics:

Logs

level=warn ts=2021-02-01T23:28:47.88926126Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=3384
level=warn ts=2021-02-01T23:28:47.895368012Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=3313
level=warn ts=2021-02-01T23:29:46.425728194Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=2
level=warn ts=2021-02-01T23:29:46.427602034Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=2
level=warn ts=2021-02-01T23:29:46.428611741Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=3
level=warn ts=2021-02-01T23:29:48.662364042Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=2
level=warn ts=2021-02-01T23:29:48.668642017Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=3
level=warn ts=2021-02-01T23:29:48.669927944Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=2
level=warn ts=2021-02-01T23:29:51.384631506Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=2
level=warn ts=2021-02-01T23:29:51.38742069Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=3
level=warn ts=2021-02-01T23:29:51.397757579Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=2
level=warn ts=2021-02-01T23:29:56.5780011Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=2

As shown in the log, thanos receive was dropping essentially everything, then suddenly started accepting metrics again. The following messages were observed two hours later:

Logs

level=info ts=2021-02-02T02:29:39.78375504Z caller=compact.go:494 component=receive component=multi-tsdb tenant=a02d6835-208b-446e-86b3-dfdbc5ca849a msg="write block" mint=1612222126000 maxt=1612224000000 ulid=01EXG95J4TWA4G17SYES7GNJKG duration=11.437024317s
level=info ts=2021-02-02T02:29:40.104937574Z caller=head.go:807 component=receive component=multi-tsdb tenant=a02d6835-208b-446e-86b3-dfdbc5ca849a msg="Head GC completed" duration=223.866704ms
level=info ts=2021-02-02T02:29:40.753078185Z caller=checkpoint.go:96 component=receive component=multi-tsdb tenant=a02d6835-208b-446e-86b3-dfdbc5ca849a msg="Creating checkpoint" from_segment=1118 to_segment=1120 mint=1612224000000
level=info ts=2021-02-02T02:29:41.063104019Z caller=head.go:887 component=receive component=multi-tsdb tenant=a02d6835-208b-446e-86b3-dfdbc5ca849a msg="WAL checkpoint complete" first=1118 last=1120 duration=311.134619ms
level=info ts=2021-02-02T02:30:02.423330728Z caller=shipper.go:333 component=receive component=multi-tsdb tenant=a02d6835-208b-446e-86b3-dfdbc5ca849a msg="upload new block" id=01EXG95J4TWA4G17SYES7GNJKG
level=info ts=2021-02-02T03:01:10.543246542Z caller=compact.go:494 component=receive component=multi-tsdb tenant=a02d6835-208b-446e-86b3-dfdbc5ca849a msg="write block" mint=1612224000000 maxt=1612231200000 ulid=01EXGAZ83MJGJCYRET8PF2E3E8 duration=11.930235283s
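
For context on the "too old or are too far into the future" warnings followed by the sudden recovery: the TSDB head in receive only accepts samples within a window derived from the newest timestamp it has seen, so one batch of far-future samples can push that window forward and make every current sample look too old until wall-clock time catches up, which would match the multi-hour gaps. The following is a minimal, purely illustrative Go sketch of that behaviour, not the actual Prometheus TSDB code; the window size and names are simplified assumptions:

// Simplified model of why a single far-future sample can make receive
// reject everything else for hours. Illustrative only; the real appendable
// window is derived from the head block's max time and block duration.
package main

import (
	"errors"
	"fmt"
	"time"
)

var errOutOfBounds = errors.New("out of bounds")

// head models the ingesting TSDB head block: it tracks the largest
// timestamp seen so far and half of the block range (e.g. 1h for 2h blocks).
type head struct {
	maxTime    time.Time
	halfWindow time.Duration
}

// append accepts a sample only if it is not older than maxTime - halfWindow.
// A far-future sample therefore pushes the cutoff forward and every
// "current" sample becomes too old until wall-clock time catches up.
func (h *head) append(ts time.Time) error {
	if ts.After(h.maxTime) {
		h.maxTime = ts
	}
	if ts.Before(h.maxTime.Add(-h.halfWindow)) {
		return errOutOfBounds
	}
	return nil
}

func main() {
	now := time.Now()
	h := &head{maxTime: now, halfWindow: time.Hour}

	// One misbehaving source sends a sample stamped 18 hours in the future.
	_ = h.append(now.Add(18 * time.Hour))

	// Every well-behaved source is now rejected.
	fmt.Println(h.append(now)) // out of bounds

	// Ingestion only recovers once sample times reach maxTime - halfWindow,
	// i.e. once real time catches up, matching the multi-hour gaps.
	fmt.Println(h.append(now.Add(17 * time.Hour))) // <nil>
}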

Anything else we need to know:

We initially tried to resolve the problem by restarting the thanos receive and thanos receive controller pods, but it didn’t help. We also tried restarting the memcached and store pods, but that had no effect. We then decided to leave the system as-is overnight, and found this morning that it had started to work again.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 24 (8 by maintainers)

Most upvoted comments

I’m running into the same issue that @jfg1701a and @jzangari mentioned.

I have multiple Prometheus instances (in different clusters) using the remote_write mechanism to forward their metrics to a shared receiver. If one of the clusters has a problem with its clock and sends metrics stamped in the future, the receivers start to complain and drop all incoming metrics, regardless of the source cluster (each of them has different externalLabels).

Receiver logs:

level=warn ts=2023-02-27T12:54:57.827885697Z caller=writer.go:188 component=receive component=receive-writer tenant=default-tenant msg="Error on ingesting samples that are too old or are too far into the future" numDropped=1663
level=warn ts=2023-02-27T12:54:57.937124259Z caller=writer.go:188 component=receive component=receive-writer tenant=default-tenant msg="Error on ingesting samples that are too old or are too far into the future" numDropped=1648

The Prometheus instances are receiving 409 responses, as reported:

ts=2023-02-27T13:26:39.223Z caller=dedupe.go:112 component=remote level=error remote_name=4ddd6d url=https://thanos-receiver.edgar-270222.staging.anywhere.navify.com/api/v1/receive msg="non-recoverable error" count=1451 exemplarCount=0 err="server returned HTTP status 409 Conflict: 3 errors: forwarding request to endpoint thanos-receive-ingestor-default-0.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-ingestor-default-0.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: add 524 samples: out of bounds; forwarding request to endpoint thanos-receive-ingestor-default-1.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-ingestor-default-1.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: add 469 samples: out of bounds; forwarding request to endpoint thanos-receive-ingestor-default-2.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-ingestor-default-2.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: add 458 samples: out of bounds"
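
The "non-recoverable error" with "409 Conflict" above means Prometheus drops that remote-write batch instead of retrying it: the sender retries what it considers transient failures (5xx, and 429 depending on configuration) and permanently drops the rest. A rough Go sketch of that classification, as an illustration rather than the actual Prometheus code:

package main

import (
	"fmt"
	"net/http"
)

// recoverable reports whether a remote-write response status would be retried
// (samples kept in the queue) or treated as permanent (batch dropped).
func recoverable(status int) bool {
	switch {
	case status == http.StatusTooManyRequests:
		// 429: typically retried with backoff (configurable in Prometheus).
		return true
	case status >= 500:
		// 5xx: server-side problem, retried.
		return true
	default:
		// Other 4xx, including the 409 Conflict from receive: batch is dropped.
		return false
	}
}

func main() {
	for _, s := range []int{http.StatusConflict, http.StatusTooManyRequests, http.StatusBadGateway} {
		fmt.Printf("HTTP %d recoverable=%v\n", s, recoverable(s))
	}
}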

Is there any plan for thanos receiver to implement logic to prevent such issues from happening?

Currently, if one of the sources has an issue, the rest of the sources are also affected, which makes Thanos not resilient to clock issues.

Thanos version: v0.30.2
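
Regarding the question above about preventing one skewed source from affecting everyone: one option would be to validate timestamps against the receiver's own clock before they reach the TSDB head. The sketch below is purely illustrative and does not correspond to an existing Thanos flag or API; the function name and tolerance are made-up assumptions:

// Illustrative sketch of the kind of guard being asked for: reject samples
// stamped too far ahead of the receiver's own clock before they reach the
// TSDB head, so one clock-skewed source cannot move the appendable window
// for everyone else. Not an existing Thanos feature.
package main

import (
	"fmt"
	"time"
)

// sample is a minimal stand-in for a remote-write sample.
type sample struct {
	ts    time.Time
	value float64
}

// dropTooFarInFuture keeps samples stamped no more than tolerance ahead of
// now and counts the rest.
func dropTooFarInFuture(samples []sample, now time.Time, tolerance time.Duration) (kept []sample, dropped int) {
	limit := now.Add(tolerance)
	for _, s := range samples {
		if s.ts.After(limit) {
			dropped++
			continue
		}
		kept = append(kept, s)
	}
	return kept, dropped
}

func main() {
	now := time.Now()
	in := []sample{
		{ts: now, value: 1},                     // healthy source
		{ts: now.Add(18 * time.Hour), value: 2}, // clock-skewed source
	}
	kept, dropped := dropTooFarInFuture(in, now, 5*time.Minute)
	fmt.Printf("kept=%d dropped=%d\n", len(kept), dropped) // kept=1 dropped=1
}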