thanos: receive: Stops storing data
Thanos, Prometheus and Golang version used:
thanos, version 0.14.0 (branch: master, revision: 70f89d837eebd672926663dd8876035860511f06)
build user: circleci@a770acd66205
build date: 20200812-10:39:09
go version: go1.14.2
Object Storage Provider: MinIO
What happened: On one of our internal test systems, thanos receive stops processing incoming data at regular intervals. The following Thanos UI query shows the metric node_cpu_seconds_total for the last week. The metric comes from a Prometheus instance that is monitoring the OCP cluster:
As shown in the image, regular outages of 8 hours or more are occurring. The latest outage occurred on Feb 1, and lasted for 18 hours.
What you expected to happen: Thanos receive processes incoming metric data without error.
How to reproduce it (as minimally and precisely as possible): We are not sure what is causing it. It seems to occur periodically without any user intervention.
Full logs to relevant components:
Here’s what the thanos receive log showed around the time that it resumed accepting metrics:
level=warn ts=2021-02-01T23:28:47.88926126Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=3384
level=warn ts=2021-02-01T23:28:47.895368012Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=3313
level=warn ts=2021-02-01T23:29:46.425728194Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=2
level=warn ts=2021-02-01T23:29:46.427602034Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=2
level=warn ts=2021-02-01T23:29:46.428611741Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=3
level=warn ts=2021-02-01T23:29:48.662364042Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=2
level=warn ts=2021-02-01T23:29:48.668642017Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=3
level=warn ts=2021-02-01T23:29:48.669927944Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=2
level=warn ts=2021-02-01T23:29:51.384631506Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=2
level=warn ts=2021-02-01T23:29:51.38742069Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=3
level=warn ts=2021-02-01T23:29:51.397757579Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=2
level=warn ts=2021-02-01T23:29:56.5780011Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=2
As shown in the log, thanos receive was dropping essentially all incoming samples, then suddenly started accepting metrics again. The following messages were observed two hours later:
level=info ts=2021-02-02T02:29:39.78375504Z caller=compact.go:494 component=receive component=multi-tsdb tenant=a02d6835-208b-446e-86b3-dfdbc5ca849a msg="write block" mint=1612222126000 maxt=1612224000000 ulid=01EXG95J4TWA4G17SYES7GNJKG duration=11.437024317s
level=info ts=2021-02-02T02:29:40.104937574Z caller=head.go:807 component=receive component=multi-tsdb tenant=a02d6835-208b-446e-86b3-dfdbc5ca849a msg="Head GC completed" duration=223.866704ms
level=info ts=2021-02-02T02:29:40.753078185Z caller=checkpoint.go:96 component=receive component=multi-tsdb tenant=a02d6835-208b-446e-86b3-dfdbc5ca849a msg="Creating checkpoint" from_segment=1118 to_segment=1120 mint=1612224000000
level=info ts=2021-02-02T02:29:41.063104019Z caller=head.go:887 component=receive component=multi-tsdb tenant=a02d6835-208b-446e-86b3-dfdbc5ca849a msg="WAL checkpoint complete" first=1118 last=1120 duration=311.134619ms
level=info ts=2021-02-02T02:30:02.423330728Z caller=shipper.go:333 component=receive component=multi-tsdb tenant=a02d6835-208b-446e-86b3-dfdbc5ca849a msg="upload new block" id=01EXG95J4TWA4G17SYES7GNJKG
level=info ts=2021-02-02T03:01:10.543246542Z caller=compact.go:494 component=receive component=multi-tsdb tenant=a02d6835-208b-446e-86b3-dfdbc5ca849a msg="write block" mint=1612224000000 maxt=1612231200000 ulid=01EXGAZ83MJGJCYRET8PF2E3E8 duration=11.930235283s
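For context, my rough mental model of the "too old or are too far into the future" warnings is sketched below in Go. It is only a toy model (the type and method names are mine, not the actual Prometheus/Thanos source), but it shows how the TSDB head only accepts samples inside a window anchored to the highest timestamp it has seen, so a single sample with a far-future timestamp can push that window past "now" and cause every other sample for the same tenant to be rejected for hours:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrOutOfBounds stands in for the error the TSDB appender returns for
// samples that fall outside the head's appendable window (hypothetical name).
var ErrOutOfBounds = errors.New("out of bounds")

// head is a toy model of a TSDB head block: it only tracks the highest
// timestamp seen so far and the configured block range.
type head struct {
	maxTime    int64 // highest sample timestamp ingested so far (ms)
	chunkRange int64 // block range, 2h by default (ms)
}

// appendableMinValidTime models the lower bound of the appendable window:
// anything older than maxTime - chunkRange/2 is rejected as out of bounds.
func (h *head) appendableMinValidTime() int64 {
	return h.maxTime - h.chunkRange/2
}

// append accepts a sample only if it is inside the appendable window and
// advances maxTime; this is how one far-future sample can shift the window
// past "now" for every other writer of the same tenant.
func (h *head) append(ts int64) error {
	if ts < h.appendableMinValidTime() {
		return ErrOutOfBounds
	}
	if ts > h.maxTime {
		h.maxTime = ts
	}
	return nil
}

func main() {
	const hour = int64(3600 * 1000)
	now := int64(1612222126000) // "now" taken from the mint in the logs above

	h := &head{maxTime: now, chunkRange: 2 * hour}

	// A single sample 18h in the future (e.g. from a host with a broken clock).
	fmt.Println("future sample:", h.append(now+18*hour)) // accepted, maxTime jumps ahead

	// Every well-behaved writer is now rejected until wall-clock time
	// catches up to maxTime - 1h.
	fmt.Println("current sample:", h.append(now)) // out of bounds
}
```

With a 2h block range the window trails the head's max timestamp by roughly one hour, so ingestion would stay broken until wall-clock time catches up with the skewed timestamp, which would roughly match the multi-hour outages we saw.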
Anything else we need to know:
We initially tried to resolve the problem by restarting the thanos receive and thanos receive controller pods, but it didn't help. We also tried restarting the memcached and store pods, but that had no effect. We then decided to leave the system as-is overnight, and found this morning that it had started to work again.
About this issue
- State: closed
- Created 3 years ago
- Comments: 24 (8 by maintainers)
I’m running into the same issue that @jfg1701a and @jzangari mentioned.
I have multiple Prometheus instances (different clusters) using the remote_write mechanism to forward their metrics to a shared receiver. If one of the clusters has a clock issue and is sending metrics from the future, the receivers start to complain and drop all incoming metrics, regardless of source cluster (each of them has different externalLabels).
Receiver logs:
The Prometheus instances are receiving 409 responses, as reported:
Is there any plan for thanos receiver to implement logic to prevent such issues from happening? Currently, if one of the sources has a clock issue, all of the other sources are affected as well, making Thanos not resilient to clock skew.
Thanos version: v0.30.2
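To illustrate the kind of guard I have in mind, here is a rough sketch (the names and the 5-minute drift limit are made up, not Thanos code) of rejecting far-future samples at the receive boundary instead of letting them advance the shared TSDB head:

```go
package main

import (
	"fmt"
	"time"
)

// maxFutureDrift is a hypothetical per-receiver limit on how far ahead of
// the receiver's own clock a sample timestamp may be before it is dropped.
const maxFutureDrift = 5 * time.Minute

// sample is a minimal stand-in for one remote-write sample.
type sample struct {
	ts    int64 // milliseconds since epoch
	value float64
}

// filterFutureSamples drops samples whose timestamps are further in the
// future than maxFutureDrift, so a single source with a skewed clock cannot
// advance the shared TSDB head for every other source of the same tenant.
func filterFutureSamples(samples []sample, now time.Time) (kept []sample, dropped int) {
	limit := now.Add(maxFutureDrift).UnixMilli()
	for _, s := range samples {
		if s.ts > limit {
			dropped++
			continue
		}
		kept = append(kept, s)
	}
	return kept, dropped
}

func main() {
	now := time.Now()
	in := []sample{
		{ts: now.UnixMilli(), value: 1},                     // normal sample, kept
		{ts: now.Add(18 * time.Hour).UnixMilli(), value: 2}, // 18h in the future, dropped
	}
	kept, dropped := filterFutureSamples(in, now)
	fmt.Printf("kept=%d dropped=%d\n", len(kept), dropped)
}
```

Whether something like this belongs per request, per tenant, or behind a configurable time window is an open design question; the point is only that one source with a skewed clock should not be able to move the head for everyone else.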