thanos: Compactor: Does not exit on error

Thanos, Prometheus and Golang version used: 17.2

Object Storage Provider: s3

What happened: The Compactor hit an error, but it neither exited (was not killed) nor continued working.

What you expected to happen: The Compactor either exits so it can be restarted, or keeps running regardless of the error.

How to reproduce it (as minimally and precisely as possible): n/a

Full logs of the relevant components:

Logs

level=info ts=2021-03-18T13:35:24.13386452Z caller=clean.go:33 msg="started cleaning of aborted partial uploads"
level=info ts=2021-03-18T13:35:24.133906785Z caller=clean.go:60 msg="cleaning of aborted partial uploads done"
level=info ts=2021-03-18T13:35:24.13391986Z caller=blocks_cleaner.go:43 msg="started cleaning of blocks marked for deletion"
level=info ts=2021-03-18T13:35:24.133930049Z caller=blocks_cleaner.go:57 msg="cleaning of blocks marked for deletion done"
level=info ts=2021-03-18T13:35:29.026173574Z caller=fetcher.go:458 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=3.499913081s cached=5974 returned=5974 partial=0
level=error ts=2021-03-18T13:36:28.794804266Z caller=runutil.go:99 msg="function failed. Retrying in next tick" err="BaseFetcher: iter bucket: Access Denied"
level=error ts=2021-03-18T13:37:28.70842268Z caller=runutil.go:99 msg="function failed. Retrying in next tick" err="BaseFetcher: iter bucket: Access Denied"
level=warn ts=2021-03-18T13:38:24.507549166Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="syncing metas: BaseFetcher: iter bucket: Access Denied"
level=info ts=2021-03-18T13:38:24.507583914Z caller=http.go:65 service=http/server component=compact msg="internal server is shutting down" err="syncing metas: BaseFetcher: iter bucket: Access Denied"
level=info ts=2021-03-18T13:38:25.007714323Z caller=http.go:84 service=http/server component=compact msg="internal server is shutdown gracefully" err="syncing metas: BaseFetcher: iter bucket: Access Denied"
level=info ts=2021-03-18T13:38:25.007758137Z caller=intrumentation.go:66 msg="changing probe status" status=not-healthy reason="syncing metas: BaseFetcher: iter bucket: Access Denied"

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 20 (8 by maintainers)

Most upvoted comments

We’ve hit this with v0.21.1 running on Kubernetes against a locally-hosted S3 (Ceph with radosgw):

level=info ts=2021-07-26T08:49:29.403830908Z caller=clean.go:60 msg="cleaning of aborted partial uploads done"
level=info ts=2021-07-26T08:49:29.403850994Z caller=blocks_cleaner.go:43 msg="started cleaning of blocks marked for deletion"
level=info ts=2021-07-26T08:49:29.403872117Z caller=blocks_cleaner.go:57 msg="cleaning of blocks marked for deletion done"
level=info ts=2021-07-26T08:50:29.12294803Z caller=fetcher.go:476 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=332.354878ms cached=924 returned=924 partial=0
level=error ts=2021-07-26T08:56:28.79212747Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="incomplete view: meta.json file exists: 01DP93WG84FMQQ0AWQXC1BTC20/meta.json: stat s3 object: Head \"http://ceph.internal:8080/thanos/01DP93WG84FMQQ0AWQXC1BTC20/meta.json\": context deadline exceeded"
level=warn ts=2021-07-26T08:58:19.271976663Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="syncing metas: incomplete view: meta.json file exists: 01DP93WG84FMQQ0AWQXC1BTC20/meta.json: stat s3 object: Head \"http://ceph.internal:8080/thanos/01DP93WG84FMQQ0AWQXC1BTC20/meta.json\": context deadline exceeded"
level=info ts=2021-07-26T08:58:19.272120069Z caller=http.go:74 service=http/server component=compact msg="internal server is shutting down" err="syncing metas: incomplete view: meta.json file exists: 01DP93WG84FMQQ0AWQXC1BTC20/meta.json: stat s3 object: Head \"http://ceph.internal:8080/thanos/01DP93WG84FMQQ0AWQXC1BTC20/meta.json\": context deadline exceeded"
level=error ts=2021-07-26T08:58:19.273050359Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="incomplete view: meta.json file exists: 01DP93WG84FMQQ0AWQXC1BTC20/meta.json: stat s3 object: Head \"http://ceph.internal:8080/thanos/01DP93WG84FMQQ0AWQXC1BTC20/meta.json\": context canceled"
level=info ts=2021-07-26T08:58:19.273219033Z caller=http.go:93 service=http/server component=compact msg="internal server is shutdown gracefully" err="syncing metas: incomplete view: meta.json file exists: 01DP93WG84FMQQ0AWQXC1BTC20/meta.json: stat s3 object: Head \"http://ceph.internal:8080/thanos/01DP93WG84FMQQ0AWQXC1BTC20/meta.json\": context deadline exceeded"
level=info ts=2021-07-26T08:58:19.273261935Z caller=intrumentation.go:66 msg="changing probe status" status=not-healthy reason="syncing metas: incomplete view: meta.json file exists: 01DP93WG84FMQQ0AWQXC1BTC20/meta.json: stat s3 object: Head \"http://ceph.internal:8080/thanos/01DP93WG84FMQQ0AWQXC1BTC20/meta.json\": context deadline exceeded"

After this the process was no longer doing anything (as described above) and did not respond to SIGTERM either.

By default, the compactor does not crash on halt errors; there is a hidden flag you can use to change that.

https://thanos.io/tip/components/compact.md/#halting
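
For reference, that page describes a hidden boolean flag controlling this behaviour, as well as a thanos_compact_halted metric you can alert on. From memory the flag is --debug.halt-on-error (default true), so disabling it should make the compactor exit instead of blocking; please verify the flag name against your Thanos version. Something like (data dir and objstore config paths are placeholders):

thanos compact \
  --wait \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=/etc/thanos/objstore.yaml \
  --no-debug.halt-on-error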

I also hit the same issue and am still investigating.


Another thing: we should only shut down the HTTP server at the very end of everything, so that debugging via pprof remains possible in cases such as this.
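
To illustrate the idea, here is a minimal standalone Go sketch (not the actual Thanos wiring; the address and helper names are made up): the debug/pprof server runs on its own lifecycle and is only shut down once everything else has finished, so it stays reachable even if the main work halts or hangs.

package main

import (
	"context"
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
	"time"
)

// runMainWork stands in for the component's real loops (compaction,
// metadata syncing, ...). It is only a placeholder for this sketch.
func runMainWork() {
	time.Sleep(2 * time.Second)
}

func main() {
	// Debug server on its own lifecycle: nothing below shuts it down early,
	// so pprof remains available while the main work is running (or stuck).
	debugSrv := &http.Server{Addr: "localhost:6060"} // nil Handler => DefaultServeMux, which serves pprof
	go func() {
		if err := debugSrv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Printf("debug server: %v", err)
		}
	}()

	runMainWork()

	// Only after everything else has finished do we take the debug server down.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := debugSrv.Shutdown(ctx); err != nil {
		log.Printf("debug server shutdown: %v", err)
	}
}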