cortex: Distributor failing with 500s for no clear reason
Describe the bug
I’m seeing random 500s when Prometheus is pushing metrics to /api/v1/push:
msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: rpc error: code = Unavailable desc = transport is closing"
This is what it looks like on the Cortex side:
msg="POST /api/v1/push (500) 281.388644ms Response: \"rpc error: code = Unavailable desc = transport is closing\\n\" ws: false; Content-Encoding: snappy; Content-Length: 32409; Content-Type: application/x-protobuf; User-Agent: Prometheus/2.26.0; X-Prometheus-Remote-Write-Version: 0.1.0; "
But it’s just a warn-level message, and even with debug logs I see no reason for this error.
The number of series being pushed is tiny:
> curl -s localhost:9090/metrics | grep "^prometheus_tsdb_head_series "
prometheus_tsdb_head_series 34294
And the hosts are VERY beefy and underutilized, so I’m really confused about why this is happening.
To Reproduce
Not really sure. I’m happy to help debug this, but I’m not sure where to start.
Expected behavior
The error should include the reason for the 500, but all it contains is rpc error: code = Unavailable desc = transport is closing.
Environment:
- Infrastructure: Systemd service on Ubuntu
- Version: 1.8.0
Storage Engine
Chunks storage using Cassandra 3.11.9.
Additional Context
I suddenly started getting a LOT of 500s, so I disabled all Prometheus instances except one to debug this, but the logs give me no indication as to why it’s actually happening. When I re-enable all the other Prometheus instances, the 500s keep rising until they overwhelm Cortex.
About this issue
- State: closed
- Created 3 years ago
- Comments: 26 (26 by maintainers)
Internally, Cortex uses gRPC to communicate between its various components, even when those components run inside a single binary.
If your Cortex cluster has very low usage, you may be running into the problem described here: https://github.com/cortexproject/cortex/issues/3606#issuecomment-745179803
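As a hedged illustration of the kind of server-side tuning discussed around that linked comment (assuming the root cause is gRPC keepalive pings being rejected on an otherwise idle connection; the option names come from the Cortex server config block and the values are only examples, not verified for this setup):

    server:
      # Accept keepalive pings from clients more frequently than the default allows.
      grpc_server_min_time_between_pings: 10s
      # Accept keepalive pings even when there is no active gRPC stream,
      # which is the normal state in a mostly idle cluster.
      grpc_server_ping_without_stream_allowed: true

Whether these exact options apply should be checked against the config reference for the Cortex version in use.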