cortex: Cortex failing to handle samples out of nowhere
Description
I don’t understand what’s happening with my Cortex cluster. After several hours of ingesting samples and working fine, it just starts failing to receive metrics. One host starts returning 5xx and 4xx errors for the push API route and using a lot of CPU.
This issue goes away once I restart the Prometheus instances.
Setup
- Prometheus 2.22.2 (docker) - 4x4 CPU 8 GB RAM
- Cortex 1.5.0 (binary) - 3x6 CPU 16 GB RAM
- Cassandra 3.11.9 (binary) - 3x6 CPU 16 GB RAM
Here is my Cortex config from one of my hosts: config.
The Cassandra hosts are beefy and quite healthy. Pretty sure that’s not the issue.
Details
If we look at the samples scraped by Prometheus, we can see that the number is mostly stable at ~13600 samples total:
And this matches the number of samples received by each of the Cortex nodes for a while. But then one of the nodes, in this case master-03, starts slowly dropping in number of samples, and then has a sharp rise to ~85 K:
Labels per sample also rise:
At the same time we can see that the Cortex Distributor is starting to return more and more 5xx errors on master-03:
And this correlates with the master-03 CPU usage being way above the other 2 nodes in the Cortex cluster:
And we can see a rise in failures in storing samples on Prometheus for that remote:
As well as a jump in remote storage shards:
Questions
What is causing this? I did not change anything. It was working fine for a few hours and then suddenly issues started.
- Is this Prometheus sending samples incorrectly?
- Is it Cortex failing to handle the samples?
- Why is just one host in the cluster being overloaded? Shouldn’t the load be spread?
- What can I do to diagnose this further?
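One possible starting point for the last question is to compare per-node ingest rates programmatically rather than by eye. The sketch below is only an illustration: it assumes a Prometheus instance is scraping the Cortex nodes, that the distributor exposes a counter named cortex_distributor_received_samples_total, and that each series carries an instance label. The helper names query_instant and find_outliers, and the 50% tolerance, are made up for this example.

```python
import json
import statistics
import urllib.parse
import urllib.request


def query_instant(base_url, promql):
    """Run an instant query via the Prometheus HTTP API; return {instance: value}."""
    url = base_url.rstrip("/") + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    # Each vector sample looks like {"metric": {...}, "value": [timestamp, "number"]}.
    return {
        series["metric"].get("instance", "unknown"): float(series["value"][1])
        for series in payload["data"]["result"]
    }


def find_outliers(rates, tolerance=0.5):
    """Return the instances whose rate differs from the median by more than
    `tolerance`, expressed as a fraction of the median (hypothetical threshold)."""
    if not rates:
        return {}
    median = statistics.median(rates.values())
    if median == 0:
        # With a zero median, any non-zero node is the odd one out.
        return {name: value for name, value in rates.items() if value != 0}
    return {
        name: value
        for name, value in rates.items()
        if abs(value - median) / median > tolerance
    }


# Example wiring (assumed URL and metric name, not run here):
# rates = query_instant(
#     "http://localhost:9090",
#     "sum by (instance) (rate(cortex_distributor_received_samples_total[5m]))",
# )
# print(find_outliers(rates))
```

Fed with numbers shaped like the ones above (two nodes near 13600 and one near 85 K), find_outliers would return only the overloaded node. Comparing against the median rather than the mean keeps a single runaway node from skewing the baseline.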
About this issue
- State: closed
- Created 4 years ago
- Comments: 23 (18 by maintainers)
Considering this is related specifically to Cassandra, I’d say BURN IT WITH FIRE.
I have abandoned Cassandra in favor of Block storage and it’s performing very well.
Truth be told, considering the amount of time and money I wasted trying to get Cassandra to work as a backend, I would recommend putting deprecation warnings in the documentation or even gradually dropping support, because it’s just absurd how poorly it performed. Maybe there’s a way to fine-tune it to make it work, but based on my experience it’s extremely difficult and just an absolute waste of time.
To anyone reading this: DO NOT USE CASSANDRA AS A CORTEX BACKEND
We have a similar scenario.