cortex: Cortex failing to handle samples out of nowhere

Description

I don’t understand what’s happening with my Cortex cluster. After several hours of ingesting samples without problems, it suddenly starts failing to receive metrics. One host starts returning 5xx and 4xx errors for the push API route and uses a lot of CPU.

This issue goes away once I restart the Prometheus instances.

Setup

  • Prometheus 2.22.2 (docker) - 4x 4 CPU 8 GB RAM
  • Cortex 1.5.0 (binary) - 3x 6 CPU 16 GB RAM
  • Cassandra 3.11.9 (binary) - 3x 6 CPU 16 GB RAM

Here is my Cortex config from one of my hosts: config.

The Cassandra hosts are beefy and quite healthy. Pretty sure that’s not the issue.

Details

If we look at the samples scraped by Prometheus, we can see that the number is mostly stable at ~13600 samples in total:

prometheus_appended_samples_per_sec

This matches the number of samples received by each of the Cortex nodes for a while. But then one of the nodes, in this case master-03, slowly starts receiving fewer samples and then rises sharply to ~85 K:

cortex_samples_total_10m_rate

Labels per sample also rise:

cortex_labels_per_sample_raise

At the same time, we can see that the Cortex Distributor starts returning more and more 5xx errors on master-03:

cortex_request_responses_5xx_raise

This correlates with CPU usage on master-03 being well above that of the other 2 nodes in the Cortex cluster:

cortex_hosts_cpu_usage_raise

We can also see a rise in failures to store samples on Prometheus for that remote:

prometheus_samples_issues

As well as a jump in remote storage shards:

prometheus_storage_shards

Questions

What is causing this? I did not change anything. It was working fine for a few hours and then suddenly issues started.

  • Is this Prometheus sending samples incorrectly?
  • Is it Cortex failing to handle the samples?
  • Why is just one host in the cluster being overloaded? Shouldn’t the load be spread?
  • What can I do to diagnose this further?
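As a starting point for the last question, here is a minimal Python sketch of how the per-host comparison described above could be automated. It is only an illustration under stated assumptions: the helper names `instant_query_url` and `outlier_hosts` are hypothetical, the metric name in `EXAMPLE_QUERY` is an assumption that should be checked against the `/metrics` output of the Cortex version in use, and the `factor` threshold is arbitrary. The `/api/v1/query` path is the standard Prometheus HTTP API instant-query endpoint.

```python
from statistics import median
from typing import Dict, List
from urllib.parse import urlencode

# Assumed metric name; verify it against the /metrics output of your Cortex version.
EXAMPLE_QUERY = "sum by (instance) (rate(cortex_distributor_received_samples_total[10m]))"


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus HTTP API instant-query URL for the given PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def outlier_hosts(rates: Dict[str, float], factor: float = 2.0) -> List[str]:
    """Return hosts whose rate is above median * factor or below median / factor.

    `rates` maps a host name to a per-second rate (for example, received samples).
    `factor` is an arbitrary illustrative threshold, not a recommended value.
    """
    if not rates:
        return []
    mid = median(rates.values())
    if mid <= 0:
        # With a zero median there is no meaningful baseline to compare against.
        return []
    return sorted(
        host
        for host, rate in rates.items()
        if rate > mid * factor or rate < mid / factor
    )
```

Fetching the URL returned by `instant_query_url` for each metric of interest and passing the per-host values to `outlier_hosts` would show whether a single node, such as master-03 here, is diverging from the others before the errors start.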

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 23 (18 by maintainers)

Most upvoted comments

Considering this is related specifically to Cassandra, I’d say BURN IT WITH FIRE.

I have abandoned Cassandra in favor of Block storage and it’s performing very well.

Truth be told, considering the amount of time and money I wasted trying to get Cassandra to work as a backend, I would recommend adding deprecation warnings to the documentation or even gradually dropping support, because it’s just absurd how poorly it performed. Maybe there’s a way to fine-tune it to make it work, but based on my experience it’s extremely difficult and an absolute waste of time.

To anyone reading this: DO NOT USE CASSANDRA AS A CORTEX BACKEND

We have a similar scenario.
