prometheus-plugin: Monotonic counter decreases when builds expire

Jenkins and plugins versions report

Environment

What Operating System are you using (both controller, and any agents involved in the problem)?

AmazonLinux2

Reproduction steps

1. Create a job and run it a few times, with both successes and failures.
2. Configure Jenkins to expire/discard old builds.
3. Observe that counter metrics decrease (for example, the successful and failed build counts).

Expected Results

Counters should be monotonic. Per the Prometheus documentation, a counter is a cumulative metric that represents a single monotonically increasing value, which can only go up or be reset to zero on restart.
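As a small illustration of that definition, here is a Python sketch (the function name is my own, not from the plugin) that flags samples violating counter semantics: each scraped value must either be greater than or equal to the previous one, or be a reset to zero.

```python
def violates_counter_semantics(samples):
    """Return indices where a counter series decreases to a non-zero value.

    A legal counter may only increase or reset to zero (e.g. on restart);
    any other decrease - such as builds being discarded - breaks the contract.
    """
    bad = []
    for i in range(1, len(samples)):
        if samples[i] < samples[i - 1] and samples[i] != 0:
            bad.append(i)
    return bad

# A restart (reset to zero) is fine; a drop from 15 to 10 is not.
print(violates_counter_semantics([0, 5, 15, 0, 3]))    # []
print(violates_counter_semantics([0, 5, 15, 10, 12]))  # [3]
```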

Actual Results

The counters go down, which causes alerts based on these values to fire.

Anything else?

Reference to closed (but unresolved) issue: https://github.com/jenkinsci/prometheus-plugin/issues/327

Prometheus plugin version: 2.1.2
Jenkins version: 2.375.4

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 18

Most upvoted comments

Hi @adriengardou, watching this YouTube video by a co-founder of Prometheus will explain a lot about how counter resets work. Prometheus does a lot of the heavy lifting for you, so you don’t have to worry about avoiding decreases on restart or saving state. If you start watching at 2:30, he explains how the increase/rate functions are calculated when a counter is reset: https://youtu.be/7uy_yovtyqw?t=156&si=8NArI6RSew0TACGz

The idea in my PR is basically to not save the state of the counter at all and just let it reset to zero upon restart. Prometheus will automatically account for the reset on the next scrape, since the counter decreased. If I understand your PR correctly, it tries to get around the issue by caching old jobs that have been discarded, so the counter is always up to date. This works fine as long as the instance doesn’t restart and the counter continuously increases. But since both the existing implementation and your PR look at job history to determine the counter value, I believe it could cause some unwanted side effects, which I will describe below.

Example: you have a job that ran 15 builds but discarded 5 of them. Before the instance restarts, the counter shows the correct value, 15, because your PR adds the job cache. Upon restart, though, the cache gets wiped and we scrape the job again. The counter now reads 10, since we are no longer accounting for the discarded builds. Because Prometheus saw the value drop from 15 to 10, it assumes there was a counter reset. When it performs a rate calculation it adjusts for the reset, but instead of adding 15 for a reset to zero, it adds 15 on top of the new value of 10, because 10 was lower than the previous counter value. I hope this explanation is helpful; if you need any more clarification, I would be happy to provide it.
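The over-count described above can be sketched in Python. This is a simplified model of Prometheus’s reset compensation, not its actual implementation: whenever a sample is lower than its predecessor, the predecessor’s value is carried forward as an offset on all subsequent samples.

```python
def adjust_for_resets(samples):
    """Simplified model of how Prometheus compensates for counter resets:
    whenever a sample drops below its predecessor, the predecessor's value
    is added to every subsequent sample."""
    adjusted = []
    offset = 0.0
    prev = None
    for v in samples:
        if prev is not None and v < prev:
            offset += prev  # assumed reset: carry the lost progress forward
        adjusted.append(v + offset)
        prev = v
    return adjusted

# Cache-based PR: 15 builds, restart wipes the cache, rescrape sees only 10.
# Prometheus treats the drop as a reset and computes an increase of 25, not 15.
print(adjust_for_resets([0, 15, 10]))    # [0.0, 15.0, 25.0]

# Reset-to-zero approach: restart drops the counter to 0, then 3 new builds
# arrive; the adjusted series ends at 18, the correct total (15 + 3).
print(adjust_for_resets([0, 15, 0, 3]))  # [0.0, 15.0, 15.0, 18.0]
```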

I opened a draft PR to start the discussion. It still needs to be rebased on newer commits and brought in line with the contribution rules.

Thanks for working on this @adriengardou 👍

Now that @Waschndolos has explained how the plugin works, we do indeed need a way to store state so we don’t lose information when builds disappear.

I wondered whether storing this state might be possible using a Prometheus recording rule based on the raw metric. That would keep the state in Prometheus itself, so it isn’t lost on restart, and would simplify queries by providing a ‘real’ counter (i.e. one that always increases).

I am not a Prometheus expert, so I do not know whether such a recording rule can be written (what query would accumulate the state over time), but I am curious about your opinion on this.
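Whether or not a recording rule can express it, the computation such a rule would need is easy to sketch in Python (a hypothetical helper of my own, not plugin code): rebuild a monotonic series by accumulating only the non-negative deltas of the raw metric, so that drops caused by discarded builds are ignored.

```python
def monotonic_from_raw(samples):
    """Rebuild a monotonic counter from a raw series that may drop when
    builds are discarded: accumulate only the non-negative deltas.

    Note: a restart to zero still works (the next delta is measured from
    zero), but builds that are run *and* discarded between two scrapes are
    invisible and will be undercounted.
    """
    total = 0.0
    prev = None
    out = []
    for v in samples:
        if prev is None:
            total = float(v)
        else:
            total += max(v - prev, 0)  # ignore decreases
        out.append(total)
        prev = v
    return out

# 15 builds, 5 discarded (raw drops to 10), then 2 new builds:
print(monotonic_from_raw([0, 15, 10, 12]))  # [0.0, 15.0, 15.0, 17.0]
```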

Thank you @Waschndolos for your effort! The issue can be closed.

My goal was a per-day aggregation, so I did that with some plain Groovy: a single job that queries the pipelines, retrieves the build times, and then pushes them as a metric to Prometheus.
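For anyone wanting to do something similar, here is a sketch of the same idea in Python using only the standard library (the metric name, job name, and Pushgateway URL are hypothetical): format the daily aggregate in the Prometheus text exposition format and push it to a Pushgateway.

```python
import urllib.request

def exposition(name, help_text, value):
    """Render a single gauge in the Prometheus text exposition format."""
    return (f"# HELP {name} {help_text}\n"
            f"# TYPE {name} gauge\n"
            f"{name} {value}\n")

def push_to_gateway(base_url, job, payload):
    """PUT the payload to a Pushgateway's /metrics/job/<job> endpoint."""
    req = urllib.request.Request(
        f"{base_url}/metrics/job/{job}",
        data=payload.encode("utf-8"),
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

payload = exposition(
    "jenkins_builds_completed_daily",  # hypothetical metric name
    "Builds completed today, aggregated by a scheduled job",
    42,
)
# Requires a running Pushgateway, e.g.:
# push_to_gateway("http://pushgateway:9091", "jenkins_daily_stats", payload)
print(payload)
```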

Have a good one!