caddy: Prometheus metrics are redundant and slow

While digging into Caddy’s source code I noticed that every (!) route handler is wrapped in a metricsInstrumentedHandler, which updates Prometheus metrics during request execution. While this is a great feature and should definitely be enabled by default in Caddy, it currently uses up way too much CPU time, and the metrics it provides are quite redundant.

Since Caddy tries each route handler in order until the request is answered, the metric instrumentation runs for every handler, even one that didn’t actually partake in resolving the request (which is also quite difficult to define in this context). Handler-specific metrics are therefore constantly updated with unrelated data, so pretty much all of them are meaningless and only usable for tracking server-wide stats.
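To make the described behavior concrete, here is a rough net/http-style sketch (this is not Caddy’s actual metricsInstrumentedHandler; the package, metric, and function names are made up for illustration) of what per-handler instrumentation amounts to: each wrapped handler updates its own labeled metrics on every request that passes through it, whether or not it ended up producing the response.

```go
package metricsdemo

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical per-handler metrics (they would still need to be registered
// with prometheus.MustRegister in a real program).
var (
	requestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "demo_handler_requests_total"},
		[]string{"handler"},
	)
	requestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{Name: "demo_handler_request_duration_seconds"},
		[]string{"handler"},
	)
)

// instrument wraps a single handler. If every handler in a route chain is
// wrapped like this, the observations below happen once per handler per
// request rather than once per request, and the per-handler labels end up
// describing traffic the handler never actually answered.
func instrument(name string, h http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		h.ServeHTTP(w, r)
		requestDuration.WithLabelValues(name).Observe(time.Since(start).Seconds())
		requestsTotal.WithLabelValues(name).Inc()
	})
}
```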

As an example, here are the metrics for a simple Caddy server with 2 hosts: one has only a reverse_proxy handler, while the other has 2 respond handlers, 1 reverse_proxy handler, and 1 file_server handler. The metrics were taken after running an HTTP load-testing tool against the endpoints: prometheus-metrics.txt. As the example shows, the handler-specific metrics are pretty much identical across all handlers, even though in reality only the respond handler was requested.
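For reference, a minimal Caddyfile roughly matching that setup might look like the following (the host names, paths, and upstream addresses are placeholders, not the exact config used for the measurements above):

```
one.example.com {
	reverse_proxy localhost:8080
}

two.example.com {
	respond /ping "pong"
	respond /version "hello v1"
	reverse_proxy /api/* localhost:9090
	file_server
}
```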

The handler metrics are even less useful if the web server hosts multiple domains, since requests from all domains get mixed together in the metrics. Still, questionable metrics wouldn’t be much of an issue if they were as cheap to collect as server-wide metrics, but they, of course, aren’t, since they get updated multiple times before the request is finally answered.

I ran pprof while generating load with h2load against one of the simple respond handlers, and it turned out that 73% of the time spent during caddyhttp.Server.primaryHandlerChain.ServeHTTP was in the metricsInstrumentedHandler (only 30% of the time was spent by the actual caddyhttp.StaticResponse.ServeHTTP). Here’s the profile: profile.zip

I really think that metrics such as these should be server-wide, where they make more sense and are cheaper to collect. https://github.com/nginxinc/nginx-prometheus-exporter could be seen as an example of similar Prometheus metrics commonly used with nginx.

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 6
  • Comments: 28 (15 by maintainers)

Most upvoted comments

Hi, guys! I’m really sorry about my awful response time and the “abandoned” state I’ve left this issue in. Turns out that working a full-time job and studying for a bachelor’s doesn’t leave you with much free time (lol). I’d be really glad to take on this issue, as well as some of the other optimizations I’ve implemented in a different PR, and refactor/complete them.

I think we need to discuss, however, what is actually expected to be done regarding the metrics. The last time I tried working on this, it wasn’t really clear how we wanted the metrics to work and look in the config. Also, I tried to bite off more than I could chew in a single sprint of changes (reloadable metrics, configurable metrics, etc.), so I think we need to decide what should be prioritized right now.

The main issue is the performance, right? I’ll check out the changes that have occurred in the codebase and see what can be optimized. Then we can probably move on to the more feature-related metrics tasks which need doing.

@mholt I’ve described a way to fix this, as well as to add various requested metrics-related features, in the discussion above, and have begun thinking about how to actually implement it. I was planning to contribute a few other optimization-related PRs first, but if 2.5 is coming up soon (I haven’t really seen a deadline), I can start working on this instead (probably closer to the weekend, though).

Hello!
Has there been any progress on this matter? Since you mentioned continuing the discussion over on a Slack channel, I’m afraid it hasn’t been shared here yet.

We haven’t heard from @renbou in quite some time. I feel like the last we heard of them roughly coincides with the war in Ukraine (their profile mentions Saint-Petersburg, Russia) around March/April 2022. 🤔

The proposed solution is to instrument the handler chain with the metrics middleware only once, globally. Handler-specific information would be preserved by wrapping each handler in a lightweight middleware that simply records information about that handler: the first time a handler returns, its wrapper would set values in the request context identifying the handler by name.
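As a rough sketch of that idea in plain net/http terms (this is not a Caddy patch; the names are hypothetical, and deciding which handler actually “answered” is simplified here to “first wrapper to finish while the name is still unset”): the metrics middleware wraps the whole chain exactly once, and each handler gets only a cheap wrapper that records its name in the request context.

```go
package metricsdemo

import (
	"context"
	"net/http"
)

type ctxKey struct{}

// handlerInfo is shared through the request context so the single, global
// metrics middleware can label its one observation per request with the
// handler that produced the response.
type handlerInfo struct {
	name string
}

// instrumentChain wraps the entire handler chain once: inject the shared
// handlerInfo, run the chain, then update metrics a single time.
func instrumentChain(chain http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		info := &handlerInfo{}
		r = r.WithContext(context.WithValue(r.Context(), ctxKey{}, info))
		chain.ServeHTTP(w, r)
		// e.g. requestsTotal.WithLabelValues(info.name).Inc()
		// -- one metric update per request instead of one per handler.
	})
}

// tagHandler is the cheap per-handler wrapper: it does no metric work itself,
// it only records which handler ended up responding (a simplification of the
// real "did this handler answer the request?" question).
func tagHandler(name string, h http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		h.ServeHTTP(w, r)
		if info, ok := r.Context().Value(ctxKey{}).(*handlerInfo); ok && info.name == "" {
			info.name = name
		}
	})
}
```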

Because servers were added later, and some of those options don’t really make sense to configure per server (we couldn’t think of a valid use case where you’d want to configure different auto-https options per server).

Thanks for filing this @renbou - I’ll try to find some time this coming week to dig into it. My gut feeling is that some of the issue is due to the way middleware is handled and nested by Caddy, so it might be a bit more tricky to fully fix.

On the surface this is alarming:

[…] that 73% of the time spent during caddyhttp.Server.primaryHandlerChain.ServeHTTP was in the metricsInstrumentedHandler (only 30% of the time was spent by the actual caddyhttp.StaticResponse.ServeHTTP).

However, percentages can sometimes be misleading, so it would be good to get some actual numbers… I haven’t looked at the profile yet (and I won’t this weekend, as I’m recovering from jet lag).

Caddy’s handling of static responses in particular is very fast, so it’s not actually that surprising that the instrumentation surrounding it accounts for more CPU time. As with all performance tuning, the most useful question isn’t “how fast can we make it go?” but rather “how fast does it need to be?” In SRE terms, that might translate to “we have a latency SLO of 100ms 99% of the time, but we can’t reach that reliably if the metrics instrumentation takes 80ms.”