caddy: caddy_http_requests_in_flight metric slowly rising
Hello everyone,
I’m monitoring caddy via the /metrics endpoint and noticed that the value reported by caddy_http_requests_in_flight keeps increasing over time. I would expect this number to return to zero when there are no requests in flight, but it doesn’t:
The query used is `avg(avg_over_time(caddy_http_requests_in_flight{handler="[[handler]]", server="[[server]]"}[1m]))`.
Maybe someone knows what’s causing this. Thank you all for the great work on caddy!
About this issue
- State: closed
- Created 3 years ago
- Comments: 22 (9 by maintainers)
I’ve kept an eye on this for a couple of days now, and setting the timeouts in the php_fastcgi directive solves this problem 👍🏻
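For reference, a minimal sketch of what that looks like in a Caddyfile (the upstream address and timeout values are placeholders, not the actual config from this setup; the idea is that the timeouts roughly match PHP’s own execution limits, so Caddy gives up on a dead upstream connection instead of holding the request open):

```
example.com {
	root * /var/www/html

	php_fastcgi php:9000 {
		# fail fast on connect, but allow long-running scripts
		dial_timeout  10s
		read_timeout  30m
		write_timeout 30m
	}

	file_server
}
```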
Right, Caddy talks to a php-fpm container that executes the Moodle scripts. The compose project looks like this:
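Something along these lines, as a sketch (service names, image tags and the shared volume are assumptions; the real compose file is not reproduced here):

```yaml
services:
  caddy:
    image: caddy:2
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile
      - moodle-code:/var/www/html   # Caddy and php-fpm share the Moodle webroot
  php:
    build: ./php                    # custom image based on bitnami/php-fpm, see below
    volumes:
      - moodle-code:/var/www/html

volumes:
  moodle-code:
```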
The php container is based on the bitnami php-fpm container with some programs added that Moodle needs for converting submissions:
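Roughly like this, as a hypothetical sketch (the base tag and package list are guesses; Ghostscript and unoconv are the usual dependencies for Moodle’s PDF annotation and document conversion):

```Dockerfile
FROM bitnami/php-fpm:8.0

USER root
# assumes a Debian-based image; package names may differ
RUN apt-get update \
 && apt-get install -y --no-install-recommends ghostscript unoconv \
 && rm -rf /var/lib/apt/lists/*
```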
In fact, I deliberately did not set these, as there are already limits in the php-fpm configuration (although they are set quite high, since some people upload files over slow connections):
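The relevant limits are along these lines (the values below are illustrative, chosen to match the roughly 30-minute ceiling mentioned next, not the actual configuration):

```ini
; php.ini
max_execution_time  = 1800
max_input_time      = 1800
upload_max_filesize = 512M
post_max_size       = 512M

; php-fpm pool (www.conf) – hard upper bound per request
request_terminate_timeout = 30m
```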
This should mean that a connection between caddy and the php-fpm container stays open for at most 30 minutes. I can try setting the limits in the Caddyfile to 30 minutes as well to see if that makes a difference.
I guess you’re on the right track here: for testing purposes I restarted the php container and voilà:
Does this mean that caddy does not realize that a client has disconnected? From the load balancer I know that there aren’t that many open connections to caddy. I’d expect the metric to decrease as soon as a request is done, i.e. it is either served successfully or the client disconnects for some other reason. But in the latter case the request seems to stay in an “open” state. I don’t think caddy is still waiting for a response from php, as that should time out after max_execution_time seconds, and no later than that the request should be done…

Off Topic:
This may be true, but what’s much less efficient most of the time are the php scripts themselves. Kind of weird to need four machines to serve 80 req/s… (I’m running four nodes with caddy + php, 4x2 vCPUs each, as this is cheaper for this use case than one beefier machine.) I know that fastcgi can be used for other languages too; I used to run caddy 1 with a python fastcgi-wsgi wrapper. But I guess most people who use the php_fastcgi directive actually use php, and I’ve never seen caddy being the bottleneck there.
Here is an image where the node was removed from the load balancer for quite some time. Caddy reports zero requests (all handlers), but requests in flight are still reported as 40:
The only other “server” in Caddy terminology is the one that provides the prometheus /metrics endpoint. There it looks as expected:
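For context, that metrics server is just a separate site block along the lines of the following sketch (the listen address is an assumption):

```
:9180 {
	metrics /metrics
}
```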
Nothing that I’d call “exceedingly high”, but I’ll attach the stack dump anyway. This machine currently reports 40 requests in flight as a baseline even when nothing is happening. I also briefly removed the machine from the load balancer, so there were no requests at all during that time, but the metric still reported 40.
Edit: I also attached the other files that reported non-zero numbers: goroutine.txt, allocs.txt, heap.txt, threadcreate.txt
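(For anyone wanting to reproduce this: those profiles can be pulled from Caddy’s admin API, assuming it is reachable on the default localhost:2019 and pprof is exposed there:)

```sh
# hypothetical commands; adjust host/port to your admin endpoint
curl -s "localhost:2019/debug/pprof/goroutine?debug=1"    > goroutine.txt
curl -s "localhost:2019/debug/pprof/allocs?debug=1"       > allocs.txt
curl -s "localhost:2019/debug/pprof/heap?debug=1"         > heap.txt
curl -s "localhost:2019/debug/pprof/threadcreate?debug=1" > threadcreate.txt
```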
I also thought about this, but since I’m only using these plugins in the load-balanced setup and I see the same behaviour with the single server (I’ve included the Dockerfile here: https://github.com/caddyserver/caddy/issues/4009#issuecomment-772680853), I’m quite sure that this is not the issue here.
I think you’re right. There are about 30000 entries in the log with “stream closed”, way more than the increase in the metric, and I can’t really see any correlation with the timestamps either.
I’ll try this, but I can already say that the single server setup does not have HTTP/3 enabled, so I do not think that this is the cause.
Some more information: I do not think that the metric is correct. Normally I’d expect this to go back to zero when there aren’t any active requests. From the load balancer stats I can see that the number of open connections goes back to basically zero at least during the night, so there shouldn’t be any open connections any more during that time:
(The first graph shows open connections, the second new connections per second.)
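For the Caddy side of the same comparison, queries along these lines (metric and label names are assumptions; check your own /metrics output) make the mismatch visible: the request rate drops to zero overnight while the in-flight gauge stays stuck at its elevated value:

```
# gauge that should return to ~0 but stays at ~40
caddy_http_requests_in_flight{server="srv0"}

# actual request throughput for the same server
sum(rate(caddy_http_requests_total{server="srv0"}[5m]))
```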