caddy: caddy_http_requests_in_flight metric slowly rising

Hello everyone,

I’m monitoring Caddy via the /metrics endpoint and noticed that the value reported by caddy_http_requests_in_flight keeps increasing over time. I’d expect this number to return to zero when there are no requests in flight, but it doesn’t:

image

The query used is avg(avg_over_time(caddy_http_requests_in_flight{handler="[[handler]]", server="[[server]]"}[1m])).

Maybe someone knows what’s causing this. Thank you all for the great work on caddy!

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 22 (9 by maintainers)

Most upvoted comments

I’ve kept an eye on this for a couple of days now, and setting the timeouts in the php_fastcgi directive solves this problem 👍🏻
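
For reference, a minimal sketch of what such a configuration could look like; the site address, upstream address and timeout values here are illustrative assumptions rather than the actual config from this thread (see the php_fastcgi docs linked further down):

example.com {
    php_fastcgi 127.0.0.1:9000 {
        # FastCGI transport timeouts; values are examples only
        dial_timeout  10s
        read_timeout  30m
        write_timeout 30m
    }
}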

There seem to be a lot of goroutines and allocations in the FastCGI client – what’s your PHP backend like? You said it’s Moodle, right? It looks like it’s holding connections open.

Right, Caddy talks to a php-fpm container that executes the Moodle scripts. The compose project looks like this:

version: '3.3'

services:
  caddy:
    image: registry.gitlab.com/steffend/jvg-moodle/caddy
    restart: always
    network_mode: host
    environment:
      - CLOUDFLARE_API_TOKEN=${CLOUDFLARE_API_TOKEN}
    volumes:
      - ./caddy:/var/www
      - /var/log:/var/log
      - /mnt/moodle:/mnt/moodle:rshared
      - /mnt/moodle_files:/mnt/moodle_files:rshared
    command: caddy run --config /var/www/Caddyfile
    cap_add:
      - NET_BIND_SERVICE
  php:
    image: registry.gitlab.com/steffend/jvg-moodle/php-fpm
    volumes:
      - ./php/php.ini:/opt/bitnami/php/etc/conf.d/custom.ini
      - ./php/www.conf:/opt/bitnami/php/etc/php-fpm.d/www.conf
      - /mnt/moodle:/mnt/moodle:rshared
      - /mnt/moodle_files:/mnt/moodle_files:rshared
    ports:
      - 127.0.0.1:9000:9000
    restart: always

The php container is based on the bitnami php-fpm container with some programs added that Moodle needs for converting submissions:

FROM bitnami/php-fpm:7.3-prod

RUN install_packages ghostscript unoconv autoconf build-essential \
 && pecl install redis \
 && echo "extension=redis.so" >> /opt/bitnami/php/etc/php.ini

CMD [ "php-fpm", "-F", "--pid", "/opt/bitnami/php/tmp/php-fpm.pid", "-y", "/opt/bitnami/php/etc/php-fpm.conf" ]

You might try setting read and write timeouts: https://caddyserver.com/docs/caddyfile/directives/php_fastcgi

In fact, I intentionally did not set these, as there are already limits in the php-fpm configuration (although they are set quite high, because some people upload files over slow connections):

root@moodle-web-9630683:~/moodle# cat php/php.ini
upload_max_filesize = 4G
post_max_size = 4G
default_socket_timeout = 360
max_input_time = 1800
max_execution_time = 1800

This should mean that a connection between Caddy and the php-fpm container stays open for at most 30 minutes. I can try setting the limits in the Caddyfile to 30 minutes as well to see if that makes a difference.

However, the cause of connections being kept open to the FastCGI responder should be identified first, because that’s not normal.

I guess you’re on the right track here: for testing purposes I restarted the php container and voilà:

image

Does this mean that Caddy does not realize that a client disconnected? From the load balancer I know that there aren’t that many open connections to Caddy. I’d expect the metric to decrease as soon as the request is done, i.e. as soon as it is either served successfully or the client disconnects for some other reason. But in the latter case it seems like the request stays in an “open” state. I don’t think Caddy is waiting for a response from PHP, as that should time out after max_execution_time seconds, and by then at the latest the request should be done…


Off Topic:

Incidentally, the FastCGI transport is not efficient

This may be true, but what’s usually much less efficient is the PHP scripts themselves. It’s kind of weird to need four machines to serve 80 req/s… (Off topic: I’m running four nodes with Caddy + PHP, 4x2 vCPUs each, as this is cheaper than one beefier machine for this use case.) I know that FastCGI can be used with other languages too; I used to run Caddy 1 with a Python FastCGI/WSGI wrapper. But I guess most people who use the php_fastcgi directive actually run PHP, and I’ve never seen Caddy be the bottleneck there.

Here is an image where the node was removed from the load balancer for quite some time. Caddy reports zero requests (across all handlers), but requests in flight are still reported as 40:

image

The only other “server” in Caddy terms is the one that provides the Prometheus /metrics endpoint. There, the metric looks as expected:

image
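
(For context, such a metrics-only server is typically a separate Caddyfile site block using the metrics handler; the port below is an assumption for illustration, not the actual config:)

:2020 {
    # expose Prometheus metrics on a dedicated port
    metrics /metrics
}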

@SteffenDE If you open the profiling endpoint at http://localhost:2019/debug/pprof (from that machine, of course), what are the numbers on the index page? If any of them are exceedingly high, maybe attach a dump of the linked page.

Oh, and if you’re using Prometheus already, it’s probably exporting that data for you already? So it might already be in your dashboards, in a more presentable form. I’m only guessing, but @hairyhenderson would know.

image

Nothing that I’d call “exceedingly high”, but I’ll attach the stack dump anyway. This machine currently reports 40 requests in flight as a baseline when nothing is happening. I also briefly removed the machine from the load balancer, so there were no requests during that time, but the metric still reported 40.

Edit: I also attached the other files that reported non-zero numbers. goroutine.txt allocs.txt heap.txt threadcreate.txt

Also, I just looked through the full config (thanks for posting that). I know this might not be feasible, but how much can you help narrow down whether it might be one of the third-party modules you have installed? Redis or proxy protocol… I’m not sure if you can disable either, but that would tell us very quickly whether this issue is even in the right repo. Edit: Derp, proxy_protocol probably has more to do with active requests than storage does. So if you can disable proxy_protocol for a short time to test, that’d be worthwhile.

The debug endpoint I mentioned in my last comment should also shed light on this if it shows stack traces for them, but I’m not sure that’s guaranteed in every situation. Disabling those modules (somehow…) one at a time would be a pretty surefire test.

I thought about this too, but since I only use these plugins in the load-balanced setup and I also see this behavior on the single server (I’ve included the Dockerfile here: https://github.com/caddyserver/caddy/issues/4009#issuecomment-772680853), I’m quite sure that this is not the issue.

I think you’re right. There are about 30,000 entries in the log with “stream closed”, way more than the increase in the metric. And I can’t really see any correlation with the timestamps either.

I’m on mobile, so this is off the cuff, but what if you disable HTTP/3?

PS: Thank you very much for the details!!
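
(If you want to try this: recent Caddy versions let you disable HTTP/3 via a global server option; the exact option and syntax depend on the Caddy version in use, so treat this as a sketch only:)

{
    servers {
        # advertise only HTTP/1.1 and HTTP/2, which disables HTTP/3
        protocols h1 h2
    }
}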

I’ll try this, but I can already say that the single server setup does not have HTTP/3 enabled, so I do not think that this is the cause.

Some more information: I do not think the metric is correct. Normally I’d expect it to go back to zero when there aren’t any active requests. From the load balancer stats I can see that the number of open connections drops back to basically zero, at least during the night, so there shouldn’t be any open connections anymore during that time:

image

(The first graph shows open connections, the second new connections per second.)