caddy: v2.6.4 php-fpm gets stuck in read state
Since we upgraded from v2.6.3 to v2.6.4, we have been experiencing problems with our php-fpm backend.
The problem is that the php-fpm process gets stuck in a “read” state (and does not go back to the “accept” state). Eventually php’s pm.max_children is reached and the entire server is killed by EC2’s health checker.
I’ve done some debugging with strace, but have not found a pattern: the server gets lots of traffic, but the problem only occurs every once in a while (maybe once in 1000 requests).
root@frontend:~# strace -p 39614
strace: Process 39614 attached
read(5,
root@frontend:~# nsenter -t 39614 --net lsof -n -p 39614 | grep 5u
php-fpm 39614 www-data 5u IPv6 287757 0t0 TCP 172.18.0.3:9000->172.18.0.4:44492 (ESTABLISHED)
172.18.0.3 = app_php_1.app_default
172.18.0.4 = app_caddy_1.app_default
I’m afraid that #5367 might need to be revisited. I’ve reverted back to v2.6.3 and the problem is gone.
About this issue
- State: closed
- Created a year ago
- Comments: 16 (11 by maintainers)
This turned out to be more complicated than I expected (weird! 😅 )
Setting the `Content-Length` does prevent the requests from hanging, as I’ve done in #5435. I wrote up a custom integration suite that verifies that outcome. However, I suspect this only prevents those requests from hanging and doesn’t do anything about the underlying issue.
I discovered that FastCGI doesn’t allow for transfer encodings, and indeed Caddy’s reverse proxying rightly categorizes the `Transfer-Encoding` header as a “hop-by-hop” header that shouldn’t be transferred to any upstreams, including FastCGI: https://github.com/caddyserver/caddy/blob/master/modules/caddyhttp/reverseproxy/reverseproxy.go#L1375
Caddy accounts for this in its HTTP Transport for reverse proxying and allows for buffering by default. The issue seems to just be that it doesn’t do so for FastCGI.
For a point of comparison, nginx has buffering for FastCGI on by default: 4/8 KB in memory, then spilling to a temp file up to 1 GB, both configurable limits.
As a recap:
- Prior to Caddy 2.6.3: requests to PHP backends that had `Transfer-Encoding: chunked` were hanging (#5236)
- Caddy 2.6.3: included #5289, which allowed an exception in the check on whether or not to buffer the request, to account for `Transfer-Encoding: chunked`, at the risk of OOMing Caddy
- Caddy 2.6.4: reverts #5289 via #5367, which means the requests go back to hanging
- #5435: prevents the requests from hanging, but only if buffering is enabled
- Future?: Caddy implements buffering in the FastCGI transport, negating the need to enable buffering by default when using FastCGI

@trea Excellent, thanks for confirming.
We’ll make note of the need for FastCGI to buffer. @WeidiDeng is currently our FastCGI expert – I’m not sure we’re ready to implement temp file buffers though, but maybe we are? I dunno.
I’m sure we’d also accept a PR for that too if you’re interested 😉
Thank you for the carefully-researched contribution and the effort you have made for an elegant fix!
Actually, after looking into this more, I think you were right when you suggested #5367 was related, and I do think #5431 is related as well. I think they specifically go back to this change:
https://github.com/caddyserver/caddy/pull/5367/files#diff-a248c9a1ec018edea8b377d155bc1df1a642bf79d00ababb5cdacc6b86c5733dL627
The FastCGI specification states that Responders read up to the total of CONTENT_LENGTH, but as a result of #5367, that is now empty for requests with `Transfer-Encoding: chunked`.
I have updated my repo to include both versions of Caddy and two more services in the Docker Compose file, `chunked` and `chunked264`. When I run `docker-compose run chunked264`, PHP actually dies and is respawned immediately. You can call `docker-compose run chunked` all you want and it works fine.
I understand the reverting of the buffer settings, but I think Caddy should guarantee that `CONTENT_LENGTH` is set to an int (0 is fine and even desirable), at least as far as the FastCGI transport is concerned. I’d be happy to submit a PR if that’d be welcome.
I think changing this to guard against empty strings and send a 0 instead of an empty string for `CONTENT_LENGTH` would do it: https://github.com/caddyserver/caddy/blob/master/modules/caddyhttp/reverseproxy/fastcgi/fastcgi.go#L299
In fact, it’s done for the FastCGI protocol elsewhere 🤔: https://github.com/caddyserver/caddy/blob/master/modules/caddyhttp/reverseproxy/fastcgi/fastcgi.go#L186
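A guard along those lines could be sketched like this in Go. This is a hypothetical helper, not the actual Caddy patch; `contentLengthParam` is made up for illustration:

```go
package main

import (
	"fmt"
	"strconv"
)

// contentLengthParam returns the value to send for the FastCGI
// CONTENT_LENGTH parameter. A FastCGI Responder reads CONTENT_LENGTH
// bytes of FCGI_STDIN, so an empty string (as happens when the
// incoming request was chunked and the length is unknown) can leave
// php-fpm blocked in read(). Sending "0" when no definite length is
// available avoids that; a fuller fix would buffer the body first
// to learn the true length.
func contentLengthParam(contentLength int64) string {
	if contentLength < 0 {
		// net/http uses -1 (http.Request.ContentLength) for "unknown".
		return "0"
	}
	return strconv.FormatInt(contentLength, 10)
}

func main() {
	fmt.Println(contentLengthParam(-1))  // 0
	fmt.Println(contentLengthParam(0))   // 0
	fmt.Println(contentLengthParam(512)) // 512
}
```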
Thanks, the health.php file you provided looks fine.
I stand by my theory though that you have something else that’s causing PHP to not return quickly enough for your call to /health.php. I did create a reproduction using Docker Compose and put it in this repo: https://github.com/trea/php-fpm-issue
If you follow the README (and you may have to run the healthcheck a few times), you end up with requests that take too long. The healthcheck fails and then autoheal restarts it and Caddy can’t connect to PHP anymore until it’s finished restarting:
Log Output
I would recommend setting up a slowlog for FPM and ensuring that the ptrace capability is available, that way you should be able to get a better idea of what is causing FPM to hang.
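For reference, enabling the FPM slowlog is a pool-level setting; something like the following fragment (paths and the 5s threshold are illustrative, and the container needs the SYS_PTRACE capability for the backtraces to work):

```ini
; Hypothetical pool file, e.g. /etc/php/8.2/fpm/pool.d/www.conf:
; log a backtrace for any request running longer than 5 seconds.
slowlog = /var/log/php-fpm/slow.log
request_slowlog_timeout = 5s
```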
In this test case the output points to `curl_exec`: