caddy: v2.6.4 php-fpm gets stuck in read state

Since we upgraded from v2.6.3 to v2.6.4, we have been experiencing problems with our php-fpm backend.

The problem is that a php-fpm process gets stuck in a “read” state (and never goes back to the “accept” state). Eventually PHP’s pm.max_children limit is reached and the entire server is killed by EC2’s health checker.

I’ve done some debugging with strace, but have not found a pattern: the server gets a lot of traffic, but the problem only occurs every once in a while (maybe once in 1,000 requests).

root@frontend:~# strace -p 39614
strace: Process 39614 attached
read(5, 

root@frontend:~# nsenter -t 39614 --net lsof -n -p 39614 | grep 5u
php-fpm 39614 www-data    5u  IPv6 287757      0t0     TCP 172.18.0.3:9000->172.18.0.4:44492 (ESTABLISHED)

172.18.0.3 = app_php_1.app_default
172.18.0.4 = app_caddy_1.app_default

I’m afraid that #5367 might need to be revisited. I’ve reverted to v2.6.3 and the problem is gone.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 16 (11 by maintainers)

Most upvoted comments

This turned out to be more complicated than I expected (weird! 😅 )

Setting the Content-Length does prevent the requests from hanging, as I’ve done in #5435. I wrote up a custom integration suite that verifies that outcome.

However, I suspect this only prevents those requests from hanging and doesn’t do anything about the underlying issue.

I discovered that FastCGI doesn’t allow for transfer encodings, and indeed Caddy’s reverse proxying rightly categorizes the Transfer-Encoding header as a “hop-by-hop” header that shouldn’t be transferred to any upstreams including FastCGI:

https://github.com/caddyserver/caddy/blob/master/modules/caddyhttp/reverseproxy/reverseproxy.go#L1375

Caddy accounts for this in its HTTP Transport for reverse proxying and allows for buffering by default. The issue seems to just be that it doesn’t do so for FastCGI.
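
For illustration, here’s a minimal Go sketch of that kind of request buffering, assuming a helper that runs before the request reaches the FastCGI upstream (the helper and package names are made up; this is not Caddy’s actual code):

package fcgibuf

import (
    "bytes"
    "io"
    "net/http"
    "strconv"
)

// bufferRequestBody reads a body of unknown length (e.g. Transfer-Encoding:
// chunked) fully into memory so that a concrete Content-Length can be handed
// to a FastCGI upstream. Hypothetical helper; buffering the whole body in
// memory is exactly the OOM trade-off discussed in this thread.
func bufferRequestBody(r *http.Request) error {
    if r.ContentLength >= 0 {
        return nil // length already known; nothing to do
    }
    var buf bytes.Buffer
    n, err := io.Copy(&buf, r.Body)
    if err != nil {
        return err
    }
    r.Body = io.NopCloser(&buf)
    r.ContentLength = n
    r.TransferEncoding = nil // chunked is hop-by-hop and must not reach the upstream
    r.Header.Set("Content-Length", strconv.FormatInt(n, 10))
    return nil
}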

For a point of comparison, nginx has buffering for FastCGI on by default: 4/8 KB in memory, then spilling to a temp file of up to 1 GB, both configurable limits.
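
Purely as an illustration of that spill-to-disk approach (not nginx’s or Caddy’s actual implementation; the names and limits are made up), such a buffer could look roughly like this in Go:

package fcgibuf

import (
    "bytes"
    "errors"
    "io"
    "os"
)

// bufferBody sketches nginx-style request buffering: keep up to memLimit
// bytes in memory, spill larger bodies to a temp file, and reject bodies
// over fileLimit. The caller learns the exact length to use for
// CONTENT_LENGTH and is responsible for closing (and removing) the reader.
func bufferBody(body io.Reader, memLimit, fileLimit int64) (io.ReadCloser, int64, error) {
    var mem bytes.Buffer
    n, err := io.CopyN(&mem, body, memLimit)
    if err == io.EOF {
        return io.NopCloser(&mem), n, nil // small body: stays in memory
    }
    if err != nil {
        return nil, 0, err
    }

    // Body may exceed the in-memory limit: spill everything to a temp file.
    tmp, err := os.CreateTemp("", "fcgi-body-*")
    if err != nil {
        return nil, 0, err
    }
    if _, err := tmp.Write(mem.Bytes()); err != nil {
        tmp.Close()
        return nil, 0, err
    }
    rest, err := io.Copy(tmp, io.LimitReader(body, fileLimit-n+1))
    if err != nil {
        tmp.Close()
        return nil, 0, err
    }
    if n+rest > fileLimit {
        tmp.Close()
        return nil, 0, errors.New("request body exceeds buffer limit")
    }
    if _, err := tmp.Seek(0, io.SeekStart); err != nil {
        tmp.Close()
        return nil, 0, err
    }
    return tmp, n + rest, nil
}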

As a recap:

  • Prior to Caddy 2.6.3: Requests to PHP backends that had Transfer-Encoding: chunked were hanging (#5236)
  • Caddy 2.6.3: Included #5289, which allowed for an exception with the check on whether or not to buffer the request to account for Transfer-Encoding: chunked, at the risk of OOMing Caddy
  • Caddy 2.6.4: Reverts #5289 via #5367, which means the requests go back to hanging
  • #5435: Prevents the requests from hanging, but only if buffering is enabled
  • Future?: Caddy implements buffering in the FastCGI transport, negating the need for enabling buffering by default when using FastCGI

@trea Excellent, thanks for confirming.

We’ll make note of the need for FastCGI to buffer. @WeidiDeng is currently our FastCGI expert – I’m not sure we’re ready to implement temp file buffers, but maybe we are? I dunno.

I’m sure we’d also accept a PR for that too if you’re interested 😉

Thank you for the carefully researched contribution and the effort you’ve put into an elegant fix!

Actually, after looking into this more, I think you were right when you suggested #5367 was related, and I do think #5431 is related as well. I think they specifically go back to this change:

https://github.com/caddyserver/caddy/pull/5367/files#diff-a248c9a1ec018edea8b377d155bc1df1a642bf79d00ababb5cdacc6b86c5733dL627

The FastCGI specification states that Responders read up to the total of CONTENT_LENGTH, but as a result of #5367, that value is now empty for requests with Transfer-Encoding: chunked.

I have updated my repo to include both versions of Caddy and two more services in the Docker Compose file, chunked and chunked264. When I run docker-compose run chunked264, PHP actually dies and is respawned immediately. You can call docker-compose run chunked all you want and it works fine.

I understand reverting the buffer settings, but I think Caddy should guarantee that CONTENT_LENGTH is set to an integer (0 is fine and even desirable), at least as far as the FastCGI transport is concerned. I’d be happy to submit a PR if that’d be welcome.

I think changing this to guard against empty strings and send a 0 instead of an empty string for CONTENT_LENGTH would do it: https://github.com/caddyserver/caddy/blob/master/modules/caddyhttp/reverseproxy/fastcgi/fastcgi.go#L299
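
As a rough sketch of that guard (a hypothetical helper, not the actual fastcgi.go code), the idea is simply:

package fcgibuf

import "net/http"

// setContentLength illustrates the proposed guard: the FastCGI upstream
// should always receive a numeric CONTENT_LENGTH, never an empty string.
// Hypothetical helper, not Caddy’s actual code.
func setContentLength(env map[string]string, r *http.Request) {
    cl := r.Header.Get("Content-Length")
    if cl == "" {
        cl = "0" // unknown or chunked body: send 0 rather than an empty string
    }
    env["CONTENT_LENGTH"] = cl
}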

In fact, it’s done for the FastCGI protocol elsewhere 🤔 : https://github.com/caddyserver/caddy/blob/master/modules/caddyhttp/reverseproxy/fastcgi/fastcgi.go#L186

Thanks, the health.php file you provided looks fine.

I stand by my theory, though, that you have something else causing PHP not to return quickly enough for your call to /health.php. I did create a reproduction using Docker Compose and put it in this repo: https://github.com/trea/php-fpm-issue

If you follow the README (and you may have to run the healthcheck a few times), you end up with requests that take too long. The healthcheck fails, autoheal restarts the PHP container, and Caddy can’t connect to PHP until it has finished restarting:

“error”: “dialing backend: dial tcp 192.168.48.6:9000: connect: connection refused”

Log Output
php-fpm-exit-test-caddy-1     | {
  "level": "debug",
  "ts": 1678475769.7431428,
  "logger": "http.handlers.reverse_proxy",
  "msg": "upstream roundtrip",
  "upstream": "php:9000",
  "duration": 0.000587583,
  "request": {
    "remote_ip": "192.168.48.4",
    "remote_port": "54200",
    "proto": "HTTP/1.1",
    "method": "GET",
    "host": "caddy",
    "uri": "/slow.php",
    "headers": {
      "X-Forwarded-Proto": [
        "http"
      ],
      "X-Forwarded-Host": [
        "caddy"
      ],
      "User-Agent": [
        "fortio.org/fortio-1.53.1"
      ],
      "X-Forwarded-For": [
        "192.168.48.4"
      ]
    }
  },
  "error": "dialing backend: dial tcp 192.168.48.6:9000: connect: connection refused"
}

I would recommend setting up a slowlog for FPM and ensuring that the ptrace capability is available; that way you should be able to get a better idea of what is causing FPM to hang.
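
For reference, a minimal setup could look like the following (the paths, the 2s threshold, and the service name are just examples):

; php-fpm pool config (e.g. www.conf)
request_slowlog_timeout = 2s
slowlog = /var/log/php-fpm/www-slow.log

# docker-compose.yml: the FPM container needs the ptrace capability so the slowlog can capture backtraces
services:
  php:
    cap_add:
      - SYS_PTRACE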

In this test case the output points to curl_exec:

php-fpm-exit-test-php-1       | [10-Mar-2023 19:33:48] WARNING: [pool www] child 8, script '/var/www/html/slow.php' (request: "GET /slow.php") executing too slow (2.210546 sec), logging
php-fpm-exit-test-php-1       | [10-Mar-2023 19:33:48] WARNING: [pool www] child 7, script '/var/www/html/slow.php' (request: "GET /slow.php") executing too slow (2.220195 sec), logging
php-fpm-exit-test-php-1       | [10-Mar-2023 19:33:48] NOTICE: child 8 stopped for tracing
php-fpm-exit-test-php-1       | [10-Mar-2023 19:33:48] NOTICE: about to trace 8
php-fpm-exit-test-php-1       |
php-fpm-exit-test-php-1       | [10-Mar-2023 19:33:48]  [pool www] pid 8
php-fpm-exit-test-php-1       | script_filename = /var/www/html/slow.php
php-fpm-exit-test-php-1       | [0x0000ffff9f2130c0] curl_exec() /var/www/html/slow.php:6