caddy: Caddy hangs when php-fpm restarts

Affected version: Caddy v2.4.6, and likely most or all earlier v2 releases.

I’m using php-fpm as the back end for processing web requests. php-fpm may occasionally restart its worker processes. If Caddy is under high load when that happens, it won’t talk to the new worker processes until the load stops.

My Caddyfile is pretty straightforward.

{
  ##debug
  https_port 10443
  admin off

  # TLS options
  # self-signed requires: apt install libnss3-tools
  local_certs
  auto_https disable_redirects
  ocsp_stapling off
  default_sni localhost
  # Available 2021-06-07: https://github.com/caddyserver/caddy/pull/4153
  skip_install_trust

  order cgi last
}
:10443 {
  ## Uncomment for assigned certs
  tls ../users/certs/cert.pem ../users/certs/cert.key

  ## Specify the web directory
  root ../www
  ## Use PHP (must be rw by caddy user)
  php_fastcgi unix//var/spool/ff/php7.4-fpm.sock
  ## Enable security interface
  @sec not path /server/*
  handle @sec {
    rewrite * /log.php
  }

  file_server
}

I run Caddy using: ./caddy run

The stress test (stresser.sh) just spawns 20 GET requests at a time.

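#!/bin/bash
# Stress test: launch GET requests in the background, 20 at a time.
count=0
while true ; do
  # Fire one request in the background and discard its output
  curl -k -A 'stresser' \
    'https://localhost:10443/hello.php' > /dev/null 2>&1 &

  count=$((count + 1))
  if [ "$count" -ge 20 ] ; then
    echo "WAIT! $(date)"
    wait    # block until the whole batch has finished
    count=0
  fi
done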

The hello.php is nothing more than a hello-world reply.

PHP-FPM’s www.conf is configured for 8 static PHP workers. However, the workers can terminate if they take too long (if they are hung).

[www]
listen = /var/spool/ff/php7.4-fpm.sock
listen.owner = www-data
listen.group = www-data
pm = static
pm.max_children = 8
pm.start_servers = 2
pm.min_spare_servers = 1
pm.max_spare_servers = 3
request_terminate_timeout = 60
catch_workers_output = yes

While stresser.sh is running, simulate a PHP worker failure:

killall php-fpm7.4
rm -f /var/spool/ff/php7.4-fpm.pid
/usr/sbin/php-fpm7.4
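
For anyone scripting this restart step, it may help to confirm that the php-fpm socket is back before judging Caddy's behavior. The following is a minimal sketch, assuming the socket path from www.conf above. `wait_for_socket` is a hypothetical helper name, not part of php-fpm or Caddy. The `-S` test only checks that the socket file exists, not that a process is accepting connections on it, so a stale socket file would also pass.

```shell
#!/bin/bash
# Hypothetical helper: poll until a Unix socket file exists, or give up.
# Usage: wait_for_socket <path> [timeout_seconds]
wait_for_socket() {
  local sock="$1" timeout="${2:-10}" waited=0
  while [ ! -S "$sock" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "socket $sock not present after ${timeout}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "socket $sock present after ${waited}s"
}

# Example (run after restarting php-fpm as shown above):
#   wait_for_socket /var/spool/ff/php7.4-fpm.sock 10
```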

What happens: Caddy just hangs. It stops calling php.

Stop stresser.sh and wait 2-3 seconds. Then restart stresser.sh. What happens: Caddy works fine.

Caddy appears to not pick up new php worker connections when it’s handling incoming HTTP traffic. If the traffic stops – just for a second or two – then everything resets and it works fine.
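
To tell a true hang apart from a fast error response, a probe with a hard per-request time limit can run alongside the stress test. This is a sketch under assumptions, not part of the original report: `classify_curl_exit` and `probe_once` are hypothetical names, and the 5-second limit is arbitrary. It relies only on documented curl behavior: exit status 7 means the connection failed, 28 means the operation timed out, and `%{http_code}` prints 000 when no response was received.

```shell
#!/bin/bash
# Map a curl exit status to a coarse label.
# 0 = transfer completed (the HTTP status may still be an error, e.g. 502),
# 7 = could not connect, 28 = operation timed out.
classify_curl_exit() {
  case "$1" in
    0)  echo "ok" ;;
    7)  echo "connect-failed" ;;
    28) echo "timeout" ;;
    *)  echo "other($1)" ;;
  esac
}

# Issue one request with a hard time limit and print a one-line result.
# Usage: probe_once <url> [max_seconds]
probe_once() {
  local url="$1" max="${2:-5}" status rc=0
  status=$(curl -k -s -o /dev/null -A 'probe' --max-time "$max" \
                -w '%{http_code}' "$url") || rc=$?
  echo "$(date +%T) http=${status:-000} result=$(classify_curl_exit "$rc")"
}

# Example: probe once per second from a second terminal.
#   while true; do probe_once 'https://localhost:10443/hello.php' 5; sleep 1; done
```

If the probe reports result=timeout while stresser.sh is running and result=ok shortly after it stops, that matches the behavior described above.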

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 21 (12 by maintainers)

Most upvoted comments

(Thanks for the thorough debugging/troubleshooting! Very interesting. I have been watching this conversation but have been too busy to reply to it. Just wanted to let you know I’m keeping an eye on it. Carry on. 🙂 )

Retries do work with single upstreams.

> With caddy v2.4.6, they default to “0” (no timeout).

I don’t think this was true. I’m pretty sure the dial timeout was 10s; it just wasn’t properly documented as such.

But it’s true that read timeouts are not enabled by default.
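
For reference, the timeouts and retries discussed here can be set explicitly on the php_fastcgi directive. The fragment below is a sketch only. It assumes a Caddy release whose php_fastcgi accepts the `dial_timeout`, `read_timeout`, and `write_timeout` subdirectives, plus the `lb_try_duration` and `lb_try_interval` subdirectives inherited from reverse_proxy. All values are illustrative, and nothing in this thread confirms that they resolve the hang.

```
php_fastcgi unix//var/spool/ff/php7.4-fpm.sock {
	# Illustrative values only; tune for your workload.
	dial_timeout  3s
	read_timeout  30s
	write_timeout 30s

	# Keep retrying the single upstream for up to 5s before giving up.
	lb_try_duration 5s
	lb_try_interval 250ms
}
```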

> I haven’t looked at the php-fpm code: Do you know if php-fpm sends an oob signal when a worker is terminated? And if so, does caddy catch this signal?

I haven’t either. I would hope that php-fpm would close the connection when it resets. But I dunno how it behaves.

Btw, we’re aware our fastcgi code isn’t the best. We’ve had https://github.com/caddyserver/caddy/issues/3803 open for a while, wanting to do a refactor/rewrite. But it’s really tricky to get right, and we’re not experts on this protocol. There are very few actual fastcgi client implementations in Go that we can use as inspiration.