puma: `fork_worker` keeps serving stale code indefinitely and at random after phased restart (especially after the second phased restart)
Describe the bug
tl;dr: Puma 5.0.3 and 5.0.4 have a bug where, when `fork_worker` is present in the configuration file, a phased restart (`kill -USR1 $(cat puma.pid)`) causes about half of the workers to keep serving stale code (from before the restart) indefinitely, even long after the log file shows that the last worker has finished restarting.
This one took 2+ hours of debugging, but I’m now 100% certain of what triggers the bug and it’s reproducible every single time.
After upgrading from Puma 4 to 5.0.3 and later 5.0.4 today, we started seeing stale code in production after deployments; however, it happened at random: one page refresh served new code, the next two served old code, and so on.
To confirm the bug, we used a constant defined in our Rails app, `LAST_DEPLOYED_VERSION`, which uses a simple `File.read` to get a version number from a text file that our deploy script automatically bumps on every deploy. We added that string to the response headers and started monitoring with `watch -n1 curl -I http://ourdomain.com`.
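A minimal sketch of that setup (the initializer path and version file location here are assumptions for illustration; only the constant and the `tl_app_version` response header matter):

```ruby
# config/initializers/deployed_version.rb (hypothetical path)
# Reads the version number that the deploy script bumps on every deploy.
LAST_DEPLOYED_VERSION = File.read(Rails.root.join("VERSION")).strip

# app/controllers/application_controller.rb
class ApplicationController < ActionController::Base
  # Expose the deployed version on every response so `curl -I` can show
  # whether a worker running old or new code served the request.
  after_action { response.set_header("tl_app_version", LAST_DEPLOYED_VERSION) }
end
```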
Here’s what happens:
- If we issue a Puma restart using `systemctl restart puma.service`, the headers get updated immediately on all new requests;
- If we don’t have `fork_worker` in our config.rb file and issue a `kill -USR1 $(cat puma.pid)` (phased restart), while monitoring the log with `tail -f puma.stdout.log`, we can watch the old string get sparser as each worker is restarted, disappearing completely when the last worker finishes restarting;
- However, if we have `fork_worker` in our config.rb file and issue a `kill -USR1 $(cat puma.pid)` (phased restart), while monitoring the log with `tail -f puma.stdout.log`, we can observe that the old version string appears much more frequently, AND it still appears in 50~60% of requests even after the log indicates that all workers have been restarted. This happens especially after the second deployment following the Puma service restart, and it persists forever: (1) you can wait for minutes and minutes after the last worker has finished restarting, and the old string (stale code) is still being served on around 50% of the requests (see the polling sketch after this list), and (2) if you bump the string file and issue a new `kill -USR1 $(cat puma.pid)`, the same behavior happens, now with the previous string stuck in 50% of the requests, and so on.
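As a rough stand-in for the `watch -n1 curl -I` loop, here is a small polling sketch that tallies the `tl_app_version` header across responses (the URL is the placeholder domain from this report; the script itself is only illustrative):

```ruby
#!/usr/bin/env ruby
# Polls the app once per second and counts how often each value of the
# tl_app_version response header is seen, to quantify how many requests
# are still being served by workers running stale code.
require "net/http"
require "uri"

uri = URI("http://ourdomain.com/")
counts = Hash.new(0)

loop do
  response = Net::HTTP.get_response(uri)
  version = response["tl_app_version"] || "missing"
  counts[version] += 1
  puts counts.inspect
  sleep 1
end
```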
The stdout.log file on a phased restart with `fork_worker` has nothing unusual about it… it shows that all of our 24 workers have been restarted, yet the stale code keeps being served indefinitely:
[8705] + Changing to /var/www/application
[8705] - Stopping 15627 for phased upgrade...
[8705] - TERM sent to 15627...
[8705] - Worker 0 (pid: 30539) booted, phase: 2
[8705] - Stopping 21114 for phased upgrade...
[8705] - TERM sent to 21114...
[8705] - Worker 1 (pid: 2920) booted, phase: 2
[8705] - Stopping 25085 for phased upgrade...
[8705] - TERM sent to 25085...
[8705] - Worker 2 (pid: 5806) booted, phase: 2
[8705] - Stopping 29974 for phased upgrade...
[8705] - TERM sent to 29974...
[8705] - Worker 3 (pid: 9764) booted, phase: 2
[8705] - Stopping 2254 for phased upgrade...
[8705] - TERM sent to 2254...
[8705] - Worker 4 (pid: 14295) booted, phase: 2
[8705] - Stopping 7364 for phased upgrade...
[8705] - TERM sent to 7364...
[8705] - Worker 5 (pid: 19209) booted, phase: 2
[8705] - Stopping 11931 for phased upgrade...
[8705] - TERM sent to 11931...
[8705] - Worker 6 (pid: 23874) booted, phase: 2
[8705] - Stopping 15019 for phased upgrade...
[8705] - TERM sent to 15019...
[8705] - Worker 7 (pid: 27936) booted, phase: 2
[8705] - Stopping 20185 for phased upgrade...
[8705] - TERM sent to 20185...
[8705] - Worker 8 (pid: 32674) booted, phase: 2
[8705] - Stopping 24566 for phased upgrade...
[8705] - TERM sent to 24566...
[8705] - Worker 9 (pid: 5029) booted, phase: 2
[8705] - Stopping 29816 for phased upgrade...
[8705] - TERM sent to 29816...
[8705] - Worker 10 (pid: 8932) booted, phase: 2
[8705] - Stopping 2251 for phased upgrade...
[8705] - TERM sent to 2251...
[8705] - Worker 11 (pid: 13000) booted, phase: 2
[8705] - Stopping 5235 for phased upgrade...
[8705] - TERM sent to 5235...
[8705] - Worker 12 (pid: 16063) booted, phase: 2
[8705] - Stopping 10033 for phased upgrade...
[8705] - TERM sent to 10033...
[8705] - Worker 13 (pid: 18435) booted, phase: 2
[8705] - Stopping 12952 for phased upgrade...
[8705] - TERM sent to 12952...
[8705] - Worker 14 (pid: 20156) booted, phase: 2
[8705] - Stopping 18155 for phased upgrade...
[8705] - TERM sent to 18155...
[8705] - Worker 15 (pid: 21735) booted, phase: 2
[8705] - Stopping 22855 for phased upgrade...
[8705] - TERM sent to 22855...
[8705] - Worker 16 (pid: 23203) booted, phase: 2
[8705] - Stopping 27509 for phased upgrade...
[8705] - TERM sent to 27509...
[8705] - Worker 17 (pid: 24777) booted, phase: 2
[8705] - Stopping 32049 for phased upgrade...
[8705] - TERM sent to 32049...
[8705] - Worker 18 (pid: 26341) booted, phase: 2
[8705] - Stopping 1937 for phased upgrade...
[8705] - TERM sent to 1937...
[8705] - Worker 19 (pid: 27696) booted, phase: 2
[8705] - Stopping 5904 for phased upgrade...
[8705] - TERM sent to 5904...
[8705] - Worker 20 (pid: 29269) booted, phase: 2
[8705] - Stopping 10939 for phased upgrade...
[8705] - TERM sent to 10939...
[8705] - Worker 21 (pid: 31071) booted, phase: 2
[8705] - Stopping 15926 for phased upgrade...
[8705] - TERM sent to 15926...
[8705] - Worker 22 (pid: 32708) booted, phase: 2
[8705] - Stopping 20589 for phased upgrade...
[8705] - TERM sent to 20589...
[8705] - Worker 23 (pid: 1940) booted, phase: 2
Here’s a quick video of the version string (`tl_app_version`) more than 10 minutes after we observed the last worker restarting in the logs; notice how prevalent the old string (in this case, `7.397`) is compared to the current version’s string (`7.398`):
https://www.loom.com/share/7195782bde3843738d26696f8a2af89f
As soon as we remove `fork_worker` from the `config.rb` and issue a `systemctl restart puma`, we can issue as many deployments as we wish using `kill -USR1 $(cat puma.pid)`, and we never, not even once, observed an old version string being served in any request issued after the last worker rebooted.
The server runs Ubuntu 18.04 LTS, the bug was confirmed on both 5.0.3 and 5.0.4, and here are our systemd service file, systemd socket activation file, and puma.rb config:
[Unit]
Description=Puma {{ param_app_name }} application server
After=network.target
# Socket activation
Requires=puma-{{ param_app_name }}.socket
[Service]
Type=simple
User={{ param_user_name }}
WorkingDirectory=/var/www/{{ param_app_name }}/application
ExecStart=/home/{{ param_user_name }}/.rbenv/shims/bundle exec --keep-file-descriptors puma -C /var/www/{{ param_app_name }}/shared/config/puma.rb
Restart=always
[Install]
WantedBy=multi-user.target
[Unit]
Description=Puma HTTP Server Accept Sockets for {{ param_app_name }}
[Socket]
SocketUser={{ param_user_name }}
SocketGroup={{ param_user_name }}
ListenStream=/var/www/{{ param_app_name }}/shared/tmp/sockets/puma-{{ param_app_name }}.sock
# Socket options matching Puma defaults
NoDelay=true
ReusePort=true
Backlog=1024
[Install]
WantedBy=sockets.target
shared_dir = "/var/www/{{ param_app_name }}/shared"
FileUtils.mkdir_p("#{shared_dir}/tmp/pids")
workers 24
threads 1, 1
bind "unix://#{shared_dir}/tmp/sockets/puma-{{ param_app_name }}.sock"
environment "production"
stdout_redirect "#{shared_dir}/log/puma.stdout.log", "#{shared_dir}/log/puma.stderr.log", true
pidfile "#{shared_dir}/tmp/pids/puma.pid"
fork_worker
before_fork do
  ActiveRecord::Base.connection_pool.disconnect! if defined?(ActiveRecord)
end

on_worker_boot do
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord)
  if defined?(MessageBus)
    MessageBus.after_fork
  end
end

lowlevel_error_handler do |e|
  Honeybadger.notify(e) if defined?(Honeybadger)
  [500, {}, ["Error\n"]]
end
If any of the maintainers want to take a look at this live, just reach out and we can do a Jitsi / Hangouts session or something.
I have a rough idea of what’s going on (a ping timeout on worker 0 while it is stopped waiting for the other workers to upgrade), but I haven’t carved out time to put together a proper fix yet.
Since the issue is triggered by a combination of the `worker_timeout`/`worker_boot_timeout` 30-second default value (not sure which of these affects this issue) and a large number of workers (> 11, it seems), you can try increasing `worker_timeout`/`worker_boot_timeout` to a larger value (e.g., 300) as a workaround for now.

Or, at the very least, you could just switch off the `fork_worker` option entirely to avoid this issue without needing to downgrade to version 4 until a fix is ready.

Thanks for the detailed bug report @feliperaul, I was able to reproduce a similar issue locally and will try to sort out what’s going on.
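For reference, a minimal sketch of that workaround applied to a puma.rb like the one above (`worker_timeout` and `worker_boot_timeout` are standard Puma config options; 300 is just the example value from this comment, not a recommendation):

```ruby
# puma.rb (excerpt) - workaround sketch, not a fix:
# raise the timeouts well above the 30-second defaults so workers are not
# flagged as timed out while a large phased restart is still in progress.
fork_worker

worker_timeout 300
worker_boot_timeout 300
```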
While reproducing the issue, I saw a “timed out worker” line appear in the log in the middle of a phased restart (between Worker 11 and Worker 12 for a 24-worker cluster), which seems to be the initial trigger of the issue.

You mentioned that you didn’t notice anything unusual in your log file; just to be sure, can you look back in the log and see if you find any similar `Terminating timed out worker` line appearing in the middle of a phased restart? If your log has the same error, I should be close to tracking this down; if not, I might need to keep digging a bit more.

Let’s go with that 😃 We can always re-open if needed
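For anyone checking their own logs, a quick Ruby stand-in for that grep (the default log path is an assumption; pass your own as an argument):

```ruby
# Equivalent of `grep "Terminating timed out worker" puma.stdout.log`:
# prints any line showing that a worker was killed for timing out
# in the middle of a phased restart.
log_path = ARGV.fetch(0, "puma.stdout.log")

File.foreach(log_path) do |line|
  puts line if line.include?("Terminating timed out worker")
end
```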
@h0jeZvgoxFepBQ2C That would not be the case; I reproduced the bug after waiting for the first deployment to fully finish restarting all 24 workers before issuing the second deployment.
@wjordan I found them by grepping the file, and confirmed they appeared during my phased-restart tests with `fork_worker`.