puma: `fork_worker` keeps serving stale code indefinitely and at random after a phased restart (especially after the second phased restart)

Describe the bug

tl;dr: Puma 5.0.3 and 5.0.4 have a bug where, when fork_worker is present in the configuration file, a phased restart (kill -USR1 $(cat puma.pid)) leaves about half of the workers serving stale code (code from before the restart) indefinitely, even long after the log file shows that the last worker finished restarting.


This one took 2+ hours of debugging, but I’m now 100% certain of what triggers the bug and it’s reproducible every single time.

After upgrading today from puma 4 to 5.0.3 and later 5.0.4, we started seeing stale code in production after deployments, seemingly at random: one page refresh served new code, the next two served old code, and so on.

To confirm the bug, we used a constant defined in our Rails app, LAST_DEPLOYED_VERSION, which uses a simple File.read to get a version number from a text file that our deploy script automatically bumps on every deploy.
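The constant is defined roughly like this (the initializer path and version file name below are illustrative, not the exact ones from our app):

# config/initializers/version.rb -- path and file name are illustrative
LAST_DEPLOYED_VERSION = File.read(Rails.root.join("VERSION")).strip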

We added that string to the response headers and started monitoring with watch -n1 curl -I http://ourdomain.com
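The header is set in a controller hook along these lines (a minimal sketch; only the tl_app_version header name is from our app):

# app/controllers/application_controller.rb -- sketch; only the header name is real
class ApplicationController < ActionController::Base
  after_action :add_version_header

  private

  def add_version_header
    response.headers["tl_app_version"] = LAST_DEPLOYED_VERSION
  end
end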

Here’s what happens:

  • If we issue a puma restart using systemctl restart puma.service, the headers get updated immediately on all new requests;

  • If we don’t have fork_worker in our config.rb file and issue a kill -USR1 $(cat puma.pid) (phased restart), while monitoring the log with tail -f puma.stdout.log we can watch the old string get sparser as each worker restarts, disappearing completely once the last worker finishes restarting.

  • However, if we do have fork_worker in our config.rb file and issue a kill -USR1 $(cat puma.pid) (phased restart), while monitoring the log with tail -f puma.stdout.log we observe that the old version string appears much more frequently, AND it still appears in 50~60% of requests even after the log indicates that all workers have been restarted. This happens especially after the second deployment following the puma service restart, and it persists forever: (1) you can wait for minutes and minutes after the last worker has finished restarting, and the old string (stale code) is still served on around 50% of requests, and (2) if you bump the version file and issue a new kill -USR1 $(cat puma.pid), the same behavior repeats, now with the previous string stuck in 50% of requests, and so on.

The stdout.log from a phased restart with fork_worker shows nothing unusual: it reports that all 24 of our workers have been restarted, yet the stale code keeps being served indefinitely:

[8705] + Changing to /var/www/application
[8705] - Stopping 15627 for phased upgrade...
[8705] - TERM sent to 15627...
[8705] - Worker 0 (pid: 30539) booted, phase: 2
[8705] - Stopping 21114 for phased upgrade...
[8705] - TERM sent to 21114...
[8705] - Worker 1 (pid: 2920) booted, phase: 2
[8705] - Stopping 25085 for phased upgrade...
[8705] - TERM sent to 25085...
[8705] - Worker 2 (pid: 5806) booted, phase: 2
[8705] - Stopping 29974 for phased upgrade...
[8705] - TERM sent to 29974...
[8705] - Worker 3 (pid: 9764) booted, phase: 2
[8705] - Stopping 2254 for phased upgrade...
[8705] - TERM sent to 2254...
[8705] - Worker 4 (pid: 14295) booted, phase: 2
[8705] - Stopping 7364 for phased upgrade...
[8705] - TERM sent to 7364...
[8705] - Worker 5 (pid: 19209) booted, phase: 2
[8705] - Stopping 11931 for phased upgrade...
[8705] - TERM sent to 11931...
[8705] - Worker 6 (pid: 23874) booted, phase: 2
[8705] - Stopping 15019 for phased upgrade...
[8705] - TERM sent to 15019...
[8705] - Worker 7 (pid: 27936) booted, phase: 2
[8705] - Stopping 20185 for phased upgrade...
[8705] - TERM sent to 20185...
[8705] - Worker 8 (pid: 32674) booted, phase: 2
[8705] - Stopping 24566 for phased upgrade...
[8705] - TERM sent to 24566...
[8705] - Worker 9 (pid: 5029) booted, phase: 2
[8705] - Stopping 29816 for phased upgrade...
[8705] - TERM sent to 29816...
[8705] - Worker 10 (pid: 8932) booted, phase: 2
[8705] - Stopping 2251 for phased upgrade...
[8705] - TERM sent to 2251...
[8705] - Worker 11 (pid: 13000) booted, phase: 2
[8705] - Stopping 5235 for phased upgrade...
[8705] - TERM sent to 5235...
[8705] - Worker 12 (pid: 16063) booted, phase: 2
[8705] - Stopping 10033 for phased upgrade...
[8705] - TERM sent to 10033...
[8705] - Worker 13 (pid: 18435) booted, phase: 2
[8705] - Stopping 12952 for phased upgrade...
[8705] - TERM sent to 12952...
[8705] - Worker 14 (pid: 20156) booted, phase: 2
[8705] - Stopping 18155 for phased upgrade...
[8705] - TERM sent to 18155...
[8705] - Worker 15 (pid: 21735) booted, phase: 2
[8705] - Stopping 22855 for phased upgrade...
[8705] - TERM sent to 22855...
[8705] - Worker 16 (pid: 23203) booted, phase: 2
[8705] - Stopping 27509 for phased upgrade...
[8705] - TERM sent to 27509...
[8705] - Worker 17 (pid: 24777) booted, phase: 2
[8705] - Stopping 32049 for phased upgrade...
[8705] - TERM sent to 32049...
[8705] - Worker 18 (pid: 26341) booted, phase: 2
[8705] - Stopping 1937 for phased upgrade...
[8705] - TERM sent to 1937...
[8705] - Worker 19 (pid: 27696) booted, phase: 2
[8705] - Stopping 5904 for phased upgrade...
[8705] - TERM sent to 5904...
[8705] - Worker 20 (pid: 29269) booted, phase: 2
[8705] - Stopping 10939 for phased upgrade...
[8705] - TERM sent to 10939...
[8705] - Worker 21 (pid: 31071) booted, phase: 2
[8705] - Stopping 15926 for phased upgrade...
[8705] - TERM sent to 15926...
[8705] - Worker 22 (pid: 32708) booted, phase: 2
[8705] - Stopping 20589 for phased upgrade...
[8705] - TERM sent to 20589...
[8705] - Worker 23 (pid: 1940) booted, phase: 2

Here’s a quick video of the version string (tl_app_version) more than 10 minutes after we observed the last worker restarting in the logs; notice how prevalent the old string (in this case, 7.397) is compared to the current version’s string (7.398):

https://www.loom.com/share/7195782bde3843738d26696f8a2af89f

As soon as we remove fork_worker from config.rb and issue a systemctl restart puma, we can issue as many deployments as we wish using kill -USR1 $(cat puma.pid), and we never, not even once, observed an old version string being served on any request issued after the last worker rebooted.

The server runs Ubuntu 18.04 LTS, and the bug was confirmed on both 5.0.3 and 5.0.4. Here are our systemd service, systemd socket activation file, and puma.rb config:

# puma.service
[Unit]
Description=Puma {{ param_app_name }} application server
After=network.target

# Socket activation
Requires=puma-{{ param_app_name}}.socket

[Service]
Type=simple

User={{ param_user_name }}

WorkingDirectory=/var/www/{{ param_app_name }}/application

ExecStart=/home/{{ param_user_name }}/.rbenv/shims/bundle exec --keep-file-descriptors puma -C /var/www/{{ param_app_name }}/shared/config/puma.rb

Restart=always

[Install]
WantedBy=multi-user.target

# puma.socket
[Unit]
Description=Puma HTTP Server Accept Sockets for {{ param_app_name }}

[Socket]
SocketUser={{ param_user_name }}
SocketGroup={{ param_user_name }}
ListenStream=/var/www/{{ param_app_name }}/shared/tmp/sockets/puma-{{ param_app_name }}.sock

# Socket options matching Puma defaults
NoDelay=true
ReusePort=true
Backlog=1024

[Install]
WantedBy=sockets.target

# puma.rb
shared_dir = "/var/www/{{ param_app_name }}/shared"

FileUtils.mkdir_p("#{shared_dir}/tmp/pids")

workers 24
threads 1, 1

bind "unix://#{shared_dir}/tmp/sockets/puma-{{ param_app_name }}.sock"

environment "production"

stdout_redirect "#{shared_dir}/log/puma.stdout.log", "#{shared_dir}/log/puma.stderr.log", true

pidfile "#{shared_dir}/tmp/pids/puma.pid"

fork_worker

before_fork do
  ActiveRecord::Base.connection_pool.disconnect! if defined?(ActiveRecord)
end

on_worker_boot do
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord)

  if defined?(MessageBus)
    MessageBus.after_fork
  end
end

lowlevel_error_handler do |e|
  Honeybadger.notify(e) if defined?(Honeybadger)
  [500, {}, ["Error\n"]]
end

If any of the maintainers want to take a look at this live, just reach out and we can do a Jitsi / Hangouts session or something.

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 17 (10 by maintainers)

Most upvoted comments

Is there any solution for this? We have downgraded to puma 4 again due to this issue…

I have a rough idea of what’s going on (a ping timeout on worker 0 while it’s stopped waiting for the other workers to upgrade), but I haven’t carved out time to put together a proper fix yet.

Since the issue is triggered by a combination of the 30-second default value of worker_timeout / worker_boot_timeout (not sure which of the two affects this issue) and a large number of workers (> 11, it seems), you can try increasing worker_timeout / worker_boot_timeout to a larger value (e.g., 300) as a workaround for now.
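For example, in puma.rb (a sketch of the workaround above, setting both directives since it’s unclear which one is the trigger):

# puma.rb -- raise both timeouts well above the 30-second default
worker_timeout 300
worker_boot_timeout 300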

Or, at the very least, you could switch off the fork_worker option entirely to avoid this issue without needing to downgrade to version 4 until a fix is ready.

Thanks for the detailed bug report @feliperaul, I was able to reproduce a similar issue locally and will try to sort out what’s going on.

While reproducing the issue, I saw a “timed out worker” line appear in the log in the middle of a phased restart (between Worker 11 and Worker 12 for a 24-worker cluster), which seems to be the initial trigger of the issue:

[672227] - Stopping 672327 for phased upgrade...
[672227] - TERM sent to 672327...
[672227] - Worker 11 (pid: 675868) booted, phase: 1
[672227] - Stopping 672337 for phased upgrade...
[672227] - TERM sent to 672337...
[672227] ! Terminating timed out worker: 674236
[672227] - Worker 0 (pid: 676012) booted, phase: 1
[672227] - Worker 12 (pid: 676015) booted, phase: 1

You mentioned that you didn’t notice anything unusual in your log file; just to be sure, can you look back in the log and see if any similar Terminating timed out worker line appears in the middle of a phased restart? If your log has the same error, I should be close to tracking this down; if not, I might need to keep digging a bit more.

Let’s go with that 😃 We can always re-open if needed

@h0jeZvgoxFepBQ2C That would not be the case; I reproduced the bug after waiting for the first deployment to fully finish restarting all 24 workers before issuing the second deployment.

@wjordan I found these by grepping the file, and confirmed they appeared during my phased-restart tests with fork_worker:

[2858] ! Terminating timed out worker: 9948
[2171] ! Terminating timed out worker: 15202
[2154] ! Terminating timed out worker: 21353
[32148] ! Terminating timed out worker: 25794