puma: Child workers becoming zombies/unreachable after exception
Describe the bug
Every once in a while (approximately monthly), we experience downtime due to Puma freezing completely.
When I ssh into the server, I see all 25 workers are there, sitting idle with 0% CPU usage.
Investigating the logs, puma.stderr and puma.stdout give me nothing; but in production.log I can always trace back the last request served, and this is what I get in Honeybadger:
fatal: machine stack overflow in critical region
Backtrace: middleware_tracing.rb > 99 > call(...)

 97 |   events.notify(:before_call, env) if first_middleware
 98 |
 99 |   result = (target == self) ? traced_call(env) : target.call(env)
100 |
101 |   if first_middleware
It seems that this exception might be caused by WickedPDF (and remotipart together?); see https://github.com/mileszs/wicked_pdf/issues/891 and https://github.com/mileszs/wicked_pdf/issues/810.
However, even though it's a long shot, I'm opening this issue here in Puma because maybe something could be done by Puma to prevent the entire worker pool grinding to a halt.
A simple `sudo systemctl restart puma.service` puts us back online.
After the incident, I can never reproduce the problem manually, not even by hitting the same URL that triggered the "machine stack overflow in critical region".
Puma config:
Puma version 5.0.4
The systemd service unit:

[Unit]
Description=Puma {{ param_app_name }} application server
After=network.target
# Socket activation
Requires=puma-{{ param_app_name }}.socket

[Service]
Type=simple
User={{ param_user_name }}
WorkingDirectory=/var/www/{{ param_app_name }}/application
ExecStart=/home/{{ param_user_name }}/.rbenv/shims/bundle exec --keep-file-descriptors puma -C /var/www/{{ param_app_name }}/shared/config/puma.rb
Restart=always

[Install]
WantedBy=multi-user.target
The systemd socket unit:

[Unit]
Description=Puma HTTP Server Accept Sockets for {{ param_app_name }}

[Socket]
SocketUser={{ param_user_name }}
SocketGroup={{ param_user_name }}
ListenStream=/var/www/{{ param_app_name }}/shared/tmp/sockets/puma-{{ param_app_name }}.sock
# Socket options matching Puma defaults
NoDelay=true
ReusePort=true
Backlog=1024

[Install]
WantedBy=sockets.target
puma.rb:

shared_dir = "/var/www/{{ param_app_name }}/shared"
FileUtils.mkdir_p("#{shared_dir}/tmp/pids")

workers 24
threads 1, 1

bind "unix://#{shared_dir}/tmp/sockets/puma-{{ param_app_name }}.sock"
environment "production"
stdout_redirect "#{shared_dir}/log/puma.stdout.log", "#{shared_dir}/log/puma.stderr.log", true
pidfile "#{shared_dir}/tmp/pids/puma.pid"

before_fork do
  ActiveRecord::Base.connection_pool.disconnect! if defined?(ActiveRecord)
end

on_worker_boot do
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord)
  if defined?(MessageBus)
    MessageBus.after_fork
  end
end

lowlevel_error_handler do |e|
  Honeybadger.notify(e) if defined?(Honeybadger)
  [500, {}, ["Error\n"]]
end
To Reproduce
Unfortunately, it happens completely at random.
Desktop (please complete the following information):
Ubuntu 18.04 LTS, Puma version 5.0.4
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 2
- Comments: 27 (16 by maintainers)
Cool. Thanks.
Well, the child can't rescue its own segfault. It can rescue its own exceptions, but this type of thing can't be rescued. That worker is dead and must be killed and restarted.
I’ll have to look at how our child process heartbeat code works again. I don’t think it’s unreasonable for a master process to try to kill and restart an unresponsive child.
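For context, Puma's cluster mode does already ship a heartbeat: the master reaps any worker that has not checked in within `worker_timeout`. A config sketch with illustrative values (note that a worker which is wedged but still checking in, which seems to be what happens here, will not be caught by this alone):

```ruby
# puma.rb - illustrative values, not a recommendation.
# The master kills and respawns any worker that has not checked in
# within worker_timeout seconds; worker_check_interval (Puma 5+)
# controls how often the master runs that check.
worker_timeout 60
worker_check_interval 5
```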
I was on version 2.1.0.
@nateberkopec Looking again at the exception in Honeybadger, I had 24 occurrences of that exception yesterday, spread over 10 hours; I have exactly 24 workers, so this can't be a coincidence.
Analyzing the logs alongside, I can confirm that after the LAST exception we stopped processing all requests until I woke up this morning and restarted puma.
So what probably happened is that each of these exceptions halted that specific child worker and stopped it from processing requests, so capacity kept shrinking during the day, but we only noticed when the last child worker was hit.
I know Puma is not causing this, but it would be great if this exception/segfault could somehow be rescued instead of freezing the worker, or could at least trigger a signal to the master process to re-spawn that worker.
@calvinxiao Calvin, thanks, your solution is much simpler and doesn’t require spawning a new Thread for constant monitoring.
Today I checked my logs and my `PumaStaleWorkersMonitor` has saved us from downtime at least once. So, it's working! (A sketch of the general idea follows below.) Regardless, I will change to your proposed solution and report back with any news.
Thanks to all you guys for helping me with this. I think future Googlers will find this issue helpful as well. Cheers.
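The `PumaStaleWorkersMonitor` itself is not shown in the thread. Purely to illustrate the general idea (the class body, the thresholds, and the reliance on `Puma.stats` returning cluster stats as JSON are assumptions, not the reporter's code), such a check might look roughly like this:

```ruby
require "json"

# Hypothetical sketch, not the reporter's actual PumaStaleWorkersMonitor.
# Parses the JSON that Puma.stats returns in cluster mode and TERMs any
# booted worker whose thread pool has shrunk below min_threads, so the
# master boots a replacement.
class PumaStaleWorkersMonitor
  def initialize(min_threads: 1)
    @min_threads = min_threads
  end

  def check!(stats_json = Puma.stats)
    stats = JSON.parse(stats_json)

    Array(stats["worker_status"]).each do |worker|
      next unless worker["booted"]

      running = worker.dig("last_status", "running")
      next if running.nil? || running >= @min_threads

      # A request thread killed by a fatal error is never replaced inside
      # the worker, so kill the whole worker and let the master respawn it.
      Process.kill("TERM", worker["pid"])
    end
  end
end
```

Something like this would have to be polled periodically from the master process, which is presumably the "new Thread for constant monitoring" mentioned above.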
@feliperaul, I propose this temporary fix to your problem; see if it helps.
Seems not all `fatal` errors are recoverable; it's better to kill the worker in your case.
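As a sketch of that "kill the worker" idea (an illustration of the approach being discussed here, not necessarily the exact snippet that was posted):

```ruby
# puma.rb - illustrative only. After reporting, terminate the current
# worker on unrecoverable errors so the cluster master replaces it
# instead of leaving a wedged worker in the pool.
lowlevel_error_handler do |e|
  Honeybadger.notify(e) if defined?(Honeybadger)

  if e.is_a?(SystemStackError) || e.class.to_s == "fatal"
    # SIGTERM makes this worker exit; Puma's master then boots a new one.
    Process.kill("TERM", Process.pid)
  end

  [500, {}, ["Error\n"]]
end
```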
@calvinxiao Calvin, my mistake: I just reviewed my `puma.rb` config and we haven't been using `fork_worker` for well over a year. We use Ansible to template that `puma.rb` file, and the `fork_worker` line was inside an if clause that we currently have set to false, so it was my mistake to paste it here. I just edited the inaugural comment to remove the mention of `fork_worker` so it doesn't mislead anyone looking at this.
Regarding `preload_app!`, we can't use it because it's incompatible with phased restarts, which we now rely on heavily (a hot restart was stopping all our workers for 10+ seconds during the restart, with requests sitting there waiting due to socket activation; phased restart, on the other hand, is flawless and doesn't disturb our users in any way. Amazing feature).
I updated puma.rb to `rescue nil` the Honeybadger line, as you suggested (see the sketch after this comment); the Honeybadger gem doesn't need to be started or stopped, it just works. To be honest, we can even remove the `lowlevel_error_handler` from our puma config.
Now that the issue is reproducible, what do you all think about the idea of periodically checking whether the number of `running` threads is lower than the `min_threads` config and, if so, respawning that worker?
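The `rescue nil` change described above amounts to something like this (a sketch based on the description, not the exact diff):

```ruby
# puma.rb - a failure inside the error reporter itself should never
# raise again, hence the modifier rescue around the Honeybadger call.
lowlevel_error_handler do |e|
  (Honeybadger.notify(e) rescue nil) if defined?(Honeybadger)
  [500, {}, ["Error\n"]]
end
```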
Near the top of the first post, there is a Honeybadger log with a backtrace. `Server#lowlevel_error` calls the handler (Honeybadger) specified by `@options[:lowlevel_error_handler]`, then logs the backtrace. But with that error, `backtrace` is `nil`, so Puma was throwing an error. I think.
@nateberkopec `JSON.dump` and a cyclic hash can cause `fatal: machine stack overflow in critical region`.
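A minimal way to reproduce that class of failure, following the `JSON.dump` plus cyclic hash description above (the exact outcome depends on the Ruby and json gem versions):

```ruby
require "json"

# A self-referencing (cyclic) hash. JSON.dump disables the nesting limit,
# so generation recurses until the machine stack is exhausted.
h = {}
h[:self] = h

begin
  JSON.dump(h)
rescue SystemStackError => e
  puts "#{e.class}: #{e.message}"
end
# Depending on the Ruby/json version this surfaces as SystemStackError or
# as the unrescuable "fatal: machine stack overflow in critical region".
```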
Being able to reproduce the exception, I found that when `e.class == "fatal"` there is no `#backtrace`, and in `lowlevel_error` it assumes `e.backtrace` is an array. Here is the log when `fatal: machine stack overflow in critical region` happens. I think the fix is to check whether `e.backtrace` is nil.
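In other words, the suggested guard is just a nil check before the backtrace is formatted; schematically (this illustrates the idea, it is not Puma's actual code):

```ruby
# Sketch of the suggested guard, not Puma's implementation: "fatal" errors
# such as the machine stack overflow return nil from #backtrace, so treat
# a missing backtrace as empty before joining it into the log message.
def format_lowlevel_error(e)
  trace = e.backtrace || []
  ([e.inspect] + trace).join("\n")
end
```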