puma: Child workers becoming zombies/unreachable after exception
Describe the bug
Every once in a while (approximately monthly), we experience downtime due to Puma freezing completely.
When I ssh into the server, I see all 25 workers are there, sitting idle with 0% CPU usage.
Investigating the logs, puma.stderr and puma.stdout give me nothing; but in production.log I can always trace back the last request served, and this is what I get in Honeybadger:
fatal: machine stack overflow in critical region
Backtrace: middleware_tracing.rb > 99 > call(...)

 97 |   events.notify(:before_call, env) if first_middleware
 98 |
 99 |   result = (target == self) ? traced_call(env) : target.call(env)
100 |
101 |   if first_middleware
It seems that this exception might be caused by WickedPDF (and remotipart together?); see https://github.com/mileszs/wicked_pdf/issues/891 and https://github.com/mileszs/wicked_pdf/issues/810.
However, even though it's a long shot, I'm opening this issue here in Puma because maybe something could be done by Puma to prevent the entire worker pool grinding to a halt.
A simple `sudo systemctl restart puma.service` puts us back online.
After the incident, I can never reproduce the problem manually, not even by hitting the same URL that triggered the "machine stack overflow in critical region".
Puma config:
Puma version 5.0.4
The systemd service unit:

[Unit]
Description=Puma {{ param_app_name }} application server
After=network.target
# Socket activation
Requires=puma-{{ param_app_name }}.socket

[Service]
Type=simple
User={{ param_user_name }}
WorkingDirectory=/var/www/{{ param_app_name }}/application
ExecStart=/home/{{ param_user_name }}/.rbenv/shims/bundle exec --keep-file-descriptors puma -C /var/www/{{ param_app_name }}/shared/config/puma.rb
Restart=always

[Install]
WantedBy=multi-user.target
The systemd socket unit:

[Unit]
Description=Puma HTTP Server Accept Sockets for {{ param_app_name }}

[Socket]
SocketUser={{ param_user_name }}
SocketGroup={{ param_user_name }}
ListenStream=/var/www/{{ param_app_name }}/shared/tmp/sockets/puma-{{ param_app_name }}.sock
# Socket options matching Puma defaults
NoDelay=true
ReusePort=true
Backlog=1024

[Install]
WantedBy=sockets.target
puma.rb:

shared_dir = "/var/www/{{ param_app_name }}/shared"
FileUtils.mkdir_p("#{shared_dir}/tmp/pids")

workers 24
threads 1, 1

bind "unix://#{shared_dir}/tmp/sockets/puma-{{ param_app_name }}.sock"
environment "production"
stdout_redirect "#{shared_dir}/log/puma.stdout.log", "#{shared_dir}/log/puma.stderr.log", true
pidfile "#{shared_dir}/tmp/pids/puma.pid"

before_fork do
  ActiveRecord::Base.connection_pool.disconnect! if defined?(ActiveRecord)
end

on_worker_boot do
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord)
  if defined?(MessageBus)
    MessageBus.after_fork
  end
end

lowlevel_error_handler do |e|
  Honeybadger.notify(e) if defined?(Honeybadger)
  [500, {}, ["Error\n"]]
end
To Reproduce
Unfortunately, it happens completely at random.
Desktop (please complete the following information):
Ubuntu 18.04 LTS, Puma version 5.0.4
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 2
- Comments: 27 (16 by maintainers)
Cool. Thanks.
Well, the child can't rescue its own segfault. It can rescue its own exceptions, but this type of thing can't be rescued. That worker is dead and must be killed and restarted.
I’ll have to look at how our child process heartbeat code works again. I don’t think it’s unreasonable for a master process to try to kill and restart an unresponsive child.
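For context, Puma's cluster mode does already ship a heartbeat: the master reaps any worker that has not checked in within `worker_timeout`. A config sketch with illustrative values (note that a worker which is wedged but still checking in, which seems to be what happens here, will not be caught by this alone):

```ruby
# puma.rb - illustrative values, not a recommendation.
# The master kills and respawns any worker that has not checked in
# within worker_timeout seconds; worker_check_interval (Puma 5+)
# controls how often the master runs that check.
worker_timeout 60
worker_check_interval 5
```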
I was on version 2.1.0.
@nateberkopec Looking again at the exception in Honeybadger, I had 24 occurrences of that exception yesterday, spread over 10 hours; I have exactly 24 workers, so this can't be a coincidence.
Analyzing the logs alongside, I can confirm that after the LAST exception we stopped processing all requests until I woke up this morning and restarted puma.
So what probably happened is that each of these exceptions halted that specific child worker and stopped it from processing requests, so capacity kept shrinking during the day, but we only noticed when the last child worker was hit.
I know Puma is not causing this, but it would be great if this exception/segfault could somehow be rescued instead of freezing the worker, or could at least trigger a signal to the master process to re-spawn that worker.
@calvinxiao Calvin, thanks, your solution is much simpler and doesn’t require spawning a new Thread for constant monitoring.
Today I checked my logs and my `PumaStaleWorkersMonitor` has saved us from downtime at least once. So, it's working! (A sketch of the general idea follows below.) Regardless, I will change to your proposed solution and report back with any news.
Thanks to all you guys for helping me with this. I think future Googlers will find this issue helpful as well. Cheers.
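The `PumaStaleWorkersMonitor` itself is not shown in the thread. Purely to illustrate the general idea (the class body, the thresholds, and the reliance on `Puma.stats` returning cluster stats as JSON are assumptions, not the reporter's code), such a check might look roughly like this:

```ruby
require "json"

# Hypothetical sketch, not the reporter's actual PumaStaleWorkersMonitor.
# Parses the JSON that Puma.stats returns in cluster mode and TERMs any
# booted worker whose thread pool has shrunk below min_threads, so the
# master boots a replacement.
class PumaStaleWorkersMonitor
  def initialize(min_threads: 1)
    @min_threads = min_threads
  end

  def check!(stats_json = Puma.stats)
    stats = JSON.parse(stats_json)

    Array(stats["worker_status"]).each do |worker|
      next unless worker["booted"]

      running = worker.dig("last_status", "running")
      next if running.nil? || running >= @min_threads

      # A request thread killed by a fatal error is never replaced inside
      # the worker, so kill the whole worker and let the master respawn it.
      Process.kill("TERM", worker["pid"])
    end
  end
end
```

Something like this would have to be polled periodically from the master process, which is presumably the "new Thread for constant monitoring" mentioned above.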
@feliperaul, I propose this temporary fix to your problem; see if it helps.
Seems not all `fatal` errors are recoverable; it's better to kill the worker in your case.
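As a sketch of that "kill the worker" idea (an illustration of the approach being discussed here, not necessarily the exact snippet that was posted):

```ruby
# puma.rb - illustrative only. After reporting, terminate the current
# worker on unrecoverable errors so the cluster master replaces it
# instead of leaving a wedged worker in the pool.
lowlevel_error_handler do |e|
  Honeybadger.notify(e) if defined?(Honeybadger)

  if e.is_a?(SystemStackError) || e.class.to_s == "fatal"
    # SIGTERM makes this worker exit; Puma's master then boots a new one.
    Process.kill("TERM", Process.pid)
  end

  [500, {}, ["Error\n"]]
end
```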
@calvinxiao Calvin, my mistake: I just reviewed my `puma.rb` config and we haven't been using `fork_worker` for well over a year. We use Ansible to template that `puma.rb` file, and the `fork_worker` line was inside an if clause that we currently have set to false, so it was my mistake to paste it here. I just edited the inaugural comment to remove the mention of `fork_worker` so it doesn't mislead anyone looking at this.
Regarding `preload_app!`, we can't use it because it's incompatible with phased restarts, which we now rely on heavily (a hot restart was stopping all our workers for 10+ seconds during the restart, with requests sitting there waiting due to socket activation; phased restart, on the other hand, is flawless and doesn't disturb our users in any way. Amazing feature).
I updated puma.rb to `rescue nil` the Honeybadger line, as you suggested (see the sketch after this comment); the Honeybadger gem doesn't need to be started or stopped, it just works. To be honest, we can even remove the `lowlevel_error_handler` from our puma config.
Now that the issue is reproducible, what do you all think about the idea of periodically checking whether the number of `running` threads is lower than the `min_threads` config and, if so, respawning that worker?
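The `rescue nil` change described above amounts to something like this (a sketch based on the description, not the exact diff):

```ruby
# puma.rb - a failure inside the error reporter itself should never
# raise again, hence the modifier rescue around the Honeybadger call.
lowlevel_error_handler do |e|
  (Honeybadger.notify(e) rescue nil) if defined?(Honeybadger)
  [500, {}, ["Error\n"]]
end
```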
Near the top of the first post, there is a Honeybadger log with a backtrace. `Server#lowlevel_error` calls the handler (Honeybadger) specified by `@options[:lowlevel_error_handler]`, then logs the backtrace. But with that error, `backtrace` is `nil`, so Puma was throwing an error. I think.
@nateberkopec `JSON.dump` and a cyclic hash can cause `fatal: machine stack overflow in critical region`.
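A minimal way to reproduce that class of failure, following the `JSON.dump` plus cyclic hash description above (the exact outcome depends on the Ruby and json gem versions):

```ruby
require "json"

# A self-referencing (cyclic) hash. JSON.dump disables the nesting limit,
# so generation recurses until the machine stack is exhausted.
h = {}
h[:self] = h

begin
  JSON.dump(h)
rescue SystemStackError => e
  puts "#{e.class}: #{e.message}"
end
# Depending on the Ruby/json version this surfaces as SystemStackError or
# as the unrescuable "fatal: machine stack overflow in critical region".
```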
Being able to reproduce the exception, I found that when `e.class == "fatal"` there is no `#backtrace`, and in `lowlevel_error` it assumes `e.backtrace` is an array. Here is the log when `fatal: machine stack overflow in critical region` happens. I think the fix is to check whether `e.backtrace` is nil.
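In other words, the suggested guard is just a nil check before the backtrace is formatted; schematically (this illustrates the idea, it is not Puma's actual code):

```ruby
# Sketch of the suggested guard, not Puma's implementation: "fatal" errors
# such as the machine stack overflow return nil from #backtrace, so treat
# a missing backtrace as empty before joining it into the log message.
def format_lowlevel_error(e)
  trace = e.backtrace || []
  ([e.inspect] + trace).join("\n")
end
```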