puma: Puma becomes unresponsive a while after deployment.
We are currently facing an issue where we deploy our application to a 2-server cluster and, a while later (about a week or so), both servers end up unresponsive. We have New Relic monitoring set up on the servers, and it shows a spike in Puma threads right when the servers crash (images below). My interpretation is that all the available threads became unresponsive, more threads were spawned as a result, and those new threads then ended up in the same state as the previously spawned ones, leaving the entire application unresponsive. Here is my puma.rb file:
threads 0, 4
workers 4

bind "unix:///tmp/houston.puma.sock"
environment "production"
pidfile "/opt/houston/current/tmp/pids/puma.pid"

# Expose the app control server on port 8081 so we can send hot restart commands
activate_control_app 'tcp://0.0.0.0:8081', { no_token: true }

directory "/opt/houston/current"
prune_bundler

# Log puma output to files
stdout_redirect '/opt/houston/current/log/puma.stdout.log',
                '/opt/houston/current/log/puma.stderr.log', true

# Read VERSION file for git SHA
@git_sha = File.exists?('/opt/houston/current/VERSION') &&
           File.read('/opt/houston/current/VERSION')[0, 8]

# Tag master process with git SHA
tag @git_sha

# Log worker boot with git SHA
on_worker_boot do
  puts "[#{Process.pid}] Booting puma worker: git rev #{@git_sha}" if @git_sha
end
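For context, triggering the hot restart mentioned in the config looks roughly like this (a minimal sketch, assuming the control server is reachable on port 8081 as configured; /stats and /restart are endpoints of Puma's built-in control app, and pumactl can talk to the same server):

# Minimal sketch: query and restart the master process via the control app
require "net/http"

# Ask the master process for a summary of workers and their running threads
puts Net::HTTP.get(URI("http://127.0.0.1:8081/stats"))

# Ask the master process to perform a hot restart
Net::HTTP.get(URI("http://127.0.0.1:8081/restart"))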
When the application goes down, nginx shows:
2016/04/12 10:48:40 [error] 24457#0: *13034501 connect() to unix:/tmp/houston.puma.sock failed (111: Connection refused) while connecting to upstream, client: XX.XXX.XXX.XX, server: houston.com, request: "GET /healthcheck HTTP/1.0", upstream: "http://unix:/tmp/houston.puma.sock:/healthcheck"
In the Puma logs we see a bunch of healthcheck requests (which literally just render the text "CHECK") without any response being logged:
[screenshot: Puma log showing incoming healthcheck requests with no completed responses]
As you can see above, all the requests are hanging… This keeps on happening until we kill the puma pid on both servers.
A couple of things to note:

- Our application relies on a couple of external services, each with a timeout set on it. The timeout is actually enforced in the application's application_controller as an around filter that stops processing once a request takes too long to complete (a sketch of this is shown below).
- We also run healthcheck pings (3-4 per second) against each server to make sure the application is alive.
- So far we haven't been able to find a clear pattern pointing to a specific request that might be triggering Puma to become unresponsive.
- Puma version: 2.14.0
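The around filter looks roughly like this (a minimal sketch; the method name and the 15-second limit are illustrative, not our exact code):

require "timeout"

class ApplicationController < ActionController::Base
  around_filter :enforce_request_timeout

  private

  # Abort any request that exceeds the allowed time. Note that Timeout.timeout
  # raises inside whatever the request thread happens to be executing at that moment.
  def enforce_request_timeout
    Timeout.timeout(15) { yield }
  end
end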
[New Relic graphs: Puma thread count and memory usage climbing in the lead-up to each crash]
The general drop in the graphs above is simply us killing the PID and restarting the main Puma process. Is there a known issue with Puma that could trigger this? I couldn't find anything in our source code pointing to a memory leak, although memory does seem to climb steadily. When we attempt to reach the servers from outside, all we get is a 503.
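In case it helps with debugging: one generic way to see where hung threads are stuck (not something from our codebase, and it assumes the chosen signal isn't already handled by Puma) is to install a signal handler in an initializer that dumps every thread's backtrace:

# Hypothetical config/initializers/thread_dump.rb: send `kill -TTIN <worker pid>`
# to print a backtrace for every live thread in that worker.
Signal.trap("TTIN") do
  Thread.list.each do |thread|
    $stdout.puts "Thread #{thread.object_id} (#{thread.status.inspect}):"
    $stdout.puts((thread.backtrace || ["<no backtrace>"]).join("\n"))
    $stdout.puts
  end
end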
About this issue
- State: closed
- Created 8 years ago
- Comments: 31 (7 by maintainers)
Have the same issue with Puma. It is running on 3 servers, and after some time Puma starts to consume 100% of a CPU and becomes unresponsive as well. The only thing that helps is killing the PID and restarting. Memory consumption doesn't grow, New Relic doesn't show anything suspicious, and the Puma logs are clean as well.