puma: Puma can hang on IO.select

Steps to reproduce

  1. Create Rails app

  2. Deploy it following this tutorial: https://www.digitalocean.com/community/tutorials/deploying-a-rails-app-on-ubuntu-14-04-with-capistrano-nginx-and-puma

Expected behavior

Works nicely and smoothly

Actual behavior

The server works well for a few hours/days, then it suddenly “stops”. htop shows that Puma is still working, but Nginx gives this error:

2018/08/04 09:55:00 [error] 889#889: *1241 connect() to unix:///home/app/shared/tmp/sockets/app-puma.sock failed (111: Connection refused) while connecting to upstream, client: 197.14.151.135, server: server.com, request: "GET /my_route HTTP/1.1", upstream: "http://unix:///home/app/shared/tmp/sockets/app-puma.sock:/500.html", host: "server.com"

When running ls /home/app/shared/tmp/sockets/, app-puma.sock is still present.

And there is only this in puma.error.log:

=== puma startup: 2018-07-19 10:13:17 +0000 ===
I, [2018-07-19T10:13:19.790942 #32107]  INFO -- sentry: ** [Raven] Raven 2.7.3 ready to catch errors
* Listening on unix:///home/app/shared/tmp/sockets/app-puma.sock
* Restarting...
Refreshing Gemfile
Puma starting in single mode...
* Version 3.11.3 (ruby 2.4.4-p296), codename: Love Song
* Min threads: 4, max threads: 16
* Environment: production
* Daemonizing...

To fix it, I can’t even use cap production deploy:restart. I have to log in to the server, run kill -9 <puma_pid>, and then restart the app from Capistrano.

Honestly, I have no idea where this is coming from. I tried it on different servers and hit the same problem.

I first thought my server specs (2 GB memory | 1 vCPU | 50 GB SSD) were too low, but that makes no sense since the app runs fine locally on my Raspberry Pi.

System configuration

Ruby version: ruby 2.4.4p296 (2018-03-28 revision 63013) [x86_64-linux]
Rails version: 5.1.6
Puma version: 3.12.0

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 7
  • Comments: 18 (8 by maintainers)

Most upvoted comments

It could be that you’re hitting the max-open-files limit, if your service opens too many connections to upstream services or keeps too many files open. In that case Puma will not raise any error but will silently reject/time out connections. Hence the intermittent behaviour, where the problem only starts to appear after a certain number of requests.

To test that, you can run your service with a single Puma worker, check open files with lsof, and try increasing ulimit.
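You can also do the same check from inside the Ruby process itself. A minimal sketch (the /proc path is Linux-specific and an assumption here; Process.getrlimit/setrlimit are standard Ruby):

```ruby
# Sketch: compare this process's open file descriptors against its NOFILE limit.
soft, hard = Process.getrlimit(:NOFILE)

fd_dir = "/proc/self/fd" # Linux-specific; each entry is one open descriptor
open_fds = Dir.exist?(fd_dir) ? Dir.children(fd_dir).size : nil

puts "open fds: #{open_fds.inspect}, soft limit: #{soft}, hard limit: #{hard}"

# Raise the soft limit up to the hard limit
# (the in-process equivalent of running `ulimit -n` in the shell):
Process.setrlimit(:NOFILE, hard, hard)
```

If the open-descriptor count creeps toward the soft limit as traffic comes in, leaking connections are a likely culprit.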

At the very least, there are two unrelated symptoms here, because the original issue was reported a year ago, before Puma used NIO. So I can’t comment much more than that without some more details.

Regarding the select-hang hypothesis: can you use strace on Linux to capture a log of the system calls?

I seem to have been able to dig down to the cause of the hang.

TL;DR: The hang comes from Kernel#select in the NIO (nio4r) gem. The TCPSocket is not ready for some reason; hence the timeout in Kernel#select.
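The behaviour described is easy to see in plain Ruby: when none of the monitored sockets become readable, IO.select simply blocks until its timeout and returns nil. A minimal sketch (the port and timeouts are arbitrary, not taken from Puma):

```ruby
require "socket"

server = TCPServer.new("127.0.0.1", 0) # listen on an ephemeral port

# No client has connected yet, so the listening socket is not readable:
ready_before = IO.select([server], nil, nil, 0.2)
p ready_before # => nil: select timed out with nothing ready

# Once a connection is pending, the same call returns immediately:
client = TCPSocket.new("127.0.0.1", server.addr[1])
ready_after = IO.select([server], nil, nil, 0.2)
p ready_after # => [[server], [], []]: the server socket is now readable
```

If sockets that should have pending data never show up as ready, the problem is below Ruby (kernel, firewall, or machine settings), which matches the diagnosis above.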

Context:

On my machine (macOS 10.14.6), Puma hangs after receiving a certain number of requests. The number is random, usually fewer than 200, around ~150–170. My only fix is to restart the host; then Puma becomes healthy again. I run Puma in single mode (1 worker), with min/max 5 threads.

At first I thought it was something inside Puma, but I pried everywhere in Puma’s server/events/etc. 😂 In my case, I found that Puma hangs at selector.select in Reactor#run_internal, which uses NIO::Selector internally.

At this point it seems like this stop-the-world bug has nothing to do with Puma, but I think I should post here first in case someone knows how it is possible for IO.select on a TCPServer socket to hang.

Continuing:

I’ve been messing around with NIO until I found a way to switch from the libev backend to the pure-Ruby one. Finally, I found the line that causes the hang: Kernel#select. It stops because none of the reader IOs are ready, for some reason.

My current knowledge ends here. If anyone can point me in the right direction to debug this, I would really appreciate it.

PS: I think the cause may be my machine’s settings, not Puma or even Kernel#select, but I would like to know why.

Closing, not enough to go on here. If someone can produce a case that reliably hangs Puma at a certain spot, we can reopen.

I didn’t find a way to fix it. I migrated my application from Azure to AWS, switched from Puma to Passenger, and now everything works fine. No jobs stop working anymore.

If you don’t have the time/energy/patience to deal with this, just use Docker and add a simple curl healthcheck, so Docker can restart it for you.
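For context, such a healthcheck might look like this in a Dockerfile (the port and path are assumptions about your app, not from this thread; note that plain Docker only marks the container unhealthy, so you also need a restart policy, an autoheal companion, or an orchestrator to actually restart it):

```dockerfile
# Hypothetical healthcheck for a Rails app listening on port 3000.
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD curl -fsS http://localhost:3000/ || exit 1
```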

(I am not a maintainer or contributor of this project, just some guy passing by, trying to help.)

I recently learned Docker and adopted it into my workflow. For me (1 person & 1 production machine) it works perfectly and saves me a lot of time on deployment & environment maintenance.

When a process (not just Puma, but any process) freezes and can’t be stopped without kill -9, what I personally do is:

  • sudo gdb --pid=<PID of defunct process>
  • thread apply all bt

…That will show you what each thread is doing. If you can’t make sense of the output, copy the whole thing (each page) and share it here. Usually you will find that some thread is stuck trying to do an I/O operation which never completes, and the name of the C function which it is in will give you a clue what kind of I/O it is (perhaps trying to talk to another process).

You need to have GDB installed on your server. Also, your Ruby interpreter must be built with debugging symbols (but I have always found that to be the case).