puma: Requests fail during phased restarts, hot restarts, and shutdown
Describe the bug
During a phased restart, some clients do not get successful responses from the server.
Requests fail in a variety of ways. The ones I’ve observed and that I’m able to reproduce reliably are these:
- The server closes the connection without first writing a response onto the socket:
    curl: (52) Empty reply from server
    curl: (56) Recv failure: Connection reset by peer
- The exception “Unable to add work while shutting down” is raised, invoking the lowlevel_error_handler (by default, this returns a response with status code 500; a sketch of such a handler follows below).
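For reference, that handler can be replaced in the Puma config file. A minimal sketch (the log message and response body are illustrative, not taken from the repro) that makes these failures easier to spot in logs:

    # config/puma.rb (path illustrative)
    lowlevel_error_handler do |err|
      # err is the exception that escaped normal request handling,
      # e.g. "Unable to add work while shutting down" during a restart
      STDERR.puts "lowlevel error: #{err.class}: #{err.message}"
      [500, {}, ["Internal Server Error\n"]]   # plain Rack response triple
    end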
Puma config:
    port 3000
    workers 2
Start command:
    PUMA_MASTER_PIDFILE=/usr/src/app/puma.pid
    bundle exec puma --pidfile "${PUMA_MASTER_PIDFILE}"
To Reproduce
I’ve created a repo that reliably reproduces the errors: https://github.com/cjlarose/puma-phased-restart-errors
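The repo is the authoritative reproduction; the rough idea (this sketch is not the repo’s script, and the request count is arbitrary) is to keep firing requests while asking the Puma master for a phased restart:

    # sketch only -- not the script from the linked repo
    require "net/http"

    master_pid = File.read("/usr/src/app/puma.pid").to_i
    Process.kill("USR1", master_pid)   # SIGUSR1 = phased restart in cluster mode

    failures = 0
    200.times do
      begin
        Net::HTTP.get_response(URI("http://127.0.0.1:3000/"))
      rescue EOFError, SystemCallError
        failures += 1   # the curl 52 / 56 cases shown above
      end
    end
    puts "failed requests: #{failures}"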
Expected behavior
All requests are passed to the application and the application’s responses are all delivered to the requesting clients. The restart documentation for puma suggests that phased restarts are meant to have zero downtime.
Desktop (please complete the following information):
- OS: Linux
- Puma Version: 4.3.5
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 18 (14 by maintainers)
Just wanted to post an update on this: my company is spending resources on this since it’s important to us that this is fixed. We have tried several approaches that have failed in various ways, but we think we’re close to getting a solution that we’re happy with. We’ll open a PR soon with a new integration test and a fix for all 3 of the problems in the issue description.
Also, I should mention that the root issue is a little broader than the title of this issue implies. We’ve demonstrated that requests can fail during normal “hot” restarts on a non-cluster puma deployment (a single server process), as long as you’re using queue_requests (which is on by default); it’s not really related to phased restarts directly.

Worth mentioning one interesting finding I’ve seen while testing queue_requests in a setup of Nginx + multiple Puma instances. The graphs below show response times when there are more requests than puma can handle: 4 pods with 4 threads each serving requests that take 1 second, so roughly 16 requests per second of capacity, against 18 requests per second of load. The spikes at the beginning are the rolling-deploy effect.
Queue requests enabled: (response-time graph)
Queue requests disabled: (response-time graph)
I think disabling queueing is still the way to go in a setup where Nginx is in front and you have longer requests. It produces more predictable results, prevents overload of any particular instance (by not taking on more requests than puma will be able to handle within the graceful-shutdown period), and allows Nginx to balance the load evenly across puma instances.
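For reference, that tweak is a one-line change in each instance’s Puma config. A sketch of the per-instance settings described above (the thread count matches the test setup; everything else is assumed):

    # per-instance Puma config (sketch)
    threads 4, 4          # 4 threads per pod, as in the test above
    queue_requests false  # on by default; disabling it is the recommendation here
                          # when Nginx fronts several Puma instances and requests
                          # are long-running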
Thanks so much @cjlarose! I appreciate Appfolio’s continued investment in the ecosystem!
Just tested Puma 5.0 in our setup. I can confirm that the “Unable to add work while shutting down” error is not reproducible anymore. Overall, I see far fewer errors related to shutdown (unprocessed requests left in sockets). Big thanks to all Puma contributors for your hard work.
I have been investigating this problem for the last few days. In our case, it is issue #2343. However, I will share my findings here since they may help your work.
I am seeing issues very similar to the ones mentioned in the first comment. From the Nginx side, they show up as:
    connect() failed (111: Connection refused) while connecting to upstream
    upstream timed out (110: Operation timed out) while connecting to upstream
- Happens when the socket has too many requests and puma is not picking them up within the proxy_connect_timeout.
    failed (104: Connection reset by peer) while reading response header from upstream
- Happens when the socket has too many requests and puma shuts down.
    upstream prematurely closed connection while reading response header from upstream
- Happens when drain_on_shutdown is disabled with queue_requests enabled: puma has read the request but hasn’t sent anything in response during the shutdown.
    Unable to add work while shutting down
- A concurrency issue inside puma.

I can easily reproduce these errors with 1-second requests overflowing multiple puma pods in non-cluster mode while one of the pods goes through a normal k8s restart.
I can confirm that “Unable to add work while shutting down” is solved by disabling queue_requests. Looking through the code, there is some concurrency in pushing queued requests after the ThreadPool is in shutdown mode.
I also experimented with drain_on_shutdown. This setting makes puma not process anything left in the socket. Have you tried disabling it to see if the restart issues are still reproducible?
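Both options mentioned here are plain entries in the Puma config file. A minimal sketch of the combination under discussion (the values are only there to illustrate the experiment, not a recommendation):

    # the two settings discussed in this thread
    queue_requests false     # on by default; disabling it avoided the
                             # "Unable to add work while shutting down" error here
    drain_on_shutdown false  # controls whether puma drains requests still
                             # waiting in the socket when shutting down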
A colleague and I are looking into fixing the “Unable to add work while shutting down” problem (Scenario 3). We’ve got a plan we’re working on and hope to open a PR at some point.