puma: Requests fail during phased restarts, hot restarts, and shutdown

Describe the bug

During a phased restart, some clients do not get successful responses from the server.

Requests fail in a variety of ways. The ones I’ve observed and that I’m able to reproduce reliably are these:

  1. The server closes the connection without first writing a response onto the socket (curl: (52) Empty reply from server)
  2. curl: (56) Recv failure: Connection reset by peer
  3. The exception Unable to add work while shutting down is raised, invoking the lowlevel_error_handler (by default, this returns a response with status code 500; a sketch of a custom handler follows this list)
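For context on scenario 3, Puma lets you override that default handler in the config file. The following is only an illustrative sketch; the status, headers, and body are placeholders, not something from this report:

# Hypothetical sketch: a custom lowlevel_error_handler in the Puma config.
# Puma invokes this for low-level errors such as "Unable to add work while
# shutting down"; by default it answers with a plain 500 response.
lowlevel_error_handler do |error|
  [500, { "Content-Type" => "text/plain" }, ["Low-level Puma error: #{error.message}\n"]]
end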

Puma config:

port 3000
workers 2

Start command:

PUMA_MASTER_PIDFILE=/usr/src/app/puma.pid
bundle exec puma --pidfile "${PUMA_MASTER_PIDFILE}"

To Reproduce

I’ve created a repo that reliably reproduces the errors: https://github.com/cjlarose/puma-phased-restart-errors

Expected behavior

All requests are passed to the application and the application’s responses are all delivered to the requesting clients. The restart documentation for puma suggests that phased restarts are meant to have zero downtime.

Desktop (please complete the following information):

  • OS: Linux
  • Puma Version: 4.3.5

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 18 (14 by maintainers)

Most upvoted comments

Just wanted to post an update on this: my company is putting resources into this since it's important to us that it gets fixed. We have tried several approaches that failed in various ways, but we think we're close to a solution that we're happy with. We'll open a PR soon with a new integration test and a fix for all three of the problems in the issue description.

Also, I should mention that the root issue is a little broader than the title of this issue implies. We've demonstrated that requests can fail during normal "hot" restarts on a non-cluster Puma deployment (a single server process) as long as queue_requests is enabled (which it is by default); it's not really specific to phased restarts.
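To make the hot-restart claim concrete, here is a rough reproduction sketch (my own, not taken from the repro repo); the port, pidfile path, restart count, and timings are assumptions:

require "net/http"

# Assumes a single-mode Puma server is already running on 127.0.0.1:3000
# with its pid written to puma.pid.
pid = Integer(File.read("puma.pid").strip)

# Trigger a handful of hot restarts in the background (SIGUSR2).
restarter = Thread.new do
  5.times do
    sleep 2
    Process.kill("SIGUSR2", pid)
  end
end

failures = 0
# Keep issuing requests until the restart thread has finished.
until restarter.join(0)
  begin
    res = Net::HTTP.get_response(URI("http://127.0.0.1:3000/"))
    failures += 1 unless res.is_a?(Net::HTTPSuccess)
  rescue SystemCallError, EOFError, Net::ReadTimeout
    # Connection resets, refused connections, and empty replies surface here.
    failures += 1
  end
end

puts "failed requests: #{failures}"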

Worth mentioning one interesting finding I've seen while testing queue_requests in a setup with Nginx in front of multiple Puma instances.

The graphs below show response times when there are more requests than Puma can handle (4 pods with 4 threads each, requests that take 1 second, 18 requests per second). The spikes at the beginning are the effect of a rolling deploy.

Queue requests enabled: [graph: queue-true-drain-true]

Queue requests disabled: [graph: queue-false-drain-true]

I think disabling queueing is still the way to go in a setup with Nginx in front and longer requests. It produces more predictable results, prevents overload of any particular instance (one that takes on more requests than Puma can handle within the graceful shutdown period), and allows Nginx to balance the load evenly across Puma instances.
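For reference, turning queueing off is a one-line change in the Puma config; a minimal sketch (the port and thread counts are placeholders, not values from this comment):

# Puma config sketch with request queueing disabled
# (queue_requests defaults to true).
port 3000
threads 4, 4
queue_requests false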

Thanks so much @cjlarose! I appreciate Appfolio’s continued investment in the ecosystem!

Just tested Puma 5.0 in our setup. I can confirm that the Unable to add work while shutting down error is no longer reproducible. Overall, I see far fewer errors related to shutdown (unprocessed requests left in sockets).

Big thanks to all Puma contributors for your hard work.

I have been investigating this problem for the last few days. In our case, it is issue #2343. However, I will share my findings here since they may help your work.

I am seeing issues very similar to those mentioned in the first comment. From the Nginx side they show up as:

  1. connect() failed (111: Connection refused) while connecting to upstream
  2. upstream timed out (110: Operation timed out) while connecting to upstream - happens when the socket has too many queued requests and Puma does not pick one up within the proxy_connect_timeout
  3. failed (104: Connection reset by peer) while reading response header from upstream - happens when the socket has too many queued requests and Puma shuts down
  4. upstream prematurely closed connection while reading response header from upstream - happens when drain_on_shutdown is disabled with queue_requests enabled; Puma read the request but did not send a response before shutting down
  5. Unable to add work while shutting down - a concurrency issue inside Puma

I can easily reproduce these errors with 1-second requests overflowing multiple Puma pods running in non-cluster mode while one of the pods goes through a normal Kubernetes restart.
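A trivial Rack app is enough to simulate those 1-second requests; this config.ru is my own sketch, not something from the comment:

# config.ru -- every request sleeps for one second, so a modest request
# rate saturates a small Puma thread pool during a restart.
run lambda { |_env|
  sleep 1
  [200, { "Content-Type" => "text/plain" }, ["ok\n"]]
}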

I can confirm that Unable to add work while shutting down is solved by disabling queue_requests. Looking through the code, there is a concurrency issue where queued requests are still pushed onto the ThreadPool after it has entered shutdown mode.

I also experimented with drain_on_shutdown. When it is disabled, Puma does not process anything left in the socket at shutdown. Have you tried disabling it to see if the restart issues are still reproducible?
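If anyone wants to try the same experiment, the toggle is a single line in the Puma config (a sketch, not a recommendation):

# Puma config sketch: skip draining the listen socket on shutdown, so
# requests still sitting unread in the socket are not processed.
drain_on_shutdown false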

A colleague and I are looking into fixing the Unable to add work while shutting down problem (Scenario 3). We’ve got a plan we’re working on and hope to open a PR at some point.