puma: Performance degradation with multithreading and connection pools

I maintain several of the Ruby framework samples for TechEmpower’s FrameworkBenchmarks. The latest Round 14 Preview 1.1 numbers reflect very favorably on all the perf work that’s been going on in the Ruby community over the past few months and years. However, there is an anomaly in Puma’s numbers that may be worth digging into!

Context

Round 14 Preview 1.1 was executed on a machine with 80 hardware threads (and some ungodly amount of RAM).

  • Puma, Unicorn, and Passenger were configured to run with 100 processes.
  • Puma was additionally configured to run with four threads (min and max) per process.
  • Under Puma, Sequel and ActiveRecord were configured to use connection pools with four slots (see the sketch after this list).
  • Under Unicorn and Passenger, Sequel and ActiveRecord were configured to run single-threaded.
  • The wrk benchmark was executed against these instances with a concurrency of 256.
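
For reference, here is a rough sketch of that Puma and Sequel configuration. The exact settings live in the FrameworkBenchmarks repo; the environment variable name below is illustrative.

# config/puma.rb (sketch)
workers 100      # 100 worker processes, as in the benchmark configuration
threads 4, 4     # minimum and maximum threads per worker process

# Sequel connection pool sized to match the per-process thread count (sketch)
DB = Sequel.connect(ENV['DATABASE_URL'], max_connections: 4)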

I’ll be talking about YARV/MRI 2.4 throughout the rest of this report unless otherwise specified. (There is a JRuby/TorqueBox benchmark but it has some other undiagnosed performance issues.)

The Good

On most of the tests, Puma wins handily thanks to its increased concurrency, even on CPU-bound tasks. For example, JSON serialization under Sinatra:

[chart: sinatra-json]
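
For context, the Sinatra route behind that chart is roughly the following; I'm assuming the sinatra/json helper here, as used in the benchmark app.

# Test type 1: JSON serialization (sketch; assumes require 'sinatra/json')
get '/json' do
  json :message => 'Hello, World!'
end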

And single SELECT queries against PostgreSQL under Roda/Sequel (note the FE column):

[chart: roda-sequel-postgres-db]

We can see in that last example that Puma pulls ahead despite the overhead of Sequel automatically checking connections out of and back into its pool. The route being called looks like this (Roda automatically serializes the resulting Hash object to JSON):

# Test type 2: Single database query
static_get '/db' do
  World.with_pk(rand(10_000).succ).values
end

The Bad

However, the wheels start to come off as soon as we start executing more than one query in a route. For example, multiple SELECT queries against PostgreSQL under Roda/Sequel:

[chart: roda-sequel-postgres-query]

Oops! That’s not good at all. Here’s roughly what that code looks like:

# Test type 3: Multiple database queries
static_get '/queries' do
  Array.new(20) do
    World.with_pk(rand(10_000).succ).values
  end
end

Note: I am able to reproduce this “anomaly” locally, and wrapping the loop in Sequel::Database#synchronize (to ensure a single database connection is checked out from the pool and used for all 20 queries) improves the performance of this route by about 10%, but it still trails Unicorn and Passenger by roughly 30% on my machine.
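
For clarity, here is a minimal sketch of that synchronize workaround; DB is assumed to be the application's Sequel::Database instance.

# Test type 3 with the pool-checkout workaround (sketch)
static_get '/queries' do
  DB.synchronize do
    Array.new(20) do
      World.with_pk(rand(10_000).succ).values
    end
  end
end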

This performance disparity can’t be explained by the overhead of Sequel’s connection pool alone, as the single-query example above shows. We can see the same pattern with ActiveRecord, too: here are multiple SELECT and UPDATE queries against MySQL under Sinatra/ActiveRecord:

[chart: sinatra-activerecord-update]

Here’s roughly the code for that route:

# Test type 5: Database updates
get '/updates' do
  worlds =
    ActiveRecord::Base.connection_pool.with_connection do
      Array.new(20) do
        world = World.find(rand(10_000).succ)
        world.update(:randomnumber=>rand(10_000).succ)
        world.attributes
      end
    end

  json worlds
end

The Ugly

The issue seems to stem from a “tight” loop of queries executed in a single request thread with a connection checked out from a pool. When the queries are distributed across multiple request threads, Puma plays nice with a connection pool. When Sequel or ActiveRecord is single-threaded, there are no performance issues whatsoever. That’s true even for Puma: single-threaded Puma beats Unicorn and Passenger in these benchmarks (on my machine)!

This is particularly pernicious because it’s exactly the use case where you would expect Puma’s multithreading to help you, not hurt you: lots of IO-bound tasks. In the real world, we could easily solve this problem by rewriting these routes to perform a single SELECT .. WHERE .. IN query. However, I think these benchmarks raise an interesting question: Is this a pathological case that can’t be fixed without changing the application, or does it demonstrate an edge case that could be quietly afflicting thousands of Ruby applications running under Puma?
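
For illustration, the single-query rewrite mentioned above might look roughly like this in Sequel; the id column name is an assumption based on the benchmark schema.

# Hypothetical rewrite of test type 3 as a single SELECT .. WHERE .. IN query
static_get '/queries' do
  ids = Array.new(20) { rand(10_000).succ }
  World.where(:id => ids).all.map(&:values)
end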

Updates

  1. (25-Mar) Interestingly, a SELECT .. WHERE .. IN is also slower (fewer requests/second) on Puma—multi-threaded (1,443) cf. single-threaded (2,200). So the issue doesn’t appear to be strictly related to a “tight” loop of IO-bound tasks.

  2. (25-Mar) I was able to reproduce my findings under Ruby 2.2.6 and Puma 2.15.3—multi-threaded (256) cf. single-threaded (387)—so the performance anomaly (if there truly is one) has been around for some time.

  3. (25-Mar) Using a connection pool with fewer slots than Puma’s thread depth seems to give decent compromise numbers. I can get within 20% of single-threaded performance using a connection pool size of floor(2*ln(x)), where x is the per-process thread depth, but that’s just a wild guess at a formula based on what I’m seeing at a variety of Puma thread depths (sketched below).
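
A hypothetical helper expressing that rough formula, where the argument is the per-process Puma thread count (my reading of x above; the relationship itself is only the wild guess described in the update):

# Wild-guess pool sizing from update 3 (hypothetical helper)
def guessed_pool_size(thread_depth)
  [(2 * Math.log(thread_depth)).floor, 1].max
end

guessed_pool_size(4)  #=> 2
guessed_pool_size(16) #=> 5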

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 6
  • Comments: 21 (19 by maintainers)

Most upvoted comments

@jrafanie Tests were run under Puma, Unicorn, and Passenger, with Sequel and ActiveRecord, against MySQL and PostgreSQL. All the configuration is available at the linked GitHub repository, and all the benchmark data is available at the linked Round 14 Preview 1.1 results.

Fixed in #2079.

I’m probably going to close this with 5.0 and the various perf/balancing PRs that are going to be involved with that release.

One key detail about this TechEmpower Framework benchmark I didn’t realize earlier: the wrk performance tests use keep-alive HTTP connections. In Puma, incoming connections are routed to a worker process when first accepted and are stuck on that process until closed.

So changes in connection-balancing can only impact this benchmark by routing the initial connections across processes as evenly as possible. However, an optimal solution would be to move the Reactor (which buffers requests and manages keepalive connections) into the parent process (or a separate layer of workers), and then send individual, load-balanced HTTP requests across processes.

An alternative solution would be to modify Puma so that each worker process listens on a separate socket, and run an upstream proxy (nginx or HAProxy) to manage the client connections and load-balance individual incoming requests across the set of Puma worker sockets directly.

I put together a first attempt at partially addressing this issue in PR #1646.

Rather than adding extra communication/synchronization between the workers or proxying requests through the master, I found that non-idle workers can do a non-blocking poll on the socket for queued work as a form of indirect communication, keeping the PR relatively simple compared to other alternatives.

This seems to help balance load across processes slightly and gets much closer to single-thread performance in my tests, but the caveat of this simple implementation is that it’s only effective when the average thread depth is <= 1. Balancing requests across higher thread-depths might result in diminishing returns anyway, due to the GVL.

@mwpastore I’d be interested in hearing if this PR has any impact on the benchmarks you’ve been checking in this issue. In my local tests against roda-sequel-postgres I’m seeing some improvement.

The reason for the database latency spikes is that while the thread that issued the query was waiting, another thread began to run, and only once the data from the query was returned did the original thread go back into the run queue. Because your traffic load is causing all threads to run, there is just more perceived time waiting for the database to return, when really what you’re seeing is thread scheduling delay. The question is: are you ok with that?

In the Unicorn case, the worker process has the database connection all to itself, so when the database results come back, it begins working on them immediately.