puma: Lack of SSL max_version can cause excessively slow negotiation, 502 on AWS w/ELB

Steps to reproduce

  1. A Rails app running in Puma (version 3.12.0, ruby 2.5.3p105) serving a simple text file out of /public

  2. puma running as a systemd service

[Service]
Type=forking
WorkingDirectory=/var/www/test-app/current
EnvironmentFile=/var/www/test-app/shared/.env.production
ExecStart=/bin/bash -lc 'cd /var/www/test-app/current && /usr/share/rvm/bin/rvm default do bundle exec puma --daemon --pidfile /var/www/test-app/shared/tmp/pids/puma.pid --bind "ssl://0.0.0.0:3000?key=/var/www/test-app/shared/ssl-cert-snakeoil.key&cert=/var/www/test-app/shared/ssl-cert-snakeoil.pem" --config /var/www/test-app/current/config/puma.rb --environment production --redirect-stdout /var/www/test-app/shared/log/puma.log --redirect-stderr /var/www/test-app/shared/log/puma.error.log'
PIDFile=/var/www/test-app/shared/tmp/pids/puma.pid
  3. Configure an AWS Application Load Balancer to accept HTTPS traffic on port 443 and forward it to the Puma instance above via HTTPS (port 3000)

  4. Make lots of requests via a browser, or automate it (a Ruby alternative is sketched after this step), e.g.:
    siege -b --no-parser -t1m -c30 https://lblb......eu-central-1.elb.amazonaws.com/robots.txt | grep 502
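The same load can be generated from Ruby; a minimal sketch, where the ALB hostname is a placeholder and each thread reuses one keep-alive connection, which is the pattern that matters here:

# load_test.rb -- hypothetical helper, not part of the reported setup
require "net/http"
require "uri"

uri = URI("https://ALB_HOST_PLACEHOLDER.eu-central-1.elb.amazonaws.com/robots.txt")

# 30 concurrent keep-alive connections, 100 requests each, counting 502s.
threads = Array.new(30) do
  Thread.new do
    count = 0
    Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
      100.times { count += 1 if http.get(uri.path).code == "502" }
    end
    count
  end
end

puts "502 responses: #{threads.sum(&:value)}"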

Expected behavior

100% 200 OK

Actual behavior

A small percentage of requests (5-10 out of roughly 4000 completed in 1 minute) in the above test fail with a 502 Bad Gateway.

System configuration

ruby 2.5.3p105, Rails 5.2.2, Puma 3.12.0

The AWS ELB logs for the failed requests show the request arriving, taking <1 ms to be read by AWS and <0.5 s to be processed in total. They show the Puma connection being closed without a reason, and consistently report 216 bytes received from Puma. Puma's stdout and stderr show nothing. I've added the x-aws-trace-id headers to my Rails request logs to confirm that the Rails app never gets a chance to log the failing requests.
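For reference, a minimal sketch of how such a trace header can be attached to Rails request logs (the module name is a placeholder and this is an assumed setup, not the reporter's exact code; the ALB sets X-Amzn-Trace-Id on forwarded requests):

# config/application.rb (fragment; TestApp is a placeholder)
module TestApp
  class Application < Rails::Application
    # Prefix every Rails log line with the ALB trace header so requests can be
    # cross-referenced against the ELB access log entries for failed requests.
    config.log_tags = [->(request) { request.headers["X-Amzn-Trace-Id"] }]
  end
end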

ENV

RAILS_ENV=production, RAILS_MAX_THREADS=8, WEB_CONCURRENCY=6, on a t2.xlarge with nothing else running on it

puma.rb

# Puma can serve each request in a thread from an internal thread pool.
# The `threads` method setting takes two numbers: a minimum and maximum.
# Any libraries that use thread pools should be configured to match
# the maximum value specified for Puma. Default is set to 5 threads for minimum
# and maximum; this matches the default thread size of Active Record.
#
threads_count = ENV.fetch("RAILS_MAX_THREADS") { 5 }
threads threads_count, threads_count


persistent_timeout = 75 # ALB  keeps connections alive for 60s so we need to be longer then that

# Specifies the `port` that Puma will listen on to receive requests; default is 3000.
#
port        ENV.fetch("PORT") { 3000 }

# Specifies the `environment` that Puma will run in.
#
environment ENV.fetch("RAILS_ENV") { "development" }

# Specifies the number of `workers` to boot in clustered mode.
# Workers are forked webserver processes. If using threads and workers together
# the concurrency of the application would be max `threads` * `workers`.
# Workers do not work on JRuby or Windows (both of which do not support
# processes).
#
workers ENV.fetch("WEB_CONCURRENCY") { 2 }

# Use the `preload_app!` method when specifying a `workers` number.
# This directive tells Puma to first boot the application and load code
# before forking the application. This takes advantage of Copy On Write
# process behavior so workers use less memory. If you use this option
# you need to make sure to reconnect any threads in the `on_worker_boot`
# block.
#
# preload_app!

# If you are preloading your application and using Active Record, it's
# recommended that you close any connections to the database before workers
# are forked to prevent connection leakage.
#
# before_fork do
#   ActiveRecord::Base.connection_pool.disconnect! if defined?(ActiveRecord)
# end

# The code in the `on_worker_boot` will be called if you are using
# clustered mode by specifying a number of `workers`. After each worker
# process is booted, this block will be run. If you are using the `preload_app!`
# option, you will want to use this block to reconnect to any threads
# or connections that may have been created at application boot, as Ruby
# cannot share connections between processes.
#
# on_worker_boot do
#   ActiveRecord::Base.establish_connection if defined?(ActiveRecord)
# end
#

# Allow puma to be restarted by `rails restart` command.
plugin :tmp_restart

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 12
  • Comments: 17 (5 by maintainers)

Most upvoted comments

And now I think I understand the mechanism. The default PERSISTENT_TIMEOUT = 20 seconds in Puma is less than the default idle_timeout.timeout_seconds = 60 seconds on ALBs. Switching to TLS from the ALB to Puma effectively made the ALB and Puma start using keep-alives. Puma closes the connection first, the ALB takes the backend connection to have died, and it serves a 502 for one more request. My attempt to fix this ages ago had a typo but was the right solution for me (and, it turns out, for others on my team in other contexts). It was hard to find because the connection needed to idle for 20-60 s: any shorter and Puma would still have a connection, any longer and the ALB would already have dropped it.

@amichal I wonder if it is too late to chime in, but I noticed that the setting for persistent_timeout should perhaps be:

- persistent_timeout = 75
+ persistent_timeout 75

https://github.com/puma/puma/pull/1017

I was facing a similar issue and thought that setting persistent_timeout correctly should narrow down or solve the problem, based on the AWS docs:

https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-troubleshooting.html#http-502-issues

https://docs.aws.amazon.com/elasticloadbalancing/latest/application/application-load-balancers.html#connection-idle-timeout
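For clarity, the corrected line in config/puma.rb is a plain DSL call; with `persistent_timeout = 75` Ruby only assigns a local variable and Puma never sees the value. A minimal sketch, with 75 chosen to exceed the ALB's default 60-second idle timeout:

# config/puma.rb
# Keep idle connections open longer than the ALB's 60 s idle timeout,
# so the load balancer (not Puma) is the side that closes them first.
persistent_timeout 75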

The 502s on AWS ELB seem to be caused by SSL connections being closed by Puma without sending the shutdown alert, as described in #2675. The shutdown alerts were accidentally removed by the changes in #1334 a while ago.

Increasing persistent_timeout above 60 seconds half-solves the problem by ensuring Puma does not close connections before the ELB does. However, if a worker is stopped for some reason, for example when puma-worker-killer is used, it still causes some 502s for clients.

Our production case, number of 502 responses per 3-hour interval (Kibana chart: kibana_http_502):

  • Before 2021-08-25: Standard puma configuration, about 100 HTTP 502 per hour
  • Between 2021-08-25 and 2021-08-27: persistent_timeout 75, about 10 HTTP 502 per hour due to worker restarts by puma worker killer
  • Since 2021-08-27: default persistent timeout + code from #2675, less than 10 HTTP 502 per day, still correlated to worker restarts, probably when the ALB receives the SSL shutdown alert right after or at the same time as it sends the request to puma.

I’ve ignored this issue (it’s very intermittent for us) for a long time, but I just wanted to drop back in and say it still happens in 4.3.5. I’m trying persistent_timeout 74 and have tried, unsuccessfully, playing around with workers/threads to see if it’s statistically related to the connection pool. We have recently learned that it seems to happen more after the site has been idle for 30 s or more. I’m working on some more concrete tests, but if anyone has a deeper understanding of what’s going on and wants some logs to track this down, let me know.
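One way to exercise that idle window directly is to hold a single keep-alive connection open, idle past Puma's default 20-second persistent timeout, and then reuse it. A rough sketch, assuming Puma is listening locally on the ssl://0.0.0.0:3000 binding from the systemd unit above (self-signed cert, tiny /robots.txt):

# idle_probe.rb -- hypothetical probe, not part of the reported setup
require "socket"
require "openssl"

ctx = OpenSSL::SSL::SSLContext.new
ctx.verify_mode = OpenSSL::SSL::VERIFY_NONE   # self-signed snakeoil cert

tcp = TCPSocket.new("localhost", 3000)
ssl = OpenSSL::SSL::SSLSocket.new(tcp, ctx)
ssl.sync_close = true
ssl.connect

request = "GET /robots.txt HTTP/1.1\r\nHost: localhost\r\nConnection: keep-alive\r\n\r\n"

ssl.write(request)
puts ssl.readpartial(4096).lines.first        # e.g. "HTTP/1.1 200 OK"

sleep 30                                      # idle longer than persistent_timeout

begin
  ssl.write(request)
  puts ssl.readpartial(4096).lines.first
rescue EOFError, Errno::ECONNRESET, Errno::EPIPE, OpenSSL::SSL::SSLError => e
  puts "server closed the idle connection: #{e.class}"
end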

@MSP-Greg thanks so much for the correspondence.

I tried to find info about the AWS load balancers, and the docs I found seemed to indicate that they (at present) only support TLSv1.2?

Indeed, it is only up to TLSv1.2 for now, based on the documentation and their underlying library’s roadmap.

I am totally not a TLS/SSL expert, but to my understanding it would indeed be a negotiation down (of versions). In this case the AWS ELB is the client and asks for TLS v1.2, which Puma (at v4.3.3 with OpenSSL 1.1.1) can negotiate.

# I have a local puma server with SSL bindings at localhost:3001
$ sslscan --show-ciphers https://localhost:3001

...
  TLS Fallback SCSV:
Server supports TLS Fallback SCSV

  TLS renegotiation:
Session renegotiation not supported

  TLS Compression:
Compression disabled

  Heartbleed:
TLS 1.2 not vulnerable to heartbleed
TLS 1.1 not vulnerable to heartbleed
TLS 1.0 not vulnerable to heartbleed

  Supported Server Cipher(s):
Preferred TLSv1.2  256 bits  ECDHE-RSA-AES256-GCM-SHA384   Curve P-256 DHE 256
Accepted  TLSv1.2  256 bits  ECDHE-RSA-AES256-SHA384       Curve P-256 DHE 256
Accepted  TLSv1.2  256 bits  ECDHE-RSA-AES256-SHA          Curve P-256 DHE 256
Accepted  TLSv1.2  256 bits  AES256-GCM-SHA384            
Accepted  TLSv1.2  256 bits  AES256-SHA256                
Accepted  TLSv1.2  256 bits  AES256-SHA                   
Accepted  TLSv1.2  256 bits  CAMELLIA256-SHA              
Accepted  TLSv1.2  128 bits  ECDHE-RSA-AES128-GCM-SHA256   Curve P-256 DHE 256
Accepted  TLSv1.2  128 bits  ECDHE-RSA-AES128-SHA256       Curve P-256 DHE 256
Accepted  TLSv1.2  128 bits  ECDHE-RSA-AES128-SHA          Curve P-256 DHE 256
Accepted  TLSv1.2  128 bits  AES128-GCM-SHA256            
Accepted  TLSv1.2  128 bits  AES128-SHA256                
Accepted  TLSv1.2  128 bits  AES128-SHA                   
Accepted  TLSv1.2  128 bits  CAMELLIA128-SHA
...          

So it would seem the negotiation should be fine (at least most of the time).

But yes, negotiating downwards may take time, and AWS ELBs can be strict about that.
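A rough way to check whether the downgrade handshake itself is slow is to cap the client at TLSv1.2 (as the ALB effectively does) and time the handshake against the same local Puma SSL binding used for the sslscan above. A sketch, assuming Ruby's openssl gem 2.0+ for max_version=:

# handshake_timing.rb -- hypothetical check, not part of the reported setup
require "socket"
require "openssl"
require "benchmark"

ctx = OpenSSL::SSL::SSLContext.new
ctx.max_version = OpenSSL::SSL::TLS1_2_VERSION   # behave like the ALB client
ctx.verify_mode = OpenSSL::SSL::VERIFY_NONE

elapsed = Benchmark.realtime do
  tcp = TCPSocket.new("localhost", 3001)
  ssl = OpenSSL::SSL::SSLSocket.new(tcp, ctx)
  ssl.sync_close = true
  ssl.connect
  puts "negotiated #{ssl.ssl_version}"           # expect "TLSv1.2"
  ssl.close
end

puts format("handshake took %.3f s", elapsed)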

The fact that the HTTP 502s are intermittent makes them hard to debug and reproduce.

I could verify from our CloudWatch metrics on the AWS Load Balancer that prior to the HTTPS setup no HTTP 502s occurred (with consistent traffic before and after).

I would try forcing TLS v1.2 in Puma to see if it solves the issue 🙇
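At the time there was no max_version option in Puma's SSL binding (hence the issue title), but the binding does accept no_tlsv1 / no_tlsv1_1 flags in Puma 4.x; since the ALB itself tops out at TLSv1.2, disabling the older versions leaves v1.2 as the only candidate. A sketch via the ssl_bind DSL, reusing the key/cert paths from the systemd unit above, not a verified drop-in:

# config/puma.rb
# Disable TLSv1 and TLSv1.1; the ALB tops out at TLSv1.2, so v1.2
# becomes the only mutually acceptable version.
ssl_bind "0.0.0.0", "3000",
  key:         "/var/www/test-app/shared/ssl-cert-snakeoil.key",
  cert:        "/var/www/test-app/shared/ssl-cert-snakeoil.pem",
  verify_mode: "none",
  no_tlsv1:    true,
  no_tlsv1_1:  true

The same flags should also be accepted as query parameters on the --bind URL (&no_tlsv1=true&no_tlsv1_1=true) if keeping the binding in the systemd unit is preferred.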

Similarly, if I use siege to hit Puma on port 3000 directly I get no 502s. The problem is the interaction between the AWS ALB and Puma with SSL.

If I change only the --bind "ssl://0.0.0.0:3000?key=/var/www/test-app/shared/ssl-cert-snakeoil.key&cert=/var/www/test-app/shared/ssl-cert-snakeoil.pem" to --bind "tcp://0.0.0.0:3000", AWS never reports 502 errors (where "never" means not in 10k sustained requests with the above siege line).