puma: Lack of SSL max_version can cause excessively slow negotiation, 502 on AWS w/ELB
Steps to reproduce
- Rails app running in puma 3.12.0 (ruby 2.5.3p105), serving a simple text file out of /public
- puma running as a systemd service:
[Service]
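# Type=forking below pairs with puma's --daemon flag in ExecStart: systemd
# waits for the launcher process to fork, then tracks the daemon via PIDFile.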
Type=forking
WorkingDirectory=/var/www/test-app/current
EnvironmentFile=/var/www/test-app/shared/.env.production
ExecStart=/bin/bash -lc 'cd /var/www/test-app/current && /usr/share/rvm/bin/rvm default do bundle exec puma --daemon --pidfile /var/www/test-app/shared/tmp/pids/puma.pid --bind "ssl://0.0.0.0:3000?key=/var/www/test-app/shared/ssl-cert-snakeoil.key&cert=/var/www/test-app/shared/ssl-cert-snakeoil.pem" --config /var/www/test-app/current/config/puma.rb --environment production --redirect-stdout /var/www/test-app/shared/log/puma.log --redirect-stderr /var/www/test-app/shared/log/puma.error.log'
PIDFile=/var/www/test-app/shared/tmp/pids/puma.pid
- Configure an AWS Application Load Balancer to accept HTTPS traffic on port 443 and forward it to the puma instance above via HTTPS (port 3000)
- Make lots of requests via a browser. This can also be automated, e.g.:
siege -b --no-parser -t1m -c30 https://lblb......eu-central-1.elb.amazonaws.com/robots.txt | grep 502
Expected behavior
100% 200 OK
Actual behavior
A small percentage (5-10 requests out of ~4000 completed in 1 minute) of requests in the above test fail with a 502 Bad Gateway.
System configuration
ruby 2.5.3p105, Rails 5.2.2, puma 3.12.0
AWS ELB logs for failed requests show the request arriving, taking <1ms to be read by AWS and <0.5s to be processed in total. They show the puma connection closed without a reason, and consistently 216 bytes processed from puma. Puma's stdout shows nothing, and neither does stderr. I've added x-aws-trace-id headers to my Rails request logs to confirm that the Rails app never gets a chance to log the failing requests.
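One way to do that tagging (a sketch, assuming Rails' log_tags hook and the ALB's X-Amzn-Trace-Id request header; adjust the header name to whatever your setup forwards):

# config/environments/production.rb
# Prefix every Rails request log line with the load balancer's trace id so
# application log entries can be matched against the ELB access logs.
config.log_tags = [ ->(request) { request.headers["X-Amzn-Trace-Id"] } ]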
ENV
RAILS_ENV=production RAILS_MAX_THREADS=8 WEB_CONCURRENCY=6, on a t2.xlarge with nothing else running on it
puma.rb
# Puma can serve each request in a thread from an internal thread pool.
# The `threads` method setting takes two numbers: a minimum and maximum.
# Any libraries that use thread pools should be configured to match
# the maximum value specified for Puma. Default is set to 5 threads for minimum
# and maximum; this matches the default thread size of Active Record.
#
threads_count = ENV.fetch("RAILS_MAX_THREADS") { 5 }
threads threads_count, threads_count
persistent_timeout = 75 # ALB keeps connections alive for 60s so we need to be longer than that
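# NOTE: with the `=` this is a plain Ruby local variable assignment, not a
# call to Puma's persistent_timeout DSL method, so it has no effect (the typo
# discussed in the comments further down).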
# Specifies the `port` that Puma will listen on to receive requests; default is 3000.
#
port ENV.fetch("PORT") { 3000 }
# Specifies the `environment` that Puma will run in.
#
environment ENV.fetch("RAILS_ENV") { "development" }
# Specifies the number of `workers` to boot in clustered mode.
# Workers are forked webserver processes. If using threads and workers together
# the concurrency of the application would be max `threads` * `workers`.
# Workers do not work on JRuby or Windows (both of which do not support
# processes).
#
workers ENV.fetch("WEB_CONCURRENCY") { 2 }
# Use the `preload_app!` method when specifying a `workers` number.
# This directive tells Puma to first boot the application and load code
# before forking the application. This takes advantage of Copy On Write
# process behavior so workers use less memory. If you use this option
# you need to make sure to reconnect any threads in the `on_worker_boot`
# block.
#
# preload_app!
# If you are preloading your application and using Active Record, it's
# recommended that you close any connections to the database before workers
# are forked to prevent connection leakage.
#
# before_fork do
# ActiveRecord::Base.connection_pool.disconnect! if defined?(ActiveRecord)
# end
# The code in the `on_worker_boot` will be called if you are using
# clustered mode by specifying a number of `workers`. After each worker
# process is booted, this block will be run. If you are using the `preload_app!`
# option, you will want to use this block to reconnect to any threads
# or connections that may have been created at application boot, as Ruby
# cannot share connections between processes.
#
# on_worker_boot do
# ActiveRecord::Base.establish_connection if defined?(ActiveRecord)
# end
#
# Allow puma to be restarted by `rails restart` command.
plugin :tmp_restart
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 12
- Comments: 17 (5 by maintainers)
And now I think I understand the mechanism. The default PERSISTENT_TIMEOUT = 20 seconds in puma is less than the default idle_timeout.timeout_seconds = 60 seconds in ALBs. When we switched to TLS from the ALB to puma, we effectively caused ALB/puma to start using keep-alives. puma closes the connection first, the ALB understands the backend connection to have died, and it serves a 502 for one more request. My attempt to fix this ages ago had a typo but was the right solution for me (and, it turns out, for others on my team in other contexts). It was hard for us to find because the connection needed to idle for 20-60s: any shorter and puma would still hold the connection, any longer and the ALB would already have dropped it.

@amichal I wonder if it is too late to chime in, but I noticed that perhaps the setting for persistent_timeout should be set differently. I was facing a similar issue and was thinking that setting persistent_timeout correctly should narrow or solve the problem, based on the docs from AWS:
https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-troubleshooting.html#http-502-issues
https://docs.aws.amazon.com/elasticloadbalancing/latest/application/application-load-balancers.html#connection-idle-timeout
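Concretely, a minimal sketch of the suggested setting (assuming an ALB left at its default 60-second idle timeout; note the DSL method call, with no =):

# config/puma.rb
# Keep puma's keep-alive timeout above the ALB's 60s idle timeout so that the
# ALB, not puma, is the side that closes an idle connection first.
persistent_timeout 75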
The 502s on AWS ELB seem to be caused by the SSL connections being closed by puma without sending the shutdown alert, as described in #2675. The shutdown alerts were accidentally removed by the changes in #1334 a while ago.
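For illustration only (a sketch using Ruby's standard OpenSSL bindings, not puma's internals): the difference is between dropping the TCP socket abruptly and performing a graceful TLS shutdown that sends the close_notify alert first. The host and port here are hypothetical.

require "openssl"
require "socket"

tcp = TCPSocket.new("backend.example", 3000)
ssl = OpenSSL::SSL::SSLSocket.new(tcp, OpenSSL::SSL::SSLContext.new)
ssl.sync_close = true  # close the underlying TCP socket along with the SSL layer
ssl.connect

# Abrupt: calling tcp.close here would end the connection with no close_notify,
# which a peer such as an ALB can treat as a dead backend (a 502 follows).
# Graceful: sysclose sends the TLS close_notify alert, then closes the socket.
ssl.sysclose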
Increasing the persistent_timeout above 60 seconds half-solves the problem by ensuring puma does not close connections before the ELB does. However, if for some reason a worker is stopped, for example when puma-worker-killer is used, it still causes some 502s for the clients.
Our production case (chart: number of 502 responses per 3-hour intervals): with persistent_timeout 75, about 10 HTTP 502s per hour due to worker restarts by puma worker killer.

I've ignored this issue (it's very intermittent for us) for a long time, but I just wanted to drop back in and say it still happens in 4.3.5. I'm trying persistent_timeout 74 and have tried, unsuccessfully, playing around with workers/threads to see if it's statistically related to the connection pool. We have recently learned that it seems to happen more after the site has been idle for 30s or more. I'm working on some more concrete tests, but if anyone has a deeper understanding of what's going on and wants some logs to track this down, let me know.

@MSP-Greg thanks so much for the correspondence.
Indeed, it is only up to TLSv1.2 for now, based on the documentation and their underlying library's roadmap.
I am totally not a TLS/SSL expert, but to my understanding it would indeed be a downward negotiation (of versions). In this case, the AWS ELB is the client and requests TLS v1.2, which Puma (at v4.3.3 with OpenSSL 1.1.1) can negotiate, so it'd seem the negotiation should be fine (at least most of the time).
But yes, the downward negotiation may take time, which AWS ELBs can be critical about.
The fact that the HTTP 502 is intermittent makes it hard to debug / reproduce.
I could verify with our CloudWatch metrics on the AWS Load Balancer that prior to the HTTPS setup no HTTP 502s occurred (with consistent traffic before and after).
I would try forcing TLS v1.2 with Puma to see if it solves the issue 🙇
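As a sketch of what that forcing could look like, assuming puma 4.x's no_tlsv1 / no_tlsv1_1 bind options (these only raise the minimum version; a max_version cap was not available at the time, which is what the issue title refers to):

# config/puma.rb (sketch; key/cert paths as in the unit file above)
ssl_bind "0.0.0.0", "3000",
  key: "/var/www/test-app/shared/ssl-cert-snakeoil.key",
  cert: "/var/www/test-app/shared/ssl-cert-snakeoil.pem",
  no_tlsv1: true,    # refuse TLS 1.0
  no_tlsv1_1: true   # refuse TLS 1.1, leaving TLS 1.2+ to negotiate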
Similarly, if I use siege to hit puma on port 3000 directly, I get no 502s. The problem is the interaction between the AWS ALB and puma with SSL.
If I change only the
--bind "ssl://0.0.0.0:3000?key=/var/www/test-app/shared/ssl-cert-snakeoil.key&cert=/var/www/test-app/shared/ssl-cert-snakeoil.pem"
to
--bind "tcp://0.0.0.0:3000"
AWS never reports 502 errors (where "never" means not with 10k sustained requests with the above siege line).