http: Upgrade to 5.0.2 potentially caused hanging threads
Hello,
We recently upgraded to 5.0.2 and almost immediately started to notice major degradation on one of our Sidekiq queues that makes heavy use of `HTTP`. I have attached a screenshot that shows our `jobs.wait` metric - how long a job takes to complete from the time it was enqueued, in ms - for the last four weeks. We started seeing a lot of “dead” threads - that is, jobs making external HTTP requests that would hang indefinitely and exhaust our worker pool until we restarted our dynos; the problem resolved immediately after each restart. The vertical line shows when the release with the HTTP upgrade was rolled out. I downgraded back to 5.0.1 today and we have so far observed no dead threads.
I strongly suspect a regression in this PR, but haven’t yet been able to build a reproducible test case due to the intermittent nature of external resources and the concurrent access to them. I’m happy to provide additional information and welcome any tips on how I can troubleshoot. We do use the global `.timeout(20)` in these requests (a rough sketch of that setup is below).
Thanks!
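
For reference, a minimal sketch of how that global timeout is configured on our end (the endpoint and call shape here are illustrative, not our actual code):

```ruby
require "http"

# One shared 20-second budget for the whole request (connect + write + read),
# via http.rb's global timeout mode.
response = HTTP
  .timeout(20)
  .get("https://api.example.com/widgets") # hypothetical endpoint

puts response.status
```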

About this issue
- State: closed
- Created 3 years ago
- Reactions: 5
- Comments: 27 (25 by maintainers)
@midnight-wonderer but do you agree that after #638 the DNS lookup is done outside of the timeout block? So if it hangs for some reason, it’s going to hang forever?
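
To illustrate the concern, a rough sketch (using Ruby’s `Timeout` and a plain `TCPSocket` as stand-ins for the gem’s timeout classes and connection logic; the host is hypothetical):

```ruby
require "resolv"
require "socket"
require "timeout"

host = "api.example.com" # hypothetical host

# Old shape: getaddrinfo runs inside TCPSocket.new, so the surrounding timer
# also bounds a stalled DNS lookup.
Timeout.timeout(20) { TCPSocket.new(host, 443) }

# New shape: the lookup happens first, outside any timer, so a stalled
# resolver is never interrupted and the thread can hang indefinitely.
addresses = Resolv.getaddresses(host)
addresses.each do |addr|
  Timeout.timeout(20) { TCPSocket.new(addr, 443) } # only the connect is timed
end
```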
@jordanstephens for me it looks like you are right and this is how `PerOperation` works now. This doesn’t explain why threads could be hanging forever, though.

Didn’t #638 change the meaning of the `timeout` parameter during connection? Before #638, DNS resolution took place within the scope of the connect timeout block (via `getaddrinfo` during `TCPSocket.new`). Now it takes place outside the scope of the connect timeout. Also, after #638, if the hostname resolves to multiple addresses and the timeout is reached, the timeout is restarted and another attempt is made against the next address. So, for example, if we are trying to connect to a hostname which resolves to 8 addresses and our connect timeout is 30s, then the maximum runtime (if none of the hosts are accepting requests) has increased from 30s to 240s.

Should that have been a breaking change? Have I misunderstood anything here?
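
A rough sketch of that worst case (illustrative shape only, not the gem’s actual source; host and timeout values match the example above):

```ruby
require "resolv"
require "socket"
require "timeout"

CONNECT_TIMEOUT = 30 # seconds
host = "api.example.com" # hypothetical host resolving to 8 addresses

Resolv.getaddresses(host).each do |addr|
  # The timer restarts for each address, so if every host silently drops
  # packets the total wait is 8 * 30s = 240s rather than 30s.
  socket = Timeout.timeout(CONNECT_TIMEOUT) { TCPSocket.new(addr, 443) }
  break socket
rescue Timeout::Error
  next # move on to the next address with a fresh 30s budget
end
```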
Note: #695 shipped in v5.0.3.
~I believe the combination of `log_time` and `@time_left` prevents the “resetting” of the timeout that @jordanstephens describes.~ Regardless, I can’t get to the bottom of this, so I think it would be a wise decision to revert: https://github.com/httprb/http/pull/695. Surely we can just rely on BGP to route traffic to proper addresses…right? 😉

EDIT: It is correct, however, that DNS resolution takes place outside the scope of the timeout block…perhaps that is to blame here.

EDIT 2: Axed the above: if Timeout is raised, `log_time` would never be called, and thus the connect timeout is actually fresh for every address. Seems like we should just revert this for now.
@PhilCoggins mind opening a PR?
It will only retry N + (number of A records returned from `Resolv.getaddresses(host_name)`) times. After the final iteration, `::StopIteration` is raised and then it re-raises the last error that occurred.
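
A rough sketch of that retry shape (illustrative only, not the gem’s actual source; the host is hypothetical and a plain TCP connect stands in for the gem’s connection logic):

```ruby
require "resolv"
require "socket"

addrs = Resolv.getaddresses("api.example.com").each # enumerator over the A records
last_error = nil

begin
  addr = addrs.next                # raises StopIteration after the last address
  socket = TCPSocket.new(addr, 443)
rescue StopIteration
  raise last_error if last_error   # re-raise the last connection error seen
  raise                            # no addresses resolved at all
rescue SystemCallError => e        # e.g. ECONNREFUSED, ETIMEDOUT
  last_error = e
  retry                            # try the next address
end
```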