http: Upgrade to 5.0.2 potentially caused hanging threads

Hello,

We recently upgraded to 5.0.2 and almost immediately noticed major degradation on one of our Sidekiq queues that makes heavy use of HTTP. I have attached a screenshot showing our jobs.wait metric (how long a job takes to complete from the time it was enqueued, in ms) for the last four weeks. We started seeing a lot of “dead” threads, i.e. jobs making external HTTP requests that would hang indefinitely and exhaust our worker pool; restarting our dynos resolved the hang immediately, every time. The vertical line shows when the release with the HTTP upgrade was rolled out. I downgraded back to 5.0.1 today, and we have thus far observed no dead threads.

I strongly suspect a regression in this PR, but I haven’t yet been able to build a reproducible test case due to the intermittent nature of external resources and the concurrent access to them. I’m happy to provide additional information and would welcome any tips on how to troubleshoot. We do use the global .timeout(20) in these requests.
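For reference, a minimal sketch of how these requests are made (the URL and response handling are placeholders; only the global .timeout(20) is literal):

```ruby
require "http"

# Global timeout of 20 seconds per request, as described above.
# The URL is a placeholder for the external services our jobs call.
response = HTTP.timeout(20).get("https://example.com/resource")
puts response.status
```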

Thanks!

[Screenshot: jobs.wait metric over the last four weeks (Screen Shot 2021-09-28 at 6.12.00 PM)]

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 5
  • Comments: 27 (25 by maintainers)

Most upvoted comments

@midnight-wonderer but do you agree that after #638 the DNS lookup is done outside of the timeout block? So if it hangs for some reason, it’s going to hang forever?
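Roughly the shape of the change being discussed (illustrative only, not the gem’s actual implementation; host, port, and connect_timeout are placeholders):

```ruby
require "resolv"
require "socket"
require "timeout"

host, port, connect_timeout = "example.com", 443, 30

# Before #638 (roughly): resolution and connect shared one deadline,
# because TCPSocket.new performs getaddrinfo inside the timed block.
Timeout.timeout(connect_timeout) { TCPSocket.new(host, port) }

# After #638 (roughly): the lookup runs first, outside any deadline,
# so a hung resolver hangs the thread with nothing to interrupt it.
addresses = Resolv.getaddresses(host)
addresses.each do |address|
  Timeout.timeout(connect_timeout) { TCPSocket.new(address, port) }
end
```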

@jordanstephens it looks to me like you are right, and this is how PerOperation works now. This doesn’t explain why threads could hang forever, though.

Didn’t #638 change the meaning of the timeout parameter during connection?

Before #638, DNS resolution took place within the scope of the connect timeout block (via getaddrinfo during TCPSocket.new); now it takes place outside the scope of the connect timeout. Additionally, after #638, if the hostname resolves to multiple addresses and the timeout is reached, the timeout is restarted and another attempt is made against the next address.

So, for example, if we are trying to connect to a hostname which resolves to 8 addresses and our connect timeout is 30s, then the maximum runtime (if none of the hosts are accepting requests) has increased from 30s to 240s.
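A quick sketch of that arithmetic, assuming every address fails by exhausting its own timeout:

```ruby
connect_timeout = 30    # seconds
resolved_addresses = 8  # A records returned for the hostname

# Before #638: one deadline covered DNS resolution and the connect attempt.
max_runtime_before = connect_timeout                      # => 30

# After #638: the deadline restarts for each resolved address.
max_runtime_after = resolved_addresses * connect_timeout  # => 240
```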

Should that have been a breaking change? Have I misunderstood anything here?

Note: #695 shipped in v5.0.3.

~~I believe the combination of log_time and @time_left prevents the “resetting” of the timeout that @jordanstephens describes.~~ Regardless, I can’t get to the bottom of this, so I think it would be a wise decision to revert: https://github.com/httprb/http/pull/695.

Surely we can just rely on BGP to route traffic to proper addresses…right? 😉

EDIT: It is correct, however, that DNS resolution takes place outside the scope of the timeout block… perhaps that is to blame here.

EDIT 2: Axed the above: if Timeout is raised, log_time is never called, and thus the connect timeout is actually fresh for every address.
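To illustrate that reasoning, a loose, self-contained sketch of the control flow (the names log_time and @time_left come from this thread; everything else is illustrative, not the gem’s actual code):

```ruby
require "timeout"

@time_left = 30.0 # connect timeout in seconds

def attempt_connect(address)
  sleep 60 # stand-in for a connect() that never completes
end

def log_time(started_at)
  @time_left -= Process.clock_gettime(Process::CLOCK_MONOTONIC) - started_at
end

%w[10.0.0.1 10.0.0.2].each do |address|
  started_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  Timeout.timeout(@time_left) do
    attempt_connect(address)
    log_time(started_at) # never reached here: Timeout raises first
  end
rescue Timeout::Error
  next # @time_left is still 30.0, so the next address gets a full, "fresh" timeout
end
```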

Seems like we should just revert this for now.

@PhilCoggins mind opening a PR?

So I guess the problem is that, in some situations, it starts raising and retrying forever.

It will only retry N + (the number of A records returned by Resolv.getaddresses(host_name)) times. After the final iteration, ::StopIteration is raised, and then the last error that occurred is re-raised.
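A loose sketch of that enumerator pattern (the connect helper and host name are illustrative, not the gem’s actual code):

```ruby
require "resolv"
require "socket"

host_name = "example.com"

# Illustrative connect; may raise Errno::ETIMEDOUT, ECONNREFUSED, etc.
def connect(address)
  Socket.tcp(address, 443, connect_timeout: 30)
end

addresses = Resolv.getaddresses(host_name).each # enumerator over resolved addresses
last_error = nil

begin
  connect(addresses.next) # .next raises ::StopIteration once every address is tried
rescue StopIteration
  raise last_error # out of addresses: re-raise the last error that occurred
rescue StandardError => e
  last_error = e
  retry # move on to the next resolved address
end
```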