apns2: c.HTTPClient.Do (in c.Push) hanging intermittently
As a preface, I’m reporting this to open a dialog. I think the ultimate problem will be with something I’m doing or a bug deep within the http2 libs of Go.
I’ve noticed that occasionally the call to c.HTTPClient.Do hangs indefinitely. The problem occurs intermittently and seemingly not because of the certificate used in the connection. Given enough retries (where the connection is remade by creating a new apns2.Client), it will succeed without error.
I’m not convinced this is a network issue. It seems something is deadlocking within the http2 libs. I set c.HTTPClient.Timeout to 1 second, which never triggers. Additionally, I spin up a timeout goroutine of 3 seconds, which is how I determine that something is hanging, and at which point I attempt a retry. (As a side note, I just realized this may not cleanly kill the connection. Perhaps I should call CloseIdleConnections on the http2 transport?) Despite setting GODEBUG=http2debug=2, no http2 logs are printed.
My sender is massively concurrent, having thousands of goroutines at any moment, but my interpretation is that if it gets to c.HTTPClient.Do, which is documented to be safe for concurrent use, and then hangs, then there is a problem within the http2 libs.
Am I weirdly running out of possible connections or something? Does anyone have thoughts on this?
About this issue
- State: closed
- Created 8 years ago
- Reactions: 1
- Comments: 19 (15 by maintainers)
It seems to me that there are 2 problems:
It might be a good idea to include a default timeout in apns2 until one is hopefully added in Go. Otherwise any use of apns2 is broken by default until the user of the library adds their own timeout.
@zjx20 Today we experienced another halt despite giving the TLS handshake a timeout, so based on your report we tried giving http.Client a timeout as well. Just a few hours later, we started receiving client timeouts for the same certificate. Thus, I can confirm that the lack of a timeout for http.Client can also cause infinite locks.

Here are a few more findings and tests:
- [...] json.Decode instead of ioutil.ReadAll. I’ve changed the Push function to use the latter, alas the problem persists.
- [...] Push, checking if the error string contains DialWithDialer and retrying the push (which retriggers DialTLS) seems to work well. So far, no more than 2 tries were required.

@sideshow If it is agreed that timing out DialTLS fixes the hanging problems and has no side effects, we might even consider integrating the retry logic within Push to make the process transparent to package users.

Edit: It seems like the error string can also contain i/o timeout. We are not sure what makes them alternate.

Edit 2: We have now started seeing some cases where redials do not end up opening a successful connection. Said posts randomly start working after a few minutes. Even though these posts don’t end up succeeding, at least they are not hanging around forever.
Obviously related to https://github.com/sideshow/apns2/issues/17