girc: execLoop and sendLoop not getting closed if network connection is low-quality
It appears that if a low-quality network connection (e.g. being behind a Whonix Tor transproxy) is being used, the execLoop
and/or sendLoop
goroutines aren’t reliably being closed properly after a disconnect.
Originally reported at https://github.com/42wim/matterbridge/issues/1743 (detailed logs are in that issue) ; @42wim suggested that I report it here. I’m happy to help with trying to narrow down the bug, e.g. by collecting more detailed logs, but my ability to help may be limited since I’m not the author of the affected software. In any event please let me know what I can do to help get this fixed.
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 1
- Comments: 20 (13 by maintainers)
Commits related to this issue
- Update girc dependency again Includes an important fix for a potential hangup after ping timeout. See https://github.com/lrstanley/girc/issues/50 — committed to nmeum/hii by nmeum 2 years ago
- fix(core): better contexst cancellation flow for connections in an attempt to fix #50 - this commit adds various timeout and context flows to ensure goroutines don't stay waiting on a resource Signe... — committed to lrstanley/girc by lrstanley 2 years ago
Another item – from that goroutine dump, looks like there are a lot of handlers inside of matterbridge which are waiting to send data to some channel inside of matterbridge itself. I am not sure if that’s directly related to this issue, or a cause for another issue. Non-background handlers registered with girc will cause girc to block until they’ve completed, including waiting for the client to close, whereas background handlers are fire and forget.
I’ve pushed https://github.com/lrstanley/girc/commit/da08de25b4527e50247dc3f8afd589c73b5e0634 to branch
bugfix/issue-50
, with what I think is a sufficient enough fix, however it’s not very clean, and I haven’t extensively tested it.girc was initially released a few months after
context.Context
was added to the standard library. As such, my original adoption was very limited (and I really wish I had rewritten to support it more). Much of the core functionality isn’t the most idiomatic Go due to that, and me not using channels properly back then.I’d update to be more idiomatic, however it’d break the API, and frankly, it’d be a lot of work. So here is to hoping this fix works. 😄
Sorry for the delay – taking a look now.
Here’s a goroutine dump of the hang, with Matterbridge 81e6f75aa491c7bfa11780dd622587cf5fff3c8f:
@lrstanley thanks for taking a look! @JeremyRand you can find binaries with the patch in on https://github.com/42wim/matterbridge/actions/runs/2019172615 or build from https://github.com/42wim/matterbridge/pull/1773
I believe I’ve figured out what the issue is, and put a potential patch in with https://github.com/lrstanley/girc/commit/887ab30b9028c8c8f178aa73dcac50366a681052 – if @42wim can make a temporary build with these changes, to test things out, that would help.
If the issue is still occurring, I’d likely need a goroutine dump during the hang to be able to best troubleshoot it. That would help determine what is causing the goroutine(s) to hang.