suture: Handling fatal errors
At Syncthing we use suture to manage various components of the application as services and sub-supervisors. For some of those there are expected reasons why they might trigger the entire application to shut down; for others there are circumstances where just restarting the single component/service isn’t viable (e.g. db corruption). To tackle that, there has to be some way to signal a fatal error that prompts all services to be stopped. I made a prototype wrapping suture such that the Serve signature becomes Serve() error and there’s a FatalErr type: if the returned error is of that type, its supervisor stops all other services and its own Serve method returns the same error, thus taking the entire tree down. The prototype is at https://github.com/syncthing/syncthing/pull/6849. Naturally it would be even nicer if such a mechanism, or something completely different that achieves the same purpose, were part of suture itself. At the same time it’s clear that the current proposal of changing Serve is a rather fundamental change.
What are your thoughts on the problem and/or whether a similar or different mechanism could fit into suture?
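To make the proposal concrete, here is a minimal sketch of the idea described above. It is not the code from the linked PR and it does not use suture: the Supervisor type, the IsFatal helper, the worker and failing demo services, and errCorrupt are all hypothetical names invented for illustration. Only the Serve() error signature and the FatalErr name come from the description.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// FatalErr is a marker type: wrapping an error in it asks the supervisor to
// stop everything instead of restarting the failed service.
type FatalErr struct{ Err error }

func (e *FatalErr) Error() string { return "fatal: " + e.Err.Error() }
func (e *FatalErr) Unwrap() error { return e.Err }

// IsFatal (hypothetical helper) reports whether err is, or wraps, a *FatalErr.
func IsFatal(err error) bool {
	var f *FatalErr
	return errors.As(err, &f)
}

// Service mirrors the proposed change: Serve returns an error.
type Service interface {
	Serve() error
	Stop()
}

// Supervisor is a toy stand-in for a suture supervisor (assumption).
type Supervisor struct{ services []Service }

func (s *Supervisor) Add(svc Service) { s.services = append(s.services, svc) }

// Serve runs all services, restarts them on ordinary errors, and on the first
// fatal error stops every service and returns that error to its own caller.
func (s *Supervisor) Serve() error {
	fatal := make(chan error, len(s.services)) // buffered: senders never block
	stopping := make(chan struct{})
	var wg sync.WaitGroup
	for _, svc := range s.services {
		wg.Add(1)
		go func(svc Service) {
			defer wg.Done()
			for {
				err := svc.Serve()
				if IsFatal(err) {
					fatal <- err
					return
				}
				select {
				case <-stopping:
					return // tree is going down: do not restart
				default:
				}
				if err == nil {
					return // clean exit
				}
				// Ordinary error: loop around and restart the service.
			}
		}(svc)
	}
	done := make(chan struct{})
	go func() { wg.Wait(); close(done) }()

	select {
	case err := <-fatal:
		close(stopping)
		for _, svc := range s.services {
			svc.Stop()
		}
		<-done // wait until every service has actually returned
		return err
	case <-done:
		select {
		case err := <-fatal: // a fatal error raced with the last exit
			return err
		default:
			return nil
		}
	}
}

// worker is a demo service that blocks until stopped.
type worker struct {
	stop chan struct{}
	once sync.Once
}

func newWorker() *worker       { return &worker{stop: make(chan struct{})} }
func (w *worker) Serve() error { <-w.stop; return nil }
func (w *worker) Stop()        { w.once.Do(func() { close(w.stop) }) }

// failing is a demo service that returns a fixed error immediately.
type failing struct{ err error }

func (f *failing) Serve() error { return f.err }
func (f *failing) Stop()        {}

var errCorrupt = errors.New("database corrupted")

func main() {
	sup := &Supervisor{}
	sup.Add(newWorker())
	sup.Add(&failing{err: &FatalErr{Err: errCorrupt}})
	err := sup.Serve()
	fmt.Println(IsFatal(err), err)
}
```

Because a sub-supervisor's Serve returns the same fatal error, adding one supervisor to another as a service would propagate the error upwards, which is what takes the entire tree down.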
About this issue
- State: closed
- Created 4 years ago
- Comments: 43 (43 by maintainers)
I have pushed a new version of my “jcontext” branch that should recover the original functionality of suture with the new context stuff. However, it does not yet have tests around the new errors. I’m also going to test what happens if you get errors from the context (deadline exceeded, etc.) and do my best not to restart services just for them to be torn down immediately again, along with any other corner cases I can think of. So there could still be some funkiness in the teardown procedures when a context is cancelled, or in the handling of the new errors. There are a few other aspects I’m going to add tests for too.
Basically, it should “work” enough for testing purposes, and I’m interested in any odd behaviors you see if you do give it a try, but I’m explicitly not claiming it to be “done” yet, just so we’re all on the same page.