failsafe: Restrictive time window for circuit breaker to record failures

Right now it’s possible to configure failure thresholds in terms of consecutive failures:

  • withFailureThreshold(4, 5), four failures out of five consecutive executions
  • withFailureThreshold(5, 10), five failures out of ten consecutive executions
  • etc.

There is no notion of time right now. What I’m hoping for would be the ability to specify something like:

  • 10 failures within 1 second
  • 50 failures out of 200 executions within a minute

My idea behind this is to effectively disable the circuit breaker in low-traffic scenarios, but have it kick in if traffic suddenly increases. For low-traffic endpoints (fewer than ~10 requests per second) we found that circuit breakers actually make matters worse, since once they open they tend to stay open for longer. And ultimately circuit breakers are a means to protect an application from overload, which doesn’t really happen with few requests.
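To make the request concrete, here is a minimal sketch of a time-based failure threshold such as "10 failures within 1 second". It is not Failsafe API: the class `TimedFailureThreshold` and its method are hypothetical, and timestamps are passed in explicitly so the behavior is easy to follow.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch, not Failsafe API: trips once enough failures fall inside a time window. */
final class TimedFailureThreshold {
    private final int failureThreshold;
    private final long windowMillis;
    // Timestamps of recent failures, oldest first.
    private final Deque<Long> failureTimes = new ArrayDeque<>();

    TimedFailureThreshold(int failureThreshold, long windowMillis) {
        this.failureThreshold = failureThreshold;
        this.windowMillis = windowMillis;
    }

    /** Records a failure at nowMillis and returns true if the threshold is reached. */
    boolean recordFailure(long nowMillis) {
        failureTimes.addLast(nowMillis);
        // Drop failures that are no longer inside the window ending at nowMillis.
        long cutoff = nowMillis - windowMillis;
        while (!failureTimes.isEmpty() && failureTimes.peekFirst() <= cutoff) {
            failureTimes.removeFirst();
        }
        return failureTimes.size() >= failureThreshold;
    }
}
```

With a threshold of 3 failures per 1000 ms, failures that arrive five seconds apart never trip the breaker, while three failures within 200 ms do, which is the low-traffic behavior described above.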

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 19 (11 by maintainers)

Most upvoted comments

Oh OK, got it. I have always used Failsafe in a context where there is a constant flow of executions, but I see that it’s a valid concern in other scenarios. I would still want a minPeriod though; it would be useful in cases where a higher failure rate is expected at startup (a “warming up” period).

This feature is potentially problematic since the threshold could easily be exceeded early on. For example, a 20% ratio would be reached or exceeded if any of the first 5 calls fails.

Very good point, we need something to address it. This makes a lot of sense:

withFailureThreshold(failures, timePeriod) // minExecutions = failures
withFailureThreshold(failures, minExecutions, timePeriod)

I’m less sure about:

withFailureThreshold(failureRatio, minExecutions, timePeriod)

As I mentioned in #214, the number of executions is not always something we control. If a failure ratio is provided, a more natural way would be:

withFailureThreshold(failureRatio, minPeriod, timePeriod)
withFailureThreshold(failureRatio, timePeriod) // minPeriod = timePeriod
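To illustrate the early-trip problem and what a minExecutions guard changes, here is a minimal sketch. It is not Failsafe API: the class `GuardedRatioThreshold` is hypothetical, and eviction by timePeriod is left out to keep the focus on the guard. A minPeriod variant would replace the execution-count check with an elapsed-time check.

```java
/** Hypothetical sketch, not Failsafe API: failure-ratio check with a minimum-executions guard. */
final class GuardedRatioThreshold {
    private final double failureRatio;
    private final int minExecutions;
    private int executions;
    private int failures;

    GuardedRatioThreshold(double failureRatio, int minExecutions) {
        this.failureRatio = failureRatio;
        this.minExecutions = minExecutions;
    }

    /** Records one execution and returns true if the breaker should open. */
    boolean record(boolean failed) {
        executions++;
        if (failed) {
            failures++;
        }
        // Guard: the ratio is not evaluated until enough executions have been seen.
        if (executions < minExecutions) {
            return false;
        }
        // Assumption: reaching the ratio exactly also counts as tripping.
        return (double) failures / executions >= failureRatio;
    }
}
```

With minExecutions = 1 and a 50% ratio, a failure on the very first call trips immediately. With minExecutions = 4, the same first failure followed by three successes gives 1 out of 4 and does not trip.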

Regarding the number of buckets, I believe it makes more sense for users to think about it as a time resolution (i.e. how often the window moves). Setting it as a number of buckets may make the performance cost a bit more transparent, but it requires a sense of how it’s implemented. I would be in favor of setting it like this:

withResolution(timePeriod)

If nothing is provided by the user, then I guess it makes sense to base it on the size of the window and have a fixed number of buckets; 10 seems reasonable.
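As a rough sketch of how a resolution could map onto buckets, assuming a simple ring of per-slot counters: the class `BucketedFailureWindow` and its methods are hypothetical, not Failsafe API, and the sketch assumes the window is an exact multiple of the resolution.

```java
import java.util.Arrays;

/** Hypothetical sketch, not Failsafe API: failure counts kept in fixed-size time buckets. */
final class BucketedFailureWindow {
    static final int DEFAULT_BUCKETS = 10; // default suggested in the discussion

    private final long bucketMillis; // the "resolution": how often the window moves
    private final long[] slots;      // which time slot each bucket currently holds
    private final int[] failures;    // failures recorded in that slot

    /** No resolution given: derive it from the window size using a fixed bucket count. */
    BucketedFailureWindow(long windowMillis) {
        this(windowMillis, windowMillis / DEFAULT_BUCKETS);
    }

    BucketedFailureWindow(long windowMillis, long resolutionMillis) {
        if (windowMillis <= 0 || resolutionMillis <= 0 || windowMillis % resolutionMillis != 0) {
            throw new IllegalArgumentException("window must be a positive multiple of the resolution");
        }
        int bucketCount = (int) (windowMillis / resolutionMillis);
        this.bucketMillis = resolutionMillis;
        this.slots = new long[bucketCount];
        this.failures = new int[bucketCount];
        Arrays.fill(slots, -1L); // -1 marks a bucket that has never been used
    }

    /** Records one failure at the given (non-negative) timestamp. */
    void recordFailure(long nowMillis) {
        long slot = nowMillis / bucketMillis;
        int i = (int) (slot % slots.length);
        if (slots[i] != slot) { // bucket holds data from an earlier lap: reuse it
            slots[i] = slot;
            failures[i] = 0;
        }
        failures[i]++;
    }

    /** Sums failures in the buckets that still fall inside the window ending at nowMillis. */
    int failuresInWindow(long nowMillis) {
        long newest = nowMillis / bucketMillis;
        long oldest = newest - slots.length + 1;
        int total = 0;
        for (int i = 0; i < slots.length; i++) {
            if (slots[i] >= oldest && slots[i] <= newest) {
                total += failures[i];
            }
        }
        return total;
    }
}
```

A 1000 ms window with no explicit resolution gets ten 100 ms buckets, so the window advances in 100 ms steps. Passing a 500 ms resolution yields two buckets: cheaper to maintain, but failures drop out of the window in coarser jumps.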

Anyway, it’s awesome to see this getting interest!