swarmkit: Rolling Updates: Failure Threshold
Updated Issue
Currently, a single failing task is enough for a rolling update to “fail”.
This is flawed: in large deployments it is common for a task to fail for reasons completely unrelated to the update.
Original Issue
In the current implementation of rolling updates, the updater waits indefinitely for a task to move to RUNNING. If the task never gets there, that goroutine is stuck almost forever. Almost, because after a manager restart or failover the rolling update is attempted again.
We need to define a better model. Perhaps:
- Time out while waiting for a task to become running
- Abort update after a threshold of failures is observed
- Roll back the update
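The first two points above can be sketched in Go roughly as follows. This is only an illustration under stated assumptions, not swarmkit's updater: `TaskState`, `waitForRunning`, `runUpdate`, `stateChan`, `demo`, and the `maxFailures` parameter are all hypothetical names invented for the example, and a task is modeled as a channel that delivers its final state.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// TaskState is a hypothetical, simplified task state for this sketch.
type TaskState int

const (
	StateRunning TaskState = iota
	StateFailed
)

// ErrUpdateAborted is returned once the failure threshold is reached.
var ErrUpdateAborted = errors.New("update aborted: failure threshold reached")

// waitForRunning blocks until the task reports a state or the timeout
// expires. A timeout counts as a failure, so the goroutine never waits forever.
func waitForRunning(ctx context.Context, states <-chan TaskState, timeout time.Duration) bool {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	select {
	case s := <-states:
		return s == StateRunning
	case <-ctx.Done():
		return false
	}
}

// runUpdate updates tasks one at a time and aborts as soon as
// maxFailures tasks have failed or timed out.
func runUpdate(ctx context.Context, tasks []<-chan TaskState, timeout time.Duration, maxFailures int) (int, error) {
	failures := 0
	for _, t := range tasks {
		if !waitForRunning(ctx, t, timeout) {
			failures++
			if failures >= maxFailures {
				return failures, ErrUpdateAborted
			}
		}
	}
	return failures, nil
}

// stateChan returns a channel that already holds one state.
func stateChan(s TaskState) <-chan TaskState {
	c := make(chan TaskState, 1)
	c <- s
	return c
}

// demo runs an update where one task never reports and another fails.
func demo() (int, error) {
	never := make(chan TaskState) // a task that never reaches any state
	tasks := []<-chan TaskState{
		stateChan(StateRunning), never, stateChan(StateFailed), stateChan(StateRunning),
	}
	return runUpdate(context.Background(), tasks, 50*time.Millisecond, 2)
}

func main() {
	failures, err := demo()
	fmt.Println(failures, err)
}
```

In this sketch the stuck task is counted as a failure after 50 ms, the explicitly failed task brings the count to two, and the update stops before the fourth task is touched. Rollback is left out here.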
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Comments: 30 (18 by maintainers)
I’m starting to work on auto-rollback. I think there’s a decent chance we can do it for 1.12.
Here’s what I have in mind:
The UpdateService controlapi handler will rotate the old version of the spec to the field containing the previous version if the spec is updated.
Thoughts?
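To make the spec rotation concrete, here is a minimal Go sketch. The types and functions (`ServiceSpec`, `Service`, `PreviousSpec`, `updateService`, `rollback`) are simplified, hypothetical stand-ins chosen for illustration, not the actual swarmkit API, and the `rollback` helper is an assumption about how the saved spec might later be used.

```go
package main

import "fmt"

// ServiceSpec is a hypothetical, heavily simplified service spec.
type ServiceSpec struct {
	Image    string
	Replicas int
}

// Service holds the current spec and, after an update, the one it replaced.
type Service struct {
	Spec         ServiceSpec
	PreviousSpec *ServiceSpec // nil until the spec has been updated once
}

// updateService rotates the old spec into PreviousSpec, but only when
// the new spec actually differs from the current one.
func updateService(s *Service, newSpec ServiceSpec) {
	if s.Spec == newSpec {
		return // nothing changed, so nothing to rotate
	}
	old := s.Spec
	s.PreviousSpec = &old
	s.Spec = newSpec
}

// rollback restores the previous spec if one was saved (assumed behavior).
func rollback(s *Service) bool {
	if s.PreviousSpec == nil {
		return false
	}
	s.Spec = *s.PreviousSpec
	s.PreviousSpec = nil
	return true
}

func main() {
	svc := &Service{Spec: ServiceSpec{Image: "app:v1", Replicas: 3}}
	updateService(svc, ServiceSpec{Image: "app:v2", Replicas: 3})
	fmt.Println(svc.Spec.Image, svc.PreviousSpec.Image)
	fmt.Println(rollback(svc), svc.Spec.Image)
}
```

Keeping only one previous version means a second update overwrites the rollback target, which seems consistent with the "field containing the previous version" wording above.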
I should have specified. It would be the number of updated tasks that get to a running state, divided by the total number of tasks (new and old). For a replicated service, the denominator would be equivalent to the replica count (but we probably wouldn’t implement it that way, to keep the updater generic).
The threshold in the spec will be given as a fraction or percentage (haven’t decided), but yes, the condition for rollback would essentially be FailedTasks >= (1-SuccessFraction)*TotalTasks.
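A small Go sketch of that condition, for illustration only. It assumes the threshold is given as an integer percentage (the comment leaves fraction versus percentage undecided), which lets the comparison stay in integer arithmetic and avoids floating-point rounding at the boundary. The function name `shouldRollback` is hypothetical, and the extra "at least one failure" guard is an assumption of this sketch, since the formula taken literally would trigger with zero failures when the success threshold is 100%.

```go
package main

import "fmt"

// shouldRollback evaluates FailedTasks >= (1-SuccessFraction)*TotalTasks,
// with SuccessFraction expressed as successPercent/100. Both sides are
// multiplied by 100 so the comparison uses integers only.
func shouldRollback(failedTasks, totalTasks, successPercent int) bool {
	if totalTasks <= 0 || failedTasks <= 0 {
		return false // assumption: never roll back without any failure
	}
	return failedTasks*100 >= (100-successPercent)*totalTasks
}

func main() {
	// 10 tasks with an 80% success threshold: 2 failures reach the limit.
	fmt.Println(shouldRollback(1, 10, 80)) // false
	fmt.Println(shouldRollback(2, 10, 80)) // true
}
```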