swarmkit: Rolling Updates: Failure Threshold

Updated Issue

Currently, a single failed task is all it takes for a rolling update to “fail”.

This is somewhat flawed, since in large deployments it is fairly common for a task to fail for completely unrelated reasons.

Original Issue

In the current implementation of rolling updates, the updater waits forever for a task to move to RUNNING. If the task never does, that goroutine is stuck almost forever: almost, because after a manager restart or failover the rolling update is attempted again.

We need to define a better model. Perhaps:

  • Time out while waiting for a task to become RUNNING
  • Abort the update after a threshold of failures is observed (a rough sketch of the first two ideas follows this list)
  • Roll back the update
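
For illustration, here is a minimal Go sketch of the first two ideas (a bounded wait plus a failure threshold). The types and helpers here (TaskState, waitForRunning, runUpdate) are hypothetical stand-ins, not swarmkit's actual updater API:

```go
package updater

import (
	"context"
	"errors"
	"time"
)

// TaskState is a stand-in for swarmkit's task states; only the values
// needed for this sketch are listed.
type TaskState int

const (
	StatePending TaskState = iota
	StateRunning
	StateFailed
)

// waitForRunning blocks until the task reaches RUNNING, fails, or the
// timeout elapses, instead of waiting forever as the current updater does.
func waitForRunning(ctx context.Context, states <-chan TaskState, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	for {
		select {
		case s := <-states:
			switch s {
			case StateRunning:
				return nil
			case StateFailed:
				return errors.New("task failed")
			}
		case <-ctx.Done():
			return errors.New("timed out waiting for task to reach RUNNING")
		}
	}
}

// runUpdate aborts the rolling update once more than maxFailures tasks
// fail to reach RUNNING, rather than giving up on the first bad task.
func runUpdate(ctx context.Context, tasks []<-chan TaskState, timeout time.Duration, maxFailures int) error {
	failures := 0
	for _, states := range tasks {
		if err := waitForRunning(ctx, states, timeout); err != nil {
			failures++
			if failures > maxFailures {
				return errors.New("update aborted: failure threshold exceeded")
			}
		}
	}
	return nil
}
```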

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 30 (18 by maintainers)

Most upvoted comments

I’m starting to work on auto-rollback. I think there’s a decent chance we can do it for 1.12.

Here’s what I have in mind:

  • The service will have the previous version of the spec as well as the current version. When the spec is updated, the UpdateService controlapi handler will rotate the outgoing spec into the field containing the previous version.
  • The service spec will contain a threshold for how many tasks must end up in a running state, given as a fraction or percentage. I want to express it this way, rather than as a fraction or percentage of tasks that may fail, so that a value of 0 means “never roll back”.
  • During a rolling update, we watch all tasks that were created by the rolling update. When any of those fail, we increment a local counter that we compare against the threshold. At the beginning of the update, we initialize the counter to the number of failed tasks with the current version of the task spec, so that if there’s a manager failover in the middle of the update, we’re still counting correctly (a rough sketch of this bookkeeping follows this list).
  • When the number of failed tasks exceeds the threshold, replace the service spec with the “old” value and kick off another rolling update. The “old” value will now match the current value, so it will not be possible to oscillate between spec versions.
  • The orchestrator won’t know anything about health checks. All it cares about is whether containers reach the RUNNING state and stay there for the duration of the update. Once the update is over, automatic rollback can no longer be triggered.
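
A rough Go sketch of how that bookkeeping might look. The types and names below (ServiceSpec, Service, Task, updateService, rollBack, initialFailureCount) are illustrative assumptions, not swarmkit's actual API:

```go
package updater

// ServiceSpec and Service are illustrative stand-ins for swarmkit's real
// service objects.
type ServiceSpec struct{ Version string }

type Service struct {
	Spec         ServiceSpec // current spec
	PreviousSpec ServiceSpec // rotated in by the UpdateService handler
}

// updateService mimics the proposed controlapi behaviour: the outgoing spec
// is rotated into PreviousSpec whenever the spec changes.
func updateService(svc *Service, newSpec ServiceSpec) {
	svc.PreviousSpec = svc.Spec
	svc.Spec = newSpec
}

// rollBack restores the previous spec. Afterwards PreviousSpec equals Spec,
// so another automatic rollback cannot oscillate between versions.
func rollBack(svc *Service) {
	svc.Spec = svc.PreviousSpec
}

type Task struct {
	SpecVersion string
	Failed      bool
}

// initialFailureCount seeds the updater's local counter with tasks that have
// already failed on the current spec, so a manager failover in the middle of
// an update does not lose previously observed failures.
func initialFailureCount(svc Service, tasks []Task) int {
	n := 0
	for _, t := range tasks {
		if t.SpecVersion == svc.Spec.Version && t.Failed {
			n++
		}
	}
	return n
}
```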

Thoughts?

This is counted across tasks matching old and new spec, right?

I should have specified. It would be the number of updated tasks that get to a running state, divided by the total number of tasks (new and old). For a replicated service, the denominator would be equivalent to the replica count (but we probably wouldn’t implement it that way, to keep the updater generic).

This threshold here is the threshold provided above (in the spec) subtracted from the total tasks, right?

The threshold in the spec will be given as a fraction or percentage (haven’t decided), but yes, the condition for rollback would essentially be FailedTasks >= (1-SuccessFraction)*TotalTasks.
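
To make that arithmetic concrete, here is a small hypothetical helper (the parameter names are assumptions, not swarmkit fields):

```go
package updater

// shouldRollBack is a hypothetical check for the condition above:
// FailedTasks >= (1 - SuccessFraction) * TotalTasks.
// A successFraction of 0 is treated as "never roll back".
func shouldRollBack(failedTasks, totalTasks int, successFraction float64) bool {
	if successFraction == 0 {
		return false
	}
	return float64(failedTasks) >= (1-successFraction)*float64(totalTasks)
}
```

For example, with 10 total tasks and a success fraction of 0.8, (1 - 0.8) * 10 = 2, so rollback would be triggered once 2 updated tasks fail to reach or stay in RUNNING.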