etcd: Restoring etcd without breaking API guarantees, aka etcd restore for Kubernetes
What would you like to be added?
This is the etcd-side issue for https://github.com/kubernetes/kubernetes/issues/118501
tl;dr: The etcd restore operation breaks the existing etcd guarantee that revisions never decrease.
Proposal
- The etcd restore operation should support bumping the revision of subsequent etcd operations by an arbitrary number.
- The etcd restore operation should support marking old revisions as compacted.
- The etcd restore operation should support bumping the revision of the latest key versions.
Tracking work
- Implement an optional `--bump-revision` flag for the `etcdutl snapshot restore` operation.
- Implement an optional `--mark-compacted` flag for the `etcdutl snapshot restore` operation.
- Confirm that those changes are backport compatible with v3.5 and v3.4.
- Add etcd restore e2e tests (see the sketch after this list) that confirm that:
  - Revision doesn't decrease if the value provided via `--bump-revision` is high enough
  - Client is notified about the restore by its watch being broken
  - Client cannot access revisions from between the backup and restore operations
- Backport the changes and tests to v3.5 and v3.4 etcd.
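As a rough illustration of the first e2e assertion, here is a minimal client-side sketch using clientv3; the endpoint and key are placeholders, and this is not the actual test code from the PR:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// currentRevision reads the store revision from any response header.
func currentRevision(cli *clientv3.Client) (int64, error) {
	resp, err := cli.Get(context.Background(), "placeholder-key")
	if err != nil {
		return 0, err
	}
	return resp.Header.Revision, nil
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	before, err := currentRevision(cli) // recorded before snapshot + restore
	if err != nil {
		panic(err)
	}
	// ... take snapshot, restore with --bump-revision/--mark-compacted,
	// restart the cluster, reconnect ...
	after, err := currentRevision(cli)
	if err != nil {
		panic(err)
	}
	if after < before {
		panic(fmt.Sprintf("revision decreased across restore: %d -> %d", before, after))
	}
	fmt.Println("revision did not decrease:", before, "->", after)
}
```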
Why is this needed?
Without any official guidance, each vendor/administrator had a different plan for etcd restore, many of which were broken and incorrect. Having an official, guided restore operation allows the whole community to work together and share their experience, leading to improvements in Kubernetes disaster recovery handling.
About this issue
- State: closed
- Created a year ago
- Reactions: 2
- Comments: 19 (19 by maintainers)
Just had a quick discussion with @serathius. There are two things.
On “marking old revisions as compacted”: it’s actually just setting the `scheduledCompactRev` field in the Meta bucket. If `finishedCompactRev != scheduledCompactRev`, it means that the previous compaction did not finish. When etcd starts, it will try to finish the compaction operation (so there is no performance difference between the two approaches). It’s the minimal change in etcdutl, and zero change for etcdserver. The only concern is that it’s a little hacky. We need to test it carefully. Let’s follow this approach for now.
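To make that mechanic concrete, here is a tiny standalone sketch of the invariant being exploited; the function name is illustrative, not etcd's actual code:

```go
package main

import "fmt"

// needsCompactionResume mirrors the startup check described above:
// if finishedCompactRev differs from scheduledCompactRev, etcd treats
// the compaction as unfinished and resumes it.
func needsCompactionResume(finishedCompactRev, scheduledCompactRev int64) bool {
	return finishedCompactRev != scheduledCompactRev
}

func main() {
	// After a restore that only bumps scheduledCompactRev (the
	// --mark-compacted idea), the server sees a "pending" compaction
	// and finishes it itself on startup.
	fmt.Println(needsCompactionResume(90, 100)) // true: compaction resumes
}
```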
From an etcd cluster administrator's perspective, I would like the `etcdctl snapshot restore` flag to be named `desired-compacted-revision`. I don’t want to calculate how much to bump the revision since the last etcd snapshot; I just want to supply a desired resource version to be compacted, based on observations in the apiserver logs.
As a mitigation, suffixing the snapshot db file name with the revision and doing a calculation afterwards also works… It’s just not as convenient as supplying the desired revision directly.
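A minimal sketch of the calculation that the flag would avoid; both names are hypothetical, for illustration only:

```go
// bumpAmount returns the --bump-revision value needed to land at a desired
// compacted revision, given the revision recorded in the snapshot file name.
func bumpAmount(desiredRev, snapshotRev int64) int64 {
	return desiredRev - snapshotRev
}
```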
I’ve been able to verify that marking a revision as compacted via `mvcc.UnsafeSetScheduledCompact(tx, revision)` does cause the server to think it has a compaction in progress, which it will finish (sometime on startup after the restore), and it causes watchers to get canceled with the `required revision has been compacted` error. The following is from running the server after restoring with my most recent changes in my draft PR, with the `--bump-revision 1000000 --mark-compacted` arguments.
Thanks for all the discussions in the meantime. I think I’m also a little further along in my understanding and will try to summarize in more detailed pieces.
Let’s look at this from a K8s API point of view first. We want to ensure all operators/controllers consuming the watch/list API receive a signal to invalidate their caches. Working backwards from the official docs, we can find that a 410 (Gone) should be returned to signal a fresh listing from the start.
That, in turn, is always triggered by etcd returning `etcdrpc.ErrCompacted` when using etcd watches.
From etcd’s point of view, we want to ensure that we have a high enough revision to start from after restoring from a snapshot, before we serve any data to the outside world. Yet we also need to ensure that we don’t serve any history between the backup revision (BR) and the restored revision (NR, basically BR plus a big enough integer). Conveniently, this is exactly what compaction does.
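A minimal client-side sketch of that signal path, assuming a local placeholder endpoint: a watch started at a revision inside the compacted range gets canceled, and `resp.Err()` reports `ErrCompacted`.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/api/v3/v3rpc/rpctypes"
	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Watch from a revision between BR and NR; after a restore that marks
	// that range compacted, the server cancels the watch.
	wch := cli.Watch(context.Background(), "foo", clientv3.WithRev(42))
	for resp := range wch {
		if resp.Canceled {
			// This is the signal the apiserver turns into a 410 Gone.
			fmt.Println("watch canceled:", resp.Err(),
				"compacted:", resp.Err() == rpctypes.ErrCompacted)
			return
		}
	}
}
```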
For that, do we really need to run a full compaction? Or is it enough to mark a revision range as already compacted?
I think there are multiple approaches here. One that’s hacky would be to use `UnsafeSetScheduledCompact` to bump the `scheduledCompactRev` schema key. Not sure if this will already do the trick; @dusk125 wanted to check that out next. There’s also an opportunity in changing `compactMainRev` in mvcc’s kvstore.go, going through the code.
Rewinding a little: how do we establish the revision in the snapshot?
Similar to Allen’s approach, we discussed O(n) looping through each key in the snapshot to establish which revision is the highest. If you look into kvstore.go, it already does that. I think there were some reservations around the performance impact of its linear runtime.
Since creating a snapshot already requires iteration, we would also propose caching this information in the snapshot proto as additional/trailing metadata. When contained in a snapshot, it would allow us to skip iterating the whole key space when restoring.
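For reference, a standalone sketch of that linear scan, assuming etcd's mvcc bolt key layout (8-byte big-endian main revision, `'_'`, 8-byte sub revision); this stands in for what kvstore.go does, it is not the actual code:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// highestMainRev scans mvcc-style bolt keys and returns the largest main
// revision, mirroring the O(n) pass described above.
func highestMainRev(boltKeys [][]byte) int64 {
	var max int64
	for _, k := range boltKeys {
		if main := int64(binary.BigEndian.Uint64(k[:8])); main > max {
			max = main
		}
	}
	return max
}

// revKey builds a fake 17-byte revision key for the demo below.
func revKey(main, sub int64) []byte {
	k := make([]byte, 17)
	binary.BigEndian.PutUint64(k[:8], uint64(main))
	k[8] = '_'
	binary.BigEndian.PutUint64(k[9:], uint64(sub))
	return k
}

func main() {
	keys := [][]byte{revKey(5, 0), revKey(9, 1), revKey(7, 0)}
	fmt.Println(highestMainRev(keys)) // 9
}
```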
I think we already have `etcdctl compaction <rev>`, which does that, albeit online.
Maybe before we go into the debate on what we need to implement for compaction: do we even need to compact? I think what Marek meant is that we return the current revision as “compacted” to force a reset of the informer caches, not do an actual compaction.
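On the `etcdctl compaction <rev>` point above: the online path boils down to a single clientv3 call. A short sketch, with a placeholder endpoint and revision:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Compact everything below revision 1000; watchers behind that
	// revision will then see ErrCompacted, just like after a restore
	// that marks the range compacted.
	if _, err := cli.Compact(context.Background(), 1000); err != nil {
		panic(err)
	}
	fmt.Println("compacted up to revision 1000")
}
```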