etcd: Restoring etcd without breaking API guarantees, aka etcd restore for Kubernetes

What would you like to be added?

This is the etcd-side issue for https://github.com/kubernetes/kubernetes/issues/118501.

tl;dr: The etcd restore operation breaks the existing etcd guarantee that revisions never decrease.

Proposal

  • The etcd restore operation should support bumping the revision of subsequent etcd operations by an arbitrary number.
  • The etcd restore operation should support marking old revisions as compacted.
  • The etcd restore operation should support bumping the revision of the latest key versions.

Tracking work

  • Implement an optional --bump-revision flag for the etcdutl snapshot restore operation.
  • Implement an optional --mark-compacted flag for the etcdutl snapshot restore operation.
  • Confirm that those changes are backport-compatible with v3.5 and v3.4.
  • Add e2e tests for etcd restore that confirm the following (a rough client-side sketch of these checks follows this list):
    • The revision doesn't decrease if the value provided via --bump-revision is high enough
    • Clients are notified about the restore by their watches being broken
    • Clients cannot access revisions from between the backup and restore operations
  • Backport the changes and tests to v3.5 and v3.4 etcd.
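A rough sketch (not part of the tracked work items) of what such a client-side check could look like against a restored cluster, using the public clientv3 API; the endpoint, key name, and preRestoreRev value below are placeholders:

package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"time"

	"go.etcd.io/etcd/api/v3/v3rpc/rpctypes"
	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// preRestoreRev stands in for the highest revision a client observed
	// before the backup/restore cycle; the value is a placeholder.
	const preRestoreRev int64 = 1_200_000

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // restored cluster endpoint (placeholder)
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// 1. The current revision must not have gone backwards across the restore.
	resp, err := cli.Get(ctx, "foo")
	if err != nil {
		log.Fatal(err)
	}
	if resp.Header.Revision < preRestoreRev {
		log.Fatalf("revision decreased after restore: %d < %d", resp.Header.Revision, preRestoreRev)
	}

	// 2. Watching from a revision that falls between backup and restore must fail
	// with ErrCompacted, which is the signal clients use to resync.
	wch := cli.Watch(ctx, "foo", clientv3.WithRev(preRestoreRev))
	wresp := <-wch
	if errors.Is(wresp.Err(), rpctypes.ErrCompacted) {
		fmt.Println("watch canceled as compacted at revision", wresp.CompactRevision)
	}
}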

Why is this needed?

With no official guidance, each vendor/administrator ended up with a different plan for etcd restore, many of which were broken or incorrect. Having an official, guided restore operation allows the whole community to work together and share experience, leading to improvements in Kubernetes disaster-recovery handling.

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 2
  • Comments: 19 (19 by maintainers)

Most upvoted comments

Just had a quick discussion with @serathius.

There are two things.

  1. Bumping the revision. We are on the same page. The purpose is to make sure the revision will never decrease.
  2. Compaction. There are two options:
    • @serathius's suggestion: “marking old revisions as compacted”. It's actually just setting the scheduledCompactRev field in the Meta bucket. If finishedCompactRev != scheduledCompactRev, it means the previous compaction did not finish, so when etcd starts it will try to finish the compaction operation (which is why there is no performance difference between the two approaches). It's the minimal change to etcdutl, and zero change to etcdserver. The only concern is that it's a little hacky, so we need to test it carefully. Let's follow this approach for now.
    • @ahrtr's suggestion: offline compaction, just as I mentioned in https://github.com/etcd-io/etcd/issues/16028#issuecomment-1581737360. We already have offline defrag, so it's natural to enhance etcdutl to support offline compaction. We can discuss this separately. FYI, we might get rid of defrag completely after https://github.com/etcd-io/bbolt/issues/422 is resolved, but we will always need offline compaction.

From an etcd cluster administrator's perspective, I would like the etcdctl snapshot restore flag to be named desired-compacted-revision.

I don't want to calculate how much the revision should be bumped since the last etcd snapshot; I just want to supply the desired resource version to be compacted, based on what I observe in the apiserver logs.

As a mitigation, suffixing the snapshot db file name with the revision and doing the calculation afterwards also works… it's just not as convenient as supplying the desired value directly.
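Purely as an illustration (the numbers below are made up for the example, not taken from this issue), the difference between the two flag shapes is just who performs this arithmetic:

package main

import "fmt"

func main() {
	const (
		snapshotRev     int64 = 1_200_000 // revision recorded for the snapshot (e.g. in the file name)
		lastObservedRev int64 = 1_450_000 // highest revision seen in apiserver logs after the snapshot
		headroom        int64 = 100_000   // margin for writes that were never observed
	)

	// With only --bump-revision, the administrator does the calculation by hand:
	fmt.Println("--bump-revision", lastObservedRev-snapshotRev+headroom)

	// With a desired-compacted-revision style flag, they would instead pass the
	// target revision directly and let the tool derive the bump from the snapshot:
	fmt.Println("desired compacted revision:", lastObservedRev+headroom)
}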

I've been able to verify that marking a revision as compacted via mvcc.UnsafeSetScheduledCompact(tx, revision) does cause the server to think it has a compaction in progress, which it will finish (sometime on startup after the restore), and it causes the watchers to get canceled with the "required revision has been compacted" error.

The following is from running the server after restoring with the most recent changes from my draft PR, using the --bump-revision 1000000 --mark-compacted arguments:

{"caller":"mvcc/kvstore.go:385","msg":"kvstore restored","current-rev":1000004}
{"caller":"mvcc/kvstore.go:392","msg":"resume scheduled compaction","scheduled-compact-revision":1000004}
{"caller":"mvcc/index.go:194","msg":"compact tree index","revision":1000004}
{"caller":"mvcc/kvstore_compaction.go:69","msg":"finished scheduled compaction","compact-revision":1000004,"took":"3.403894ms","hash":1818873421}

Thanks for all the discussion in the meantime. I think I'm a little further along in my understanding and will try to summarize in more detailed pieces.

Let's look at this from a K8s API point of view first. We want to ensure that all operators/controllers consuming the watchlist API receive a signal to invalidate their caches. Working backwards from the official docs, we can find that a 410 should be returned to signal a fresh listing from the start.

That, in turn, is always triggered by etcd returning etcdrpc.ErrCompacted when using etcd watches.
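To make that concrete, here is a hedged sketch of the consumer-side loop at the etcd level (the apiserver performs the equivalent and surfaces it to its own clients as a 410); the endpoint and key prefix below are placeholders:

package main

import (
	"context"
	"errors"
	"log"
	"time"

	"go.etcd.io/etcd/api/v3/v3rpc/rpctypes"
	clientv3 "go.etcd.io/etcd/client/v3"
)

// watchWithResync watches prefix from fromRev and, when the watch is canceled
// with ErrCompacted (for example because fromRev fell between backup and
// restore), relists to rebuild its cache and re-watches from the fresh revision.
func watchWithResync(cli *clientv3.Client, prefix string, fromRev int64) {
	for {
		ctx, cancel := context.WithCancel(context.Background())
		wch := cli.Watch(ctx, prefix, clientv3.WithPrefix(), clientv3.WithRev(fromRev))
		for wresp := range wch {
			if errors.Is(wresp.Err(), rpctypes.ErrCompacted) {
				// The requested revision no longer exists: relist from the
				// current revision and restart the watch from there.
				resp, err := cli.Get(context.Background(), prefix, clientv3.WithPrefix())
				if err != nil {
					log.Fatal(err)
				}
				fromRev = resp.Header.Revision + 1
				break
			}
			for _, ev := range wresp.Events {
				log.Printf("event %s %s at revision %d", ev.Type, ev.Kv.Key, ev.Kv.ModRevision)
			}
			fromRev = wresp.Header.Revision + 1
		}
		cancel()
	}
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	watchWithResync(cli, "/registry/", 1) // placeholder prefix
}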


From the etcd point of view, we want to ensure that we have a high enough revision to start from after restoring from a snapshot, before we serve any data to the outside world. Yet we also need to ensure that we don't serve any history from between the backup revision (BR) and the restored revision (NR, basically BR + a big enough integer). Conveniently, this is exactly what compaction does.

For that, do we really need to run a full compaction? Or is it enough to mark a revision range as already compacted?

I think there are multiple approaches here. One that's hacky would be to use UnsafeSetScheduledCompact to bump the scheduledCompactRev schema key. Not sure if this will already do the trick; @dusk125 wanted to check that out next. Going through the code, there's also the option of changing compactMainRev in mvcc's kvstore.go.

Rewinding a little: how do we establish the revision in the snapshot?

Similar to Allen's approach, we discussed looping through each key in the snapshot in O(n) to establish which revision is the highest. If you look into kvstore.go, it already does that. I think there were some reservations about the performance impact of its linear runtime.

Since creating a snapshot already requires iteration, we would also propose caching this information in the snapshot proto as additional/trailing metadata. When contained in a snapshot, it would allow us to skip iterating the whole key space when restoring.
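As a hedged illustration (not the code kvstore.go actually uses), assuming the current bbolt layout in which the "key" bucket's keys begin with the 8-byte big-endian main revision and therefore sort by revision, the highest revision in a snapshot db can be read offline like this; kvstore.go iterates all keys because it also rebuilds the in-memory index, whereas for just the highest revision the last key is enough:

package main

import (
	"encoding/binary"
	"fmt"
	"log"
	"os"

	bolt "go.etcd.io/bbolt"
)

func main() {
	dbPath := os.Args[1] // path to the snapshot .db file

	db, err := bolt.Open(dbPath, 0o600, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var latestRev int64
	err = db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("key"))
		if b == nil {
			return fmt.Errorf("no key bucket found in %s", dbPath)
		}
		// Keys sort by their revision bytes, so the cursor's last entry is the newest.
		k, _ := b.Cursor().Last()
		if len(k) >= 8 {
			latestRev = int64(binary.BigEndian.Uint64(k[:8]))
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("highest main revision in snapshot:", latestRev)
}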

I think we already have etcdctl compaction <rev> that does that, albeit online.

Maybe before we go into the debate on what we need to implement for compaction: do we even need to compact? I think what Marek meant is that we return the current revision as “compacted” to force a reset of the informer caches, not that we do an actual compaction.