etcd: Corrupted data after using rollback tool

I’m performing the following operations:

  1. Start etcd (as storage for Kubernetes) in version 3.0.14
  2. Write data in v3 via the kube-apiserver
  3. Kill etcd and roll the data back to v2
  4. Start etcd in version 2.3.7 and check that it works; kill etcd
  5. Start etcd in version 3.0.14 on a random port (to avoid any writes) with the v2 data, make sure it works, and kill it (this step is optional in this particular workflow, but it’s needed in some other scenarios)
  6. Migrate the data to v3
  7. Start etcd 3.0.14, create a lease, attach all keys to it (see the sketch after this list), and kill etcd
  8. Start etcd in version 3.0.14 and use the v3 data
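
For reference, step 7 is done against the etcd v3 API. Below is a minimal sketch using the Go clientv3 package, assuming the migrated instance listens on 127.0.0.1:2379 and that all keys live under a hypothetical /registry prefix; the lease TTL is an arbitrary example value, not the one used in the real workflow.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Assumes the migrated etcd 3.0.14 instance is reachable on 127.0.0.1:2379.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Create a lease; 3600s is an arbitrary example TTL.
	lease, err := cli.Grant(ctx, 3600)
	if err != nil {
		log.Fatal(err)
	}

	// Fetch all keys under a hypothetical prefix and re-put them attached to the lease.
	resp, err := cli.Get(ctx, "/registry", clientv3.WithPrefix())
	if err != nil {
		log.Fatal(err)
	}
	for _, kv := range resp.Kvs {
		if _, err := cli.Put(ctx, string(kv.Key), string(kv.Value), clientv3.WithLease(lease.ID)); err != nil {
			log.Fatal(err)
		}
	}
}
```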

Unfortunately, for some reason, steps 7 and 8 (and 6, IIRC) sometimes fail with the following error:

2017-02-14 13:34:53.369897 C | etcdserver: read wal error (walpb: crc mismatch) and cannot be repaired

I’m attaching the WAL file that it keeps failing on.

0000000000000000-0000000000000000.wal.gz
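
To narrow down where the mismatch occurs, the attached WAL can be replayed offline with the wal package from the etcd tree. This is a minimal sketch, assuming the 3.0.x wal package API and a hypothetical data directory path; ReadAll should fail with the same crc mismatch at the offending record.

```go
package main

import (
	"fmt"
	"log"

	"github.com/coreos/etcd/wal"
	"github.com/coreos/etcd/wal/walpb"
)

func main() {
	// Hypothetical path to the member's WAL directory containing the attached file.
	const walDir = "/var/etcd/data/member/wal"

	// OpenForRead opens the WAL read-only from the zero snapshot,
	// i.e. from the beginning of the log.
	w, err := wal.OpenForRead(walDir, walpb.Snapshot{})
	if err != nil {
		log.Fatal(err)
	}
	defer w.Close()

	// ReadAll replays every record; a CRC mismatch surfaces here as an error
	// (the same condition etcdserver reports as "walpb: crc mismatch").
	_, state, ents, err := w.ReadAll()
	if err != nil {
		log.Fatalf("wal read error: %v", err)
	}
	fmt.Printf("term=%d commit=%d entries=%d\n", state.Term, state.Commit, len(ents))
}
```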

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 15 (7 by maintainers)

Most upvoted comments

@wojtek-t Term comes directly from Raft; it represents an election.

Judging by the WAL data (the sizes aren’t padded and the corrupt entry doesn’t clobber anything; it’s directly appended, suggesting 2.3.x’s O_APPEND), there’s a 2.3.x process writing the new entries. Likewise, the file locking logic in 2.3.x is unconvincing, to say the least. It’s possible 2.3.7 is still running or relaunching against the same WAL file during the migration.
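
To illustrate the locking concern, here is a minimal sketch of exclusive, non-blocking file locking with syscall.Flock. This is not etcd’s actual locking code, and the WAL path is hypothetical; with locking like this, a second writer (for example a lingering 2.3.7 process) would fail immediately instead of silently appending to the same file.

```go
package main

import (
	"log"
	"os"
	"syscall"
)

// tryLock takes an exclusive, non-blocking flock on the given file.
// If another process already holds the lock, it returns an error
// instead of allowing a second writer to append.
func tryLock(path string) (*os.File, error) {
	f, err := os.OpenFile(path, os.O_WRONLY, 0600)
	if err != nil {
		return nil, err
	}
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		f.Close()
		return nil, err
	}
	return f, nil
}

func main() {
	// Hypothetical path to the WAL file from the report.
	f, err := tryLock("/var/etcd/data/member/wal/0000000000000000-0000000000000000.wal")
	if err != nil {
		log.Fatalf("wal file appears to be in use: %v", err)
	}
	defer f.Close()
	log.Println("acquired exclusive lock")
}
```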