etcd: Corrupted data after using rollback tool

I’m performing the following operations:

  1. Start etcd (as storage for Kubernetes) in version 3.0.14
  2. Write data in v3 via the kube-apiserver
  3. Kill etcd and roll the data back to v2
  4. Start etcd in version 2.3.7 and check that it works; kill etcd
  5. Start etcd in version 3.0.14 on a random port (to avoid any writes) with the v2 data, make sure it works, and kill it (this step is optional in this particular workflow, but it’s needed in some other scenarios)
  6. Migrate the data to v3
  7. Start etcd 3.0.14, create a lease, attach all keys to it (see the sketch after this list), and kill etcd
  8. Start etcd in version 3.0.14 and use the v3 data
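
For reference, step 7 is done against the etcd v3 API. Below is a minimal sketch using the Go clientv3 package, assuming the migrated instance listens on 127.0.0.1:2379 and that all keys live under a hypothetical /registry prefix; the lease TTL is an arbitrary example value, not the one used in the real workflow.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Assumes the migrated etcd 3.0.14 instance is reachable on 127.0.0.1:2379.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Create a lease; 3600s is an arbitrary example TTL.
	lease, err := cli.Grant(ctx, 3600)
	if err != nil {
		log.Fatal(err)
	}

	// Fetch all keys under a hypothetical prefix and re-put them attached to the lease.
	resp, err := cli.Get(ctx, "/registry", clientv3.WithPrefix())
	if err != nil {
		log.Fatal(err)
	}
	for _, kv := range resp.Kvs {
		if _, err := cli.Put(ctx, string(kv.Key), string(kv.Value), clientv3.WithLease(lease.ID)); err != nil {
			log.Fatal(err)
		}
	}
}
```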

Unfortunately, for some reason, steps 7 and 8 (and 6, IIRC) sometimes fail with the following error:

2017-02-14 13:34:53.369897 C | etcdserver: read wal error (walpb: crc mismatch) and cannot be repaired

I’m attaching the WAL file that it keeps failing on.

0000000000000000-0000000000000000.wal.gz
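
To narrow down where the mismatch occurs, the attached WAL can be replayed offline with the wal package from the etcd tree. This is a minimal sketch, assuming the 3.0.x wal package API and a hypothetical data directory path; ReadAll should fail with the same crc mismatch at the offending record.

```go
package main

import (
	"fmt"
	"log"

	"github.com/coreos/etcd/wal"
	"github.com/coreos/etcd/wal/walpb"
)

func main() {
	// Hypothetical path to the member's WAL directory containing the attached file.
	const walDir = "/var/etcd/data/member/wal"

	// OpenForRead opens the WAL read-only from the zero snapshot,
	// i.e. from the beginning of the log.
	w, err := wal.OpenForRead(walDir, walpb.Snapshot{})
	if err != nil {
		log.Fatal(err)
	}
	defer w.Close()

	// ReadAll replays every record; a CRC mismatch surfaces here as an error
	// (the same condition etcdserver reports as "walpb: crc mismatch").
	_, state, ents, err := w.ReadAll()
	if err != nil {
		log.Fatalf("wal read error: %v", err)
	}
	fmt.Printf("term=%d commit=%d entries=%d\n", state.Term, state.Commit, len(ents))
}
```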

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 15 (7 by maintainers)

Most upvoted comments

@wojtek-t Term comes directly from Raft; it represents an election.

Judging by the WAL data (the sizes aren’t padded and the corrupt entry doesn’t clobber anything; it’s directly appended, suggesting 2.3.x’s O_APPEND), there’s a 2.3.x process writing the new entries. Likewise, the file locking logic in 2.3.x is unconvincing, to say the least. It’s possible 2.3.7 is still running or relaunching against the same WAL file during the migration.
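
To illustrate the locking concern, here is a minimal sketch of exclusive, non-blocking file locking with syscall.Flock. This is not etcd’s actual locking code, and the WAL path is hypothetical; with locking like this, a second writer (for example a lingering 2.3.7 process) would fail immediately instead of silently appending to the same file.

```go
package main

import (
	"log"
	"os"
	"syscall"
)

// tryLock takes an exclusive, non-blocking flock on the given file.
// If another process already holds the lock, it returns an error
// instead of allowing a second writer to append.
func tryLock(path string) (*os.File, error) {
	f, err := os.OpenFile(path, os.O_WRONLY, 0600)
	if err != nil {
		return nil, err
	}
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		f.Close()
		return nil, err
	}
	return f, nil
}

func main() {
	// Hypothetical path to the WAL file from the report.
	f, err := tryLock("/var/etcd/data/member/wal/0000000000000000-0000000000000000.wal")
	if err != nil {
		log.Fatalf("wal file appears to be in use: %v", err)
	}
	defer f.Close()
	log.Println("acquired exclusive lock")
}
```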