rke: Stale/old etcd-checksum-checker causes etcd snapshot-restore to fail by showing `etcd snapshots are not consistent`

I noticed this while trying to restore an etcd snapshot using `rke etcd snapshot-restore --name 2022-06-29T10:11:10Z_etcd`. It always fails during the checksum check with the following:

INFO[0015] Waiting for [etcd-checksum-checker] container to exit on host [10.20.213.11]
INFO[0015] Container [etcd-checksum-checker] is still running on host [10.20.213.11]: stderr: [snapshot file does not exist
], stdout: []
INFO[0016] Waiting for [etcd-checksum-checker] container to exit on host [10.20.213.11]
FATA[0016] etcd snapshots are not consistent

Inspecting the docker command shows why: the snapshot directory is prepended to a path that already contains it, so the path appears twice:

"Cmd": [
                "sh",
                "-c",
                " if [ -f '/opt/rke/etcd-snapshots//opt/rke/etcd-snapshots/2022-06-21T01:16:39Z_etcd.zip' ]; then md5sum '/opt/rke/etcd-snapshots//opt/rke/etcd-snapshots/2022-06-21T01:16:39Z_etcd.zip' | cut -f1 -d' ' | tr -d '\n'; else echo 'snapshot file does not exist' >&2; fi"
            ]
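The doubled path can be reproduced in isolation. This is a minimal sketch (using hypothetical `/tmp` paths, not RKE's actual code) of what happens when the snapshot directory is prepended to a name that already contains it:

```shell
# Hypothetical reproduction: NAME already carries the snapshot directory,
# so prepending DIR again yields the doubled path seen in the Cmd above.
DIR=/tmp/etcd-snapshots
NAME="$DIR/2022-06-21_etcd.zip"
mkdir -p "$DIR"
echo data > "$NAME"

BAD="$DIR/$NAME"   # -> /tmp/etcd-snapshots//tmp/etcd-snapshots/2022-06-21_etcd.zip
if [ -f "$BAD" ]; then
    md5sum "$BAD" | cut -f1 -d' '
else
    echo 'snapshot file does not exist' >&2   # this branch runs
fi

# Using the name as-is finds the file and prints its checksum:
if [ -f "$NAME" ]; then
    md5sum "$NAME" | cut -f1 -d' ' | tr -d '\n'
fi
```

The same doubled path appears in the container's `Cmd` above, which is why the checker reports a missing snapshot even though the file exists.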

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 35 (15 by maintainers)

Most upvoted comments

I have

  • `rke remove`
  • `docker rm -f $(docker ps -a -q)` on each node
  • reboot each node
  • `rke etcd snapshot-restore --config cluster.yml --name 2022-11-16T03:42:54Z_etcd`

Each of the pods seems to be broken; I’ll dig into it, but everything restarted this time … it was indeed a stale docker container that had been left behind …

I’ll have to look into it.

Thanks a lot 🙏 @gmanera, for your inputs!

@superseb Thx a lot for your time. That’s twice now you’ve saved me from some pretty big mistakes!

@superseb, I’m glad to report that I’ve solved the issue. I removed the old etcd-checksum-checker container, and the restore worked just fine.

Now I get the following error message:

FATA[0016] [etcd] Failed to restore etcd snapshot: Failed to run etcd restore container, exit status is: 1, container logs: {"level":"info","ts":1668799464.946725,"caller":"snapshot/v3_snapshot.go:296","msg":"restoring snapshot","path":"/opt/rke/etcd-snapshots/2022-11-18T14:28:35Z_etcd.zip","wal-dir":"/opt/rke/etcd-snapshots-restore/member/wal","data-dir":"/opt/rke/etcd-snapshots-restore/","snap-dir":"/opt/rke/etcd-snapshots-restore/member/snap"}
Error: snapshot missing hash but --skip-hash-check=false
[rancher@praiaflorida etcd-snapshots]$

Thanks.
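For what it’s worth, the `snapshot missing hash` error comes from etcd’s integrity check: snapshots taken with `etcdctl snapshot save` end with a trailing hash, while copies of the member data lack it, and `etcdctl snapshot restore` offers `--skip-hash-check` for that case. A hedged sketch of a manual restore (paths and file names are taken from the log above and are placeholders; the unzip step assumes the RKE snapshot is a zip archive, as the filename suggests):

```shell
# Manual restore sketch; requires etcdctl v3 on the node. Paths and
# names come from the log above and may need adjusting.
cd /opt/rke/etcd-snapshots
unzip 2022-11-18T14:28:35Z_etcd.zip          # assumption: snapshot is zipped
ETCDCTL_API=3 etcdctl snapshot restore 2022-11-18T14:28:35Z_etcd \
    --skip-hash-check \
    --data-dir /opt/rke/etcd-snapshots-restore
```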

@PierreBrisorgueil I would expect RKE to re-create the container, but as your log says `Starting stopped container [etcd-checksum-checker]`, it would help to delete this container manually as well and re-run the rke command.
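A sketch of that suggestion (the snapshot name is a placeholder):

```shell
# Remove the stale checker container so rke creates a fresh one with the
# correct command, then re-run the restore.
docker rm -f etcd-checksum-checker
rke etcd snapshot-restore --config cluster.yml --name 2022-11-16T03:42:54Z_etcd
```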

I downloaded the source, corrected/hardcoded the relevant portion, and built a binary for myself. (In reply to: “@gha-xena @knandras did you find a workaround?”)


I’ve found this: https://www.suse.com/support/kb/doc/?id=000020214. Worked just fine too. Thanks.

A short-term workaround, without addressing the root cause, would be to place the snapshot at the location it is looking for, I assume. What is the last version with which you successfully tested your restore scenario?

It would really help me to be able to bring this snapshot back up 😔

Ok, it seems that it is trying to use /opt/rke/etcd-snapshots/./snapshots/ as the directory instead of /opt/rke/etcd-snapshots/. I have to dig through some code to see where this could come from, but if you have any ideas, let me know. Is /opt/rke possibly mounted/symlinked, or anything else that could interfere with directories/paths? (`ls -ld /opt/rke && ls -ld /opt/rke/etcd-snapshots`)

For completeness, please share the complete log. Also, if possible, the output of `docker inspect etcd-checksum-checker` and the output of the following commands for every cluster node:

df /opt/rke/etcd-snapshots/
ls -la /opt/rke/etcd-snapshots/

@PierreBrisorgueil I deleted your comment because it contained sensitive info, please change your password(s) and post a redacted version

If this always failed, no one would be able to restore any snapshot. @gha-xena @knandras @PierreBrisorgueil please include your RKE version and cluster.yml to identify what is causing this. (And, if possible, the versions you used that did work.)

I will take a look at the code to see if anything changed or what could cause this.