rke: Stale/old etcd-checksum-checker causes etcd snapshot-restore to fail with `etcd snapshots are not consistent`
I noticed this while trying to restore an etcd snapshot using `rke etcd snapshot-restore --name 2022-06-29T10:11:10Z_etcd`. It always fails while checking the checksum, with the following:
INFO[0015] Waiting for [etcd-checksum-checker] container to exit on host [10.20.213.11]
INFO[0015] Container [etcd-checksum-checker] is still running on host [10.20.213.11]: stderr: [snapshot file does not exist
], stdout: []
INFO[0016] Waiting for [etcd-checksum-checker] container to exit on host [10.20.213.11]
FATA[0016] etcd snapshots are not consistent
Inspecting the docker command shows why: the snapshot path is included twice:
"Cmd": [
"sh",
"-c",
" if [ -f '/opt/rke/etcd-snapshots//opt/rke/etcd-snapshots/2022-06-21T01:16:39Z_etcd.zip' ]; then md5sum '/opt/rke/etcd-snapshots//opt/rke/etcd-snapshots/2022-06-21T01:16:39Z_etcd.zip' | cut -f1 -d' ' | tr -d '\n'; else echo 'snapshot file does not exist' >&2; fi"
]
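For comparison, a hedged sketch of what the check should do once the path is built correctly, i.e. with the snapshot directory prepended only once (the snapshot file name is taken from the inspect output above and is only a placeholder for your own snapshot):

# hypothetical corrected check: the directory prefix appears only once
SNAPSHOT='/opt/rke/etcd-snapshots/2022-06-21T01:16:39Z_etcd.zip'
if [ -f "$SNAPSHOT" ]; then
  # print only the md5 hash, without a trailing newline
  md5sum "$SNAPSHOT" | cut -f1 -d' ' | tr -d '\n'
else
  echo 'snapshot file does not exist' >&2
fi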
About this issue
- State: closed
- Created 2 years ago
- Comments: 35 (15 by maintainers)
I have
Each of the pods seems to be down; I’ll dig into it, but everything restarted this time … it was indeed a leftover docker container …
I’ll have to look into it.
Thanks a lot 🙏 @gmanera, for your inputs!
@superseb Thx a lot for your time. That’s twice now you’ve saved me from some pretty big mistakes!
@superseb, I’m glad to report that I solved the issue. I removed the old etcd-checksum-checker container and it worked just fine.
Now I have the following error message
Thanks.
@PierreBrisorgueil I would expect RKE to re-create the container, but as your log says Starting stopped container [etcd-checksum-checker], it would help to delete this container manually as well and re-run the rke command.

In reply to "@gha-xena @knandras did you find a workaround?": I downloaded the source, corrected/hardcoded the relevant portion, and built a binary for myself.
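For reference, a minimal sketch of the manual cleanup suggested above (remove the stale container, then re-run the restore), to be run on the affected node(s); the snapshot name below is the one from the original report and only a placeholder:

# remove the stale checksum-checker container so RKE can create a fresh one
docker rm -f etcd-checksum-checker
# then re-run the restore with your own snapshot name
rke etcd snapshot-restore --name 2022-06-29T10:11:10Z_etcd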
I’ve found this: https://www.suse.com/support/kb/doc/?id=000020214. Worked just fine too. Thanks.
A short-term workaround, without addressing the root cause, would be to place the snapshot at the location it is looking for, I assume. What is the last version you successfully tested your restore scenario with?
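A hedged sketch of that short-term workaround, assuming the checker really looks at the doubled path from the docker inspect output above (the double slash collapses to a single one on the filesystem) and that /opt/rke/etcd-snapshots is mounted into the container unchanged; the snapshot file name is again a placeholder:

# create the nested directory the broken command resolves to
mkdir -p /opt/rke/etcd-snapshots/opt/rke/etcd-snapshots
# copy the snapshot there so the existence/checksum check can find it
cp /opt/rke/etcd-snapshots/2022-06-21T01:16:39Z_etcd.zip /opt/rke/etcd-snapshots/opt/rke/etcd-snapshots/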
It would really help me to be able to bring this snapshot back up 😔
Ok, it seems that it is trying to use /opt/rke/etcd-snapshots/./snapshots/ as the directory instead of /opt/rke/etcd-snapshots/. I have to dig through some code to see where this could come from, but if you have any ideas, let me know. Is /opt/rke possibly mounted/symlinked or anything else that could interfere with directories/paths? (ls -ld /opt/rke && ls -ld /opt/rke/etcd-snapshots) For completeness, please share the complete log. Also, if possible, the output of docker inspect etcd-checksum-checker and the output of the following commands for every cluster node:

@PierreBrisorgueil I deleted your comment because it contained sensitive info, please change your password(s) and post a redacted version
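Collected for convenience, the diagnostics explicitly mentioned above, to be run on every cluster node:

# container definition, including mounts and the generated command
docker inspect etcd-checksum-checker
# check whether /opt/rke or the snapshot directory is a symlink or a separate mount
ls -ld /opt/rke && ls -ld /opt/rke/etcd-snapshots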
If this always failed, no one would be able to restore any snapshot. @gha-xena @knandras @PierreBrisorgueil please include the RKE version and cluster.yml so we can identify what is causing this (and, if possible, the versions that did work for you).
I will take a look at the code to see if anything changed or what could be causing this.
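A quick sketch of collecting the requested details, assuming the standard --version flag of the rke binary; remember to redact credentials and other sensitive values from cluster.yml before posting, as noted above:

# RKE version used for the restore attempt
rke --version
# cluster configuration (redact secrets before sharing)
cat cluster.yml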