rke: RKE etcd restore: Error: snapshot missing hash but --skip-hash-check=false
RKE version:
v0.2.6
Docker version:
Client:
  Version: 1.13.1
  API version: 1.26
  Package version: docker-1.13.1-60.git9cb56fd.fc28.x86_64
Operating system and kernel:
Fedora 28 (Atomic Host), 4.17.11-200.fc28.x86_64
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
OpenStack instance
cluster.yml file:
rancher-cluster.yml
nodes:
  - address: 10.57.241.146
    internal_address: 192.168.99.68
    user: fedora
    role: [controlplane,worker,etcd]
    ssh_key_path: /home/fedora/.ssh/id_rsa
  - address: 10.57.241.148
    internal_address: 192.168.99.70
    user: fedora
    role: [controlplane,worker,etcd]
    ssh_key_path: /home/fedora/.ssh/id_rsa
  - address: 10.57.241.149
    internal_address: 192.168.99.69
    user: fedora
    role: [controlplane,worker,etcd]
    ssh_key_path: /home/fedora/.ssh/id_rsa
private_registries:
  - url: 10.57.241.229:5000
    is_default: true
rancher-cluster-restore.yml (keeps only the destroyed node3)
nodes:
  - address: 10.57.241.149
    internal_address: 192.168.99.69
    user: fedora
    role: [controlplane,worker,etcd]
    ssh_key_path: /home/fedora/.ssh/id_rsa
private_registries:
  - url: 10.57.241.229:5000
    is_default: true
Steps to Reproduce:
- Create 3 nodes and set up Rancher HA.
- Run the command below to save a snapshot (a sketch for checking the resulting files on a node follows these steps):
  rke etcd snapshot-save --name 20190725-093400 --config rancher-cluster.yml
- Destroy node3 (10.57.241.149) and rebuild it.
- Run the command below to restore the snapshot for node3:
  rke etcd snapshot-restore --name 20190725-093400.zip --config rancher-cluster-restore.yml
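A quick way to confirm what snapshot-save actually wrote is to list the snapshot directory on one of the etcd hosts. This is only a sketch and assumes RKE's default local snapshot location /opt/rke/etcd-snapshots, reusing a node address and the SSH key from rancher-cluster.yml above:

  # List the snapshot files left behind by rke etcd snapshot-save.
  # Assumes the default RKE snapshot directory /opt/rke/etcd-snapshots.
  ssh -i /home/fedora/.ssh/id_rsa fedora@10.57.241.146 \
    'ls -l /opt/rke/etcd-snapshots/'

On disk the snapshot shows up as an archive (here 20190725-093400.zip), which is why the restore step above was run with the .zip name.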
Results:
[root@tpe-liberty-alex-fedora-1 restote]# rke etcd snapshot-restore --name 20190725-093400.zip --config rancher-cluster-restore.yml
INFO[0000] Restoring etcd snapshot 20190725-093400.zip
INFO[0000] Successfully Deployed state file at [./rancher-cluster-restore.rkestate]
INFO[0000] [dialer] Setup tunnel for host [10.57.241.149]
WARN[0011] failed to stop etcd container on host [10.57.241.149]: Can't stop Docker container [etcd] for host [10.57.241.149]: Error response from daemon: No such container: etcd
INFO[0011] [etcd] starting backup server on host [10.57.241.149]
INFO[0019] [etcd] Successfully started [etcd-Serve-backup] container on host [10.57.241.149]
INFO[0035] [remove/etcd-Serve-backup] Successfully removed container on host [10.57.241.149]
INFO[0035] [etcd] Checking if all snapshots are identical
INFO[0044] [etcd] Successfully started [etcd-checksum-checker] container on host [10.57.241.149]
INFO[0044] Waiting for [etcd-checksum-checker] container to exit on host [10.57.241.149]
INFO[0050] [etcd] Checksum of etcd snapshot on host [10.57.241.149] is [f586f0c56e06b56df9f63a0ff17e54dd]
INFO[0050] Cleaning old kubernetes cluster
INFO[0050] [worker] Tearing down Worker Plane..
INFO[0050] [worker] Successfully tore down Worker Plane..
INFO[0050] [controlplane] Tearing down the Controller Plane..
INFO[0050] [controlplane] Successfully tore down Controller Plane..
INFO[0050] [etcd] Tearing down etcd plane..
INFO[0050] [etcd] Successfully tore down etcd plane..
INFO[0050] [hosts] Cleaning up host [10.57.241.149]
INFO[0050] [hosts] Cleaning up host [10.57.241.149]
INFO[0050] [hosts] Running cleaner container on host [10.57.241.149]
INFO[0072] [kube-cleaner] Successfully started [kube-cleaner] container on host [10.57.241.149]
INFO[0072] Waiting for [kube-cleaner] container to exit on host [10.57.241.149]
INFO[0075] [hosts] Removing cleaner container on host [10.57.241.149]
INFO[0076] [hosts] Removing dead container logs on host [10.57.241.149]
INFO[0089] [cleanup] Successfully started [rke-log-cleaner] container on host [10.57.241.149]
INFO[0091] [remove/rke-log-cleaner] Successfully removed container on host [10.57.241.149]
INFO[0091] [hosts] Successfully cleaned up host [10.57.241.149]
INFO[0091] [etcd] Restoring [20190725-093400.zip] snapshot on etcd host [10.57.241.149]
INFO[0094] [etcd] Successfully started [etcd-restore] container on host [10.57.241.149]
INFO[0094] Waiting for [etcd-restore] container to exit on host [10.57.241.149]
INFO[0094] Container [etcd-restore] is still running on host [10.57.241.149]
INFO[0095] Waiting for [etcd-restore] container to exit on host [10.57.241.149]
INFO[0095] Container [etcd-restore] is still running on host [10.57.241.149]
INFO[0096] Waiting for [etcd-restore] container to exit on host [10.57.241.149]
INFO[0096] Container [etcd-restore] is still running on host [10.57.241.149]
INFO[0097] Waiting for [etcd-restore] container to exit on host [10.57.241.149]
FATA[0100] [etcd] Failed to restore etcd snapshot: Failed to run etcd restore container, exit status is: 128, container logs: Error: snapshot missing hash but --skip-hash-check=false
About this issue
- State: closed
- Created 5 years ago
- Comments: 16 (4 by maintainers)
The name of the snapshot is 20190725-093400, not including the extension. We never enforced a naming convention at the beginning, so we can't enforce or change anything now. The main issue is that snapshots used to be plain files rather than archives and we later switched, so we have to handle both formats and can't enforce either one. Manually stripping the extension or other magic is just going to be confusing, since it would happen automatically. That's why I asked whether a warning was enough in the beginning.
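Concretely, the restore in the report should reference the snapshot by the name it was saved with, without the .zip extension. A minimal corrected invocation, reusing the config file from the report:

  rke etcd snapshot-restore --name 20190725-093400 --config rancher-cluster-restore.yml

With the extension stripped, RKE can match the snapshot on the host whether it was stored as a plain file or as a .zip archive, which is the dual-format support described above.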
@sadiapoddar If you think we need more guidance in this situation, we can file another issue that covers fine-tuning the whole process. That would probably involve better checking for file existence (and possibly suggesting or trying multiple names based on the input), but that scope is much bigger than this issue and needs to be designed.
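For illustration only, the kind of file-existence check suggested here could look like the sketch below. This is a hypothetical wrapper, not existing RKE behavior; it assumes the default snapshot directory /opt/rke/etcd-snapshots and simply reports which candidate names (with and without the .zip extension) exist on an etcd host before a restore is attempted:

  #!/bin/sh
  # Hypothetical pre-flight check: given the value passed to --name,
  # report which snapshot files actually exist on the etcd host.
  NAME="20190725-093400.zip"
  HOST="fedora@10.57.241.149"
  DIR="/opt/rke/etcd-snapshots"   # assumed default RKE snapshot directory
  for candidate in "${NAME%.zip}" "${NAME%.zip}.zip"; do
    if ssh "$HOST" test -f "$DIR/$candidate"; then
      echo "found:   $DIR/$candidate"
    else
      echo "missing: $DIR/$candidate"
    fi
  done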