rke: RKE etcd snapshot restore: Error: snapshot missing hash but --skip-hash-check=false

RKE version: v0.2.6

Docker version:

Client:
 Version:         1.13.1
 API version:     1.26
 Package version: docker-1.13.1-60.git9cb56fd.fc28.x86_64

Operating system and kernel: Fedora 28 (Atomic Host), kernel 4.17.11-200.fc28.x86_64

Type/provider of hosts: OpenStack instance

cluster.yml files:

rancher-cluster.yml

nodes:
  - address: 10.57.241.146
    internal_address: 192.168.99.68
    user: fedora
    role: [controlplane,worker,etcd]
    ssh_key_path: /home/fedora/.ssh/id_rsa
  - address: 10.57.241.148
    internal_address: 192.168.99.70
    user: fedora
    role: [controlplane,worker,etcd]
    ssh_key_path: /home/fedora/.ssh/id_rsa
  - address: 10.57.241.149
    internal_address: 192.168.99.69
    user: fedora
    role: [controlplane,worker,etcd]
    ssh_key_path: /home/fedora/.ssh/id_rsa

private_registries:
  - url: 10.57.241.229:5000
    is_default: true

rancher-cluster-restore.yml (keeps only the destroyed node3)

nodes:
  - address: 10.57.241.149
    internal_address: 192.168.99.69
    user: fedora
    role: [controlplane,worker,etcd]
    ssh_key_path: /home/fedora/.ssh/id_rsa

private_registries:
  - url: 10.57.241.229:5000
    is_default: true

Steps to Reproduce:

  1. Create 3 nodes and set up Rancher HA.
  2. Run the command below to save a snapshot (the resulting file on the etcd nodes can be checked as sketched after this list): rke etcd snapshot-save --name 20190725-093400 --config rancher-cluster.yml
  3. Destroy node3 (10.57.241.149) and rebuild it.
  4. Run the command below to restore the snapshot on node3: rke etcd snapshot-restore --name 20190725-093400.zip --config rancher-cluster-restore.yml
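Before restoring, it can help to confirm the exact file name the save step produced on the etcd nodes. A minimal check, assuming RKE's default snapshot location of /opt/rke/etcd-snapshots/ on each etcd node (the path and host used here are assumptions, not verified output):

# On one of the etcd nodes (10.57.241.149 is used here as an example),
# list the snapshot directory to see the exact file name on disk and
# whether it was written as a plain snapshot or as a .zip archive.
ssh fedora@10.57.241.149 'ls -lh /opt/rke/etcd-snapshots/'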

Results:

[root@tpe-liberty-alex-fedora-1 restote]# rke etcd snapshot-restore --name 20190725-093400.zip --config rancher-cluster-restore.yml
INFO[0000] Restoring etcd snapshot 20190725-093400.zip
INFO[0000] Successfully Deployed state file at [./rancher-cluster-restore.rkestate]
INFO[0000] [dialer] Setup tunnel for host [10.57.241.149]
WARN[0011] failed to stop etcd container on host [10.57.241.149]: Can't stop Docker container [etcd] for host [10.57.241.149]: Error response from daemon: No such container: etcd
INFO[0011] [etcd] starting backup server on host [10.57.241.149]
INFO[0019] [etcd] Successfully started [etcd-Serve-backup] container on host [10.57.241.149]
INFO[0035] [remove/etcd-Serve-backup] Successfully removed container on host [10.57.241.149]
INFO[0035] [etcd] Checking if all snapshots are identical
INFO[0044] [etcd] Successfully started [etcd-checksum-checker] container on host [10.57.241.149]
INFO[0044] Waiting for [etcd-checksum-checker] container to exit on host [10.57.241.149]
INFO[0050] [etcd] Checksum of etcd snapshot on host [10.57.241.149] is [f586f0c56e06b56df9f63a0ff17e54dd]
INFO[0050] Cleaning old kubernetes cluster
INFO[0050] [worker] Tearing down Worker Plane..
INFO[0050] [worker] Successfully tore down Worker Plane..
INFO[0050] [controlplane] Tearing down the Controller Plane..
INFO[0050] [controlplane] Successfully tore down Controller Plane..
INFO[0050] [etcd] Tearing down etcd plane..
INFO[0050] [etcd] Successfully tore down etcd plane..
INFO[0050] [hosts] Cleaning up host [10.57.241.149]
INFO[0050] [hosts] Cleaning up host [10.57.241.149]
INFO[0050] [hosts] Running cleaner container on host [10.57.241.149]
INFO[0072] [kube-cleaner] Successfully started [kube-cleaner] container on host [10.57.241.149]
INFO[0072] Waiting for [kube-cleaner] container to exit on host [10.57.241.149]
INFO[0075] [hosts] Removing cleaner container on host [10.57.241.149]
INFO[0076] [hosts] Removing dead container logs on host [10.57.241.149]
INFO[0089] [cleanup] Successfully started [rke-log-cleaner] container on host [10.57.241.149]
INFO[0091] [remove/rke-log-cleaner] Successfully removed container on host [10.57.241.149]
INFO[0091] [hosts] Successfully cleaned up host [10.57.241.149]
INFO[0091] [etcd] Restoring [20190725-093400.zip] snapshot on etcd host [10.57.241.149]
INFO[0094] [etcd] Successfully started [etcd-restore] container on host [10.57.241.149]
INFO[0094] Waiting for [etcd-restore] container to exit on host [10.57.241.149]
INFO[0094] Container [etcd-restore] is still running on host [10.57.241.149]
INFO[0095] Waiting for [etcd-restore] container to exit on host [10.57.241.149]
INFO[0095] Container [etcd-restore] is still running on host [10.57.241.149]
INFO[0096] Waiting for [etcd-restore] container to exit on host [10.57.241.149]
INFO[0096] Container [etcd-restore] is still running on host [10.57.241.149]
INFO[0097] Waiting for [etcd-restore] container to exit on host [10.57.241.149]
FATA[0100] [etcd] Failed to restore etcd snapshot: Failed to run etcd restore container, exit status is: 128, container logs: Error: snapshot missing hash but --skip-hash-check=false
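For context, the final error text comes from etcdctl snapshot restore (run by the etcd-restore container): snapshots written by etcdctl snapshot save carry an appended sha256 hash, and restore refuses a file without one unless --skip-hash-check is set. A minimal way to inspect a snapshot file by hand, assuming it has been copied locally as snapshot.db (the file name is a placeholder):

# Show the snapshot's hash, revision, key count and size.
ETCDCTL_API=3 etcdctl snapshot status snapshot.db --write-out=table

# Restoring a file that was not produced by `etcdctl snapshot save`
# (and therefore has no embedded hash) requires --skip-hash-check:
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db --skip-hash-check --data-dir ./restored-data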

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 16 (4 by maintainers)

Most upvoted comments

The name of the snapshot is 20190725-093400, not including the extension.
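That is, with the setup from the report above, the restore step would pass the bare name rather than the archive file name (a sketch based on the commands in the report, not verified output from this environment):

# Pass the snapshot name exactly as it was given to snapshot-save,
# without the .zip extension of the archive on disk.
rke etcd snapshot-restore --name 20190725-093400 --config rancher-cluster-restore.yml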

We never enforced a naming convention in the beginning, so we can't enforce or change one now. The main issue is that snapshots used to be plain files rather than archives and we later switched, so we need to deal with both options and can't mandate either format. Manually stripping the extension or doing other magic is just going to be confusing, since it would happen automatically. That's why I asked at the start whether a warning was enough.
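For anyone checking by hand which of the two forms a given snapshot took on a node, a rough sketch (the snapshot directory and names below are assumptions based on this report, not RKE's own logic):

# Hypothetical check: does the snapshot exist as a plain file, a .zip
# archive, or both, under the assumed default snapshot directory?
SNAP_DIR=/opt/rke/etcd-snapshots
NAME=20190725-093400

for candidate in "$NAME" "$NAME.zip"; do
  if [ -e "$SNAP_DIR/$candidate" ]; then
    echo "found: $SNAP_DIR/$candidate"
  fi
done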

@sadiapoddar If you think we need more guidance in this situation, we can file another issue covering fine-tuning of the whole process. That would probably involve better checking for file existence (and possibly suggesting or trying multiple names based on the input), but that scope is much bigger than this issue and needs to be designed.