rke: Failed to get /health for host - remote error: tls: bad certificate

Getting `Failed to get /health for host - remote error: tls: bad certificate` when trying to upgrade an existing cluster. No modifications have been made to the certificates.

RKE version: rke version v0.2.1

Docker version:

```
Client:
 Version:           18.06.3-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        d7080c1
 Built:             Wed Feb 20 02:27:18 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.3-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       d7080c1
  Built:            Wed Feb 20 02:26:20 2019
  OS/Arch:          linux/amd64
  Experimental:     false
```

Operating system and kernel: Ubuntu 16.04.4 LTS (Xenial Xerus), kernel 4.4.0-116-generic

Type/provider of hosts: ESXi virtual machine

cluster.yml file:

```
nodes:
  - address: 10.10.7.121
    user: daniel
    role: [controlplane,worker,etcd]
  - address: 10.10.7.122
    user: daniel
    role: [controlplane,worker,etcd]
  - address: 10.10.7.123
    user: daniel
    role: [controlplane,worker,etcd]

services:
  etcd:
    snapshot: true
    creation: 6h
    retention: 24h
```

Steps to Reproduce: `./rke -d up`

Results:

```
...
DEBU[0028] [remove/rke-log-linker] Container doesn't exist on host [10.10.7.123] 
DEBU[0028] [etcd] Checking image [rancher/rke-tools:v0.1.27] on host [10.10.7.123] 
DEBU[0028] Checking if image [rancher/rke-tools:v0.1.27] exists on host [10.10.7.123] 
DEBU[0028] Image [rancher/rke-tools:v0.1.27] exists on host [10.10.7.123] 
DEBU[0028] [etcd] No pull necessary, image [rancher/rke-tools:v0.1.27] exists on host [10.10.7.123] 
INFO[0029] [etcd] Successfully started [rke-log-linker] container on host [10.10.7.123] 
DEBU[0029] [remove/rke-log-linker] Checking if container is running on host [10.10.7.123] 
DEBU[0029] [remove/rke-log-linker] Removing container on host [10.10.7.123] 
INFO[0029] [remove/rke-log-linker] Successfully removed container on host [10.10.7.123] 
DEBU[0029] [etcd] Successfully created log link for Container [etcd] on host [10.10.7.123] 
INFO[0029] [etcd] Successfully started etcd plane.. Checking etcd cluster health 
DEBU[0029] [etcd] Check etcd cluster health             
DEBU[0029] Failed to get /health for host [10.10.7.121]: Get https://10.10.7.121:2379/health: remote error: tls: bad certificate 
DEBU[0034] Failed to get /health for host [10.10.7.121]: Get https://10.10.7.121:2379/health: remote error: tls: bad certificate 
DEBU[0039] Failed to get /health for host [10.10.7.121]: Get https://10.10.7.121:2379/health: remote error: tls: bad certificate 
DEBU[0044] [etcd] Check etcd cluster health             
DEBU[0045] Failed to get /health for host [10.10.7.122]: Get https://10.10.7.122:2379/health: remote error: tls: bad certificate 
DEBU[0050] Failed to get /health for host [10.10.7.122]: Get https://10.10.7.122:2379/health: remote error: tls: bad certificate 
DEBU[0055] Failed to get /health for host [10.10.7.122]: Get https://10.10.7.122:2379/health: remote error: tls: bad certificate 
DEBU[0060] [etcd] Check etcd cluster health             
DEBU[0060] Failed to get /health for host [10.10.7.123]: Get https://10.10.7.123:2379/health: remote error: tls: bad certificate 
DEBU[0065] Failed to get /health for host [10.10.7.123]: Get https://10.10.7.123:2379/health: remote error: tls: bad certificate 
DEBU[0070] Failed to get /health for host [10.10.7.123]: Get https://10.10.7.123:2379/health: remote error: tls: bad certificate 
FATA[0075] [etcd] Failed to bring up Etcd Plane: [etcd] Etcd Cluster is not healthy 
```
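The `remote error: tls: bad certificate` response means the etcd server rejected the client certificate RKE presented, typically because it was signed by a different CA than the one etcd trusts. The mechanism can be demonstrated locally with openssl (a self-contained sketch, not specific to RKE's certificate paths):

```shell
# Demonstrate a CA mismatch: a certificate signed by one CA fails
# verification against a different CA -- the situation behind
# "remote error: tls: bad certificate".
set -e
workdir=$(mktemp -d)
cd "$workdir"

# Two independent CAs
openssl req -x509 -newkey rsa:2048 -nodes -keyout ca1.key -out ca1.pem \
  -days 1 -subj "/CN=ca-one" 2>/dev/null
openssl req -x509 -newkey rsa:2048 -nodes -keyout ca2.key -out ca2.pem \
  -days 1 -subj "/CN=ca-two" 2>/dev/null

# A client certificate signed by the first CA
openssl req -newkey rsa:2048 -nodes -keyout client.key -out client.csr \
  -subj "/CN=kube-etcd" 2>/dev/null
openssl x509 -req -in client.csr -CA ca1.pem -CAkey ca1.key \
  -CAcreateserial -out client.pem -days 1 2>/dev/null

openssl verify -CAfile ca1.pem client.pem          # prints "client.pem: OK"
openssl verify -CAfile ca2.pem client.pem \
  || echo "mismatch: this is what etcd reports as a bad certificate"
```

Against a live node the analogous check is `openssl verify -CAfile /etc/kubernetes/ssl/kube-ca.pem <etcd client cert>`; `/etc/kubernetes/ssl` is RKE's default certificate directory, so adjust paths if yours differ.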

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 28 (6 by maintainers)

Most upvoted comments

Per conversation with @Oats87, I have now reproduced a cause for this error upon attempted upgrade of a cluster via RKE v0.2.0 or v0.2.1.

If the `kube_config_<file>.yml` file is absent from the local directory when you run `rke up`, RKE treats the cluster as new rather than as a legacy cluster. This results in the fatal error `[etcd] Failed to bring up Etcd Plane: [etcd] Etcd Cluster is not healthy`, with debug messages of the form `Failed to get /health for host [10.10.7.123]: Get https://10.10.7.123:2379/health: remote error: tls: bad certificate`.

Reproducer

  1. Instantiate a simple single-node cluster with `rke up` using RKE v0.1.7
  2. Remove the `kube_config_<file>.yml` file
  3. Attempt to upgrade the cluster via `rke -d up` using RKE v0.2.0 or v0.2.1
  4. Observe the `[etcd] Failed to bring up Etcd Plane: [etcd] Etcd Cluster is not healthy` error with `/health: remote error: tls: bad certificate` messages.
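Condensed into commands, the reproducer looks roughly like this (the binary names and the generated kubeconfig filename are illustrative; RKE derives `kube_config_<file>.yml` from the name of your cluster config file):

```shell
# 1. Bring up a single-node cluster with an old RKE release
./rke_v0.1.7 up --config cluster.yml

# 2. Simulate the lost kubeconfig
rm kube_config_cluster.yml

# 3. Attempt the upgrade with a newer release; the etcd health checks fail
#    with "remote error: tls: bad certificate" and the etcd plane never
#    comes up
./rke_v0.2.1 up --config cluster.yml --debug
```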

Workaround

Upon encountering this issue as a result of the missing `kube_config_<file>.yml` during upgrade, the following workaround can be used:

```
# Remove your `<file>.rkestate` file

# Log into all of your control plane nodes and run:
rm -f /etc/kubernetes/ssl/kube-service-account-token-key.pem
rm -f /etc/kubernetes/ssl/kube-service-account-token.pem
cp /etc/kubernetes/ssl/kube-apiserver-key.pem /etc/kubernetes/ssl/kube-service-account-token-key.pem
cp /etc/kubernetes/ssl/kube-apiserver.pem /etc/kubernetes/ssl/kube-service-account-token.pem

# Run an `rke up` with RKE 0.1.17

# Run an `rke up` with RKE 0.2.0/0.2.1
```
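Since the workaround deletes state files and certificates, it is worth taking an etcd snapshot first. RKE ships a one-off snapshot command (flags shown are the common ones; verify against your version with `./rke etcd snapshot-save --help`):

```shell
# Save a named etcd snapshot (stored under /opt/rke/etcd-snapshots on the
# etcd nodes by default) before touching certificates or state files
./rke etcd snapshot-save --config cluster.yml --name pre-cert-fix
```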

Getting the same issue on CentOS Linux release 8.3.2011, Docker 19.03.5, RKE v1.2.5:

```
WARN[0212] [etcd] host [rke1.###.net] failed to check etcd health: failed to get /health for host [rke1.###.###]: Get "https://rke1.###.net:2379/health": remote error: tls: bad certificate
WARN[0306] [etcd] host [rke2.###.net] failed to check etcd health: failed to get /health for host [rke2.###.###]: Get "https://rke2.###.net:2379/health": remote error: tls: bad certificate
WARN[0399] [etcd] host [rke3.###.net] failed to check etcd health: failed to get /health for host [rke3.###.###]: Get "https://rke3.###.net:2379/health": remote error: tls: bad certificate
FATA[0399] [etcd] Failed to bring up Etcd Plane: etcd cluster is unhealthy: hosts [rke1.###.net,rke2.###.net,rke3.###.net] failed to report healthy. Check etcd container logs on each host for more information
```
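As the fatal message suggests, the etcd container logs on each host usually show why the TLS handshake failed. With RKE's container naming (the etcd container is named `etcd`):

```shell
# On each etcd host, look for TLS/certificate errors in the etcd container log
docker logs --tail 100 etcd 2>&1 | grep -iE "tls|certificate"
```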

I tried the workaround mentioned by @axeal, but it didn't help in my case.

I had the same issue using version v1.0.4, and my problem was solved by @axeal's answer. However, after deleting my `.rkestate` file I got this error and had to recreate it. This script might be handy if someone needs to recreate it.

@axeal the workaround is missing an additional step at the beginning, "Remove your `kube_config_<file>.yml` file", so that when you run `rke up` with RKE 0.1.x it re-generates a valid `kube_config_<file>.yml`.