harvester: [Doc] user report: after adding a 3rd node into cluster, the 2nd node goes to cordoned/Unavailable state

Describe the bug

User @rajesh1084 reports: “I’m trying to build a 3-node Harvester HCI setup. As soon as I add the 3rd node, 2nd node goes to cordoned/Unavailable state (Kubelet stopped posting node status). Any thoughts?”


For a better understanding of the cluster, please note: harv-node3 is the second node that joined the cluster.

harv-node3 CANNOT recover by itself.

Timeline:

harv-node1, the first node in the cluster
harv-node3, the second node in the cluster
  Oct 20 11:47:43 harv-node3 systemd[1]: Finished Rancher Bootstrap.

  it first joined as an agent node, until Oct 20 18:37:01

harv-node2, the third node, joins the cluster
  
harv-node3 switches its role from agent to server
  
Oct 20 18:37:01 harv-node3 systemd[1]: rke2-agent.service: Unit process 46440 (containerd-shim) remains running after unit>
Oct 20 18:37:01 harv-node3 systemd[1]: Stopped Rancher Kubernetes Engine v2 (agent).
  
  it then becomes an rke2 server
  
  but rke2-server.service keeps reporting errors continuously (a quick check is sketched below)
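
To confirm the agent-to-server role switch and capture the first errors after the promotion, something like the following can be run on harv-node3 (a sketch; the timestamp is taken from the log lines above):

harv-node3:~ # systemctl status rke2-agent rke2-server
harv-node3:~ # journalctl -u rke2-server --since "2022-10-20 18:37:00" --no-pager | head -n 100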

To Reproduce

Steps to reproduce the behavior:

  1. Go to ‘…’

Expected behavior

Each node should be in a healthy state.

Support bundle

Posted in: https://github.com/harvester/harvester/issues/3039#issuecomment-1293019045

harv-node3 is abnormal, and the support bundle does NOT include the related file. The file below was generated via journalctl --unit=rke2-server on harv-node3.

rke2-server.log
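
For reference, a log like the one above can be regenerated on the node with (a sketch):

harv-node3:~ # journalctl --unit=rke2-server --no-pager > rke2-server.log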

harv-node1:~ # for ip in {172.26.50.135,172.26.50.137,172.26.50.138}; do echo $ip; ssh rancher@$ip date; done
172.26.50.135
Wed Oct 26 10:11:59 UTC 2022
172.26.50.137
Wed Oct 26 10:12:00 UTC 2022
172.26.50.138
Wed Oct 26 10:12:00 UTC 2022
harv-node3:~ # ping 172.26.50.135  (harv-node1)
PING 172.26.50.135 (172.26.50.135) 56(84) bytes of data.
64 bytes from 172.26.50.135: icmp_seq=1 ttl=64 time=0.155 ms
64 bytes from 172.26.50.135: icmp_seq=2 ttl=64 time=0.169 ms
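
ICMP and SSH reachability look fine; it may also be worth checking the ports RKE2 actually needs between server nodes, since the apiserver, supervisor, and etcd ports matter more here than ping does. A sketch, assuming a netcat build that supports -z (IPs are taken from the output above; 6443 = kube-apiserver, 9345 = RKE2 supervisor, 2379/2380 = etcd client/peer):

harv-node3:~ # for port in 6443 9345 2379 2380; do nc -zv -w 3 172.26.50.135 $port; done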

Environment

  • Harvester ISO version:
  • Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630):

Additional context

Add any other context about the problem here.

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 25 (11 by maintainers)

Most upvoted comments

I had not configured no-proxy when I faced this issue.
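
For anyone hitting this behind a proxy: Harvester keeps its proxy configuration in the http-proxy setting, and the cluster-internal addresses need to be listed in noProxy so node-to-node traffic (supervisor, etcd, kubelet) is not routed through the proxy. A minimal sketch of what that setting can look like (the proxy URL and CIDRs here are illustrative values, not taken from this cluster):

kubectl get settings.harvesterhci.io http-proxy -o yaml
# the value field is a JSON string, e.g.:
# {"httpProxy":"http://proxy.example.com:3128","httpsProxy":"http://proxy.example.com:3128","noProxy":"localhost,127.0.0.1,10.0.0.0/8,172.26.50.0/24,.svc,.cluster.local"}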

I’m confused, the issue mentions that adding a third node causes the second node to become unavailable. I’m only seeing logs here from node 1 and 3 - where are the logs from node 2?

I will also say that the way rajesh1084 has pasted the logs and CLI output into the comment makes it very hard to read and follow this thread. Please surround things with a code block:

    ```
    output goes here
    ```

Attached logs:

  • kube-system_kube-apiserver-harv-node1_5ca3c4e79a65ef37c1d591197744493f.zip
  • kube-system_etcd-harv-node3_67c7ddaaa7ef609d17948e7397585b18.zip
  • kubelet_harv-node3.zip

rancher@harv-node1:~> ps aux | grep " kube-apiserver "
root     12049  251  1.1 4699348 2998996 ?     Ssl  Oct20 51006:47 kube-apiserver --audit-policy-file=/etc/rancher/rke2/config.yaml.d/92-harvester-kube-audit-policy.yaml --audit-log-path=/var/lib/rancher/rke2/server/logs/audit.log --audit-log-maxage=30 --audit-log-maxbackup=10 --audit-log-maxsize=100 --allow-privileged=true --anonymous-auth=false --api-audiences=https://kubernetes.default.svc.cluster.local,rke2 --authorization-mode=Node,RBAC --bind-address=0.0.0.0 --cert-dir=/var/lib/rancher/rke2/server/tls/temporary-certs --client-ca-file=/var/lib/rancher/rke2/server/tls/client-ca.crt --egress-selector-config-file=/var/lib/rancher/rke2/server/etc/egress-selector-config.yaml --enable-admission-plugins=NodeRestriction,PodSecurityPolicy --enable-aggregator-routing=true --encryption-provider-config=/var/lib/rancher/rke2/server/cred/encryption-config.json --etcd-cafile=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt --etcd-certfile=/var/lib/rancher/rke2/server/tls/etcd/client.crt --etcd-keyfile=/var/lib/rancher/rke2/server/tls/etcd/client.key --etcd-servers=https://127.0.0.1:2379 --feature-gates=JobTrackingWithFinalizers=true --kubelet-certificate-authority=/var/lib/rancher/rke2/server/tls/server-ca.crt --kubelet-client-certificate=/var/lib/rancher/rke2/server/tls/client-kube-apiserver.crt --kubelet-client-key=/var/lib/rancher/rke2/server/tls/client-kube-apiserver.key --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname --profiling=false --proxy-client-cert-file=/var/lib/rancher/rke2/server/tls/client-auth-proxy.crt --proxy-client-key-file=/var/lib/rancher/rke2/server/tls/client-auth-proxy.key --requestheader-allowed-names=system:auth-proxy --requestheader-client-ca-file=/var/lib/rancher/rke2/server/tls/request-header-ca.crt --requestheader-extra-headers-prefix=X-Remote-Extra- --requestheader-group-headers=X-Remote-Group --requestheader-username-headers=X-Remote-User --secure-port=6443 --service-account-issuer=https://kubernetes.default.svc.cluster.local --service-account-key-file=/var/lib/rancher/rke2/server/tls/service.key --service-account-signing-key-file=/var/lib/rancher/rke2/server/tls/service.key --service-cluster-ip-range=10.53.0.0/16 --service-node-port-range=30000-32767 --storage-backend=etcd3 --tls-cert-file=/var/lib/rancher/rke2/server/tls/serving-kube-apiserver.crt --tls-private-key-file=/var/lib/rancher/rke2/server/tls/serving-kube-apiserver.key

rancher@harv-node1:~> ps aux | grep " etcd "
root     12046 62.2  0.2 11779908 582260 ?     Ssl  Oct20 12652:04 etcd --config-file=/var/lib/rancher/rke2/server/db/etcd/config

rancher@harv-node1:~> ps aux | grep " kubelet "
root      9976 12.3  0.0 981020 212944 ?       Sl   Oct20 2508:51 kubelet --volume-plugin-dir=/var/lib/kubelet/volumeplugins --file-check-frequency=5s --sync-frequency=30s --address=0.0.0.0 --alsologtostderr=false --anonymous-auth=false --authentication-token-webhook=true --authorization-mode=Webhook --cgroup-driver=systemd --client-ca-file=/var/lib/rancher/rke2/agent/client-ca.crt --cloud-provider=external --cluster-dns=10.53.0.10 --cluster-domain=cluster.local --container-runtime-endpoint=unix:///run/k3s/containerd/containerd.sock --containerd=/run/k3s/containerd/containerd.sock --eviction-hard=imagefs.available<5%,nodefs.available<5% --eviction-minimum-reclaim=imagefs.available=10%,nodefs.available=10% --fail-swap-on=false --healthz-bind-address=127.0.0.1 --hostname-override=harv-node1 --kubeconfig=/var/lib/rancher/rke2/agent/kubelet.kubeconfig --log-file=/var/lib/rancher/rke2/agent/logs/kubelet.log --log-file-max-size=50 --logtostderr=false --node-labels=rke.cattle.io/machine=a4f0df05-abb4-435c-8fe2-ddba5042ac2d --pod-infra-container-image=index.docker.io/rancher/pause:3.6 --pod-manifest-path=/var/lib/rancher/rke2/agent/pod-manifests --read-only-port=0 --resolv-conf=/etc/resolv.conf --serialize-image-pulls=false --stderrthreshold=FATAL --tls-cert-file=/var/lib/rancher/rke2/agent/serving-kubelet.crt --tls-private-key-file=/var/lib/rancher/rke2/agent/serving-kubelet.key

rancher@harv-node1:~> sudo cat /var/lib/rancher/rke2/server/db/etcd/config
advertise-client-urls: https://172.26.50.135:2379
client-transport-security:
  cert-file: /var/lib/rancher/rke2/server/tls/etcd/server-client.crt
  client-cert-auth: true
  key-file: /var/lib/rancher/rke2/server/tls/etcd/server-client.key
  trusted-ca-file: /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
data-dir: /var/lib/rancher/rke2/server/db/etcd
election-timeout: 5000
experimental-initial-corrupt-check: true
heartbeat-interval: 500
initial-advertise-peer-urls: https://172.26.50.135:2380
initial-cluster: harv-node1-483f6823=https://172.26.50.135:2380
initial-cluster-state: new
listen-client-urls: https://127.0.0.1:2379,https://172.26.50.135:2379
listen-metrics-urls: http://127.0.0.1:2381
listen-peer-urls: https://127.0.0.1:2380,https://172.26.50.135:2380
log-outputs:
- stderr
logger: zap
name: harv-node1-483f6823
peer-transport-security:
  cert-file: /var/lib/rancher/rke2/server/tls/etcd/peer-server-client.crt
  client-cert-auth: true
  key-file: /var/lib/rancher/rke2/server/tls/etcd/peer-server-client.key
  trusted-ca-file: /var/lib/rancher/rke2/server/tls/etcd/peer-ca.crt
snapshot-count: 10000

rancher@harv-node3:~> ps aux | grep " kubelet "
root     26821  4.3  0.0 835336 120900 ?       Sl   10:20   0:16 kubelet --volume-plugin-dir=/var/lib/kubelet/volumeplugins --file-check-frequency=5s --sync-frequency=30s --address=0.0.0.0 --alsologtostderr=false --anonymous-auth=false --authentication-token-webhook=true --authorization-mode=Webhook --cgroup-driver=systemd --client-ca-file=/var/lib/rancher/rke2/agent/client-ca.crt --cloud-provider=external --cluster-dns=10.53.0.10 --cluster-domain=cluster.local --container-runtime-endpoint=unix:///run/k3s/containerd/containerd.sock --containerd=/run/k3s/containerd/containerd.sock --eviction-hard=imagefs.available<5%,nodefs.available<5% --eviction-minimum-reclaim=imagefs.available=10%,nodefs.available=10% --fail-swap-on=false --healthz-bind-address=127.0.0.1 --hostname-override=harv-node3 --kubeconfig=/var/lib/rancher/rke2/agent/kubelet.kubeconfig --log-file=/var/lib/rancher/rke2/agent/logs/kubelet.log --log-file-max-size=50 --logtostderr=false --node-labels=harvesterhci.io/managed=true,rke.cattle.io/machine=920fc48d-3f2b-42e0-9001-5f8e8d492dbd --pod-infra-container-image=index.docker.io/rancher/pause:3.6 --pod-manifest-path=/var/lib/rancher/rke2/agent/pod-manifests --read-only-port=0 --resolv-conf=/etc/resolv.conf --serialize-image-pulls=false --stderrthreshold=FATAL --tls-cert-file=/var/lib/rancher/rke2/agent/serving-kubelet.crt --tls-private-key-file=/var/lib/rancher/rke2/agent/serving-kubelet.key

rancher@harv-node3:~> ps aux | grep " etcd "
rancher  34383  0.0  0.0  10248   748 pts/0    S+   10:28   0:00 grep  etcd
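
Note that the grep above only matches itself: no etcd process is running on harv-node3 even though it has been promoted to a server node. Whether the etcd static pod was ever started can be checked directly against containerd (a sketch; the socket path is the one shown in the kubelet flags above, and the crictl location assumes a default RKE2 install):

harv-node3:~ # systemctl is-active rke2-server
harv-node3:~ # sudo /var/lib/rancher/rke2/bin/crictl --runtime-endpoint unix:///run/k3s/containerd/containerd.sock ps -a | grep etcd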


rancher@harv-node3:~> sudo cat /var/lib/rancher/rke2/server/db/etcd/config
advertise-client-urls: https://172.26.50.138:2379
client-transport-security:
  cert-file: /var/lib/rancher/rke2/server/tls/etcd/server-client.crt
  client-cert-auth: true
  key-file: /var/lib/rancher/rke2/server/tls/etcd/server-client.key
  trusted-ca-file: /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
data-dir: /var/lib/rancher/rke2/server/db/etcd
election-timeout: 5000
experimental-initial-corrupt-check: true
heartbeat-interval: 500
initial-cluster: harv-node1-483f6823=https://172.26.50.135:2380,harv-node3-1fec3f96=https://172.26.50.138:2380
initial-cluster-state: existing
listen-client-urls: https://127.0.0.1:2379,https://172.26.50.138:2379
listen-metrics-urls: http://127.0.0.1:2381
listen-peer-urls: https://127.0.0.1:2380,https://172.26.50.138:2380
log-outputs:
- stderr
logger: zap
name: harv-node3-1fec3f96
peer-transport-security:
  cert-file: /var/lib/rancher/rke2/server/tls/etcd/peer-server-client.crt
  client-cert-auth: true
  key-file: /var/lib/rancher/rke2/server/tls/etcd/peer-server-client.key
  trusted-ca-file: /var/lib/rancher/rke2/server/tls/etcd/peer-ca.crt
snapshot-count: 10000
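
Given that the config above has initial-cluster-state: existing and lists harv-node3 as a member while no etcd process is running on that node, the member list as seen from the healthy node would show whether the join ever completed. A sketch, assuming the standard RKE2 kubeconfig path, the etcd static-pod name etcd-harv-node1, and the cert paths visible in the config above:

rancher@harv-node1:~> sudo kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml -n kube-system exec etcd-harv-node1 -- etcdctl \
  --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  --endpoints=https://127.0.0.1:2379 \
  member list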