kubernetes: Kubernetes API Server cannot be started after improper reboot

What happened?

Hello! I've got a simple Kubernetes cluster (version 1.23.1) with one master node and two worker nodes. Everything worked fine until I accidentally rebooted my host computer along with all of the virtual machines. After the VMs started again, kubelet is running as before, but commands such as:

kubectl get nodes and kubectl get pods return "The connection to the server myhost:6443 was refused - did you specify the right host or port?". I then checked systemctl status kubelet and journalctl -xeu kubelet and didn't see anything unusual. The kubelet log constantly repeats "Error getting node" err="node \"master01\" not found" and periodically "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"etcd\" with CrashLoopBackOff". If I remember correctly, such messages sometimes appeared even when the cluster was working.
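Since the API server and etcd run as static pods managed by kubelet, their containers can also be inspected directly through the container runtime when kubectl itself cannot connect. A minimal sketch, assuming crictl is installed and CRI-O listens on its usual socket (an assumption; adjust the endpoint if your setup differs):

```console
# List all control-plane containers, including exited ones
$ sudo crictl --runtime-endpoint unix:///var/run/crio/crio.sock ps -a | grep -E 'etcd|kube-apiserver'

# Show the logs of a crashed container (replace CONTAINER_ID with an ID from the output above)
$ sudo crictl --runtime-endpoint unix:///var/run/crio/crio.sock logs CONTAINER_ID
```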

Then I decided to check netstat with netstat -tupan | grep LISTEN and saw that Recv-Q keeps increasing and that the :::6443 record disappears after Recv-Q reaches about 25-30.

```console
tcp6      17     0    :::6443      :::*      LISTEN      -
```

Whenever I execute systemctl restart kubelet, the :::6443 record appears again, Recv-Q increments, and then the record disappears from the netstat output.
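To watch this behaviour without re-running netstat by hand, something like the following can be used (a sketch; ss, watch and curl are assumed to be available, and the health probe skips certificate verification):

```console
# Re-check the listening socket for port 6443 every two seconds
$ watch -n 2 'ss -tlnp | grep 6443'

# Probe the API server health endpoint directly on the master node
$ curl -k https://localhost:6443/healthz
```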

/sig api-machinery
/kind bug

What did you expect to happen?

After rebooting the nodes, everything would work as before.

How can we reproduce it (as minimally and precisely as possible)?

In my case the nodes run in VMware Player 16, with one master node and two worker nodes. I turned off the host machine without first shutting down or suspending the VMs (a power outage in my case; when the host is shut down properly, it can preserve the state of the VMs and restore them successfully). After powering the VMs back on, you will see that kubelet is started and running, but the API server is down (netstat does not show it among the listening services).
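For contrast, a graceful sequence before powering off the host would look roughly like this; it is only a sketch, and worker01/worker02 are placeholder node names rather than names taken from this cluster:

```console
# Cordon and drain each worker so workloads are evicted cleanly
$ kubectl drain worker01 --ignore-daemonsets --delete-emptydir-data
$ kubectl drain worker02 --ignore-daemonsets --delete-emptydir-data

# Shut the guest OS down inside each VM before closing VMware Player
$ sudo shutdown -h now
```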

Anything else we need to know?

No response

Kubernetes version

1.23.1

Cloud provider

Self-hosted

OS version

```console
# On Linux:
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.3 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.3 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

$ uname -a
Linux master 5.4.0-92-generic #103-Ubuntu SMP Fri Nov 26 16:13:00 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
```

Install tools

apt

Container runtime (CRI) and version (if applicable)

crio version 1.17.5 commit: "251a47ba0930c28e83dea8d409b79f568dd711aa-dirty"

Related plugins (CNI, CSI, …) and versions (if applicable)

Calico self-managed on premises for Kubernetes.

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 18 (10 by maintainers)

Most upvoted comments

@zbogdan7, is your etcd running properly? If not, it may be an issue related to non-graceful termination of etcd.

Yes, that's right, the termination of the entire cluster was non-graceful in my case. Afterwards I couldn't start it again and had to re-initialize the cluster. But I'd like to know how I could start it again, even at the risk of losing some data. P.S. I also wouldn't mind if you could suggest some standard configurations for handling such non-graceful terminations and unpredictable restarts/shutdowns.

Can you first check whether etcd is able to start? If not, then it's the issue I described here. If you don't have any important data that you want to recover, you could simply delete the etcd data directory (usually at /var/etcd/data).
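A minimal sketch of that check and reset, assuming a kubeadm-style static-pod etcd and crictl on the master node; the data directory path is whatever etcd is configured with (/var/etcd/data as mentioned above, or /var/lib/etcd by default with kubeadm), so verify it in the etcd manifest before touching anything:

```console
# See whether the etcd container starts and why it exits
$ sudo crictl ps -a --name etcd
$ sudo crictl logs ETCD_CONTAINER_ID

# Confirm the configured data directory before removing anything
$ grep data-dir /etc/kubernetes/manifests/etcd.yaml

# If the data is not needed, move it aside rather than deleting it outright
$ sudo mv /var/lib/etcd /var/lib/etcd.bak
```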

For confirmation, could you post the etcd startup log?

Thanks, I've already re-initialized the cluster. I'll try to reproduce the issue and let you know the results.