kubernetes: etcd and kube-apiserver do not start after an unclean machine shutdown
I was told that this is the correct issue tracker for my problem. I previously posted this issue here
Environment
mcajkovs@ubuntu:~$ cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
mcajkovs@ubuntu:~$ uname -a
Linux ubuntu 4.15.0-88-generic #88-Ubuntu SMP Tue Feb 11 20:11:34 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
mcajkovs@ubuntu:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.3 LTS
Release: 18.04
Codename: bionic
mcajkovs@ubuntu:~$ kubectl version
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.2", GitCommit:"59603c6e503c87169aea6106f57b9f242f64df89", GitTreeState:"clean", BuildDate:"2020-01-18T23:30:10Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.0", GitCommit:"70132b0f130acc0bed193d9ba59dd186f0e634cf", GitTreeState:"clean", BuildDate:"2019-12-07T21:12:17Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}
Installation and setup
I’ve installed k8s on a virtual machine (VM) in VMware Workstation using the following steps:
sudo swapoff -a   # note: swap comes back after a reboot unless the swap entry in /etc/fstab is also removed/commented out
sudo apt-get update && sudo apt-get install -y apt-transport-https curl
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
cat <<EOF | sudo tee /etc/apt/sources.list.d/kubernetes.list
deb https://apt.kubernetes.io/ kubernetes-xenial main
EOF
sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
sudo kubeadm init --pod-network-cidr=10.244.0.0/16
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
kubectl taint nodes --all node-role.kubernetes.io/master-
curl https://docs.projectcalico.org/v3.10/manifests/calico.yaml -O
POD_CIDR="10.244.0.0/16"
sed -i -e "s?192.168.0.0/16?$POD_CIDR?g" calico.yaml
kubectl apply -f calico.yaml
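As a quick sanity check after applying the manifest (not part of the original steps), the node and the Calico/control-plane pods can be verified with:
kubectl get nodes
kubectl get pods -n kube-system -o wide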
cat << EOF | sudo tee -a /var/lib/kubelet/config.yaml
evictionHard:
  imagefs.available: 1%
  memory.available: 100Mi
  nodefs.available: 1%
  nodefs.inodesFree: 1%
EOF
sudo systemctl daemon-reload
sudo systemctl restart kubelet
cat << EOF | sudo tee /etc/docker/daemon.json
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "storage-driver": "overlay2"
}
EOF
sudo systemctl daemon-reload
sudo systemctl restart docker
docker info | grep -i driver
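The final docker info check should show roughly the following once the daemon has restarted with the new settings:
 Storage Driver: overlay2
 Logging Driver: json-file
 Cgroup Driver: systemd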
Problem
k8s does not start after boot. As a result I cannot use kubectl to communicate with k8s. I think the main problem is that the apiserver keeps restarting and etcd does not start:
mcajkovs@ubuntu:~$ docker ps -a | grep k8s
f0ba79d60407 41ef50a5f06a "kube-apiserver --ad…" 9 seconds ago Up 8 seconds k8s_kube-apiserver_kube-apiserver-ubuntu_kube-system_3e49883d5c321b4236e7bed14c988ccb_133
1f91362f5c91 303ce5db0e90 "etcd --advertise-cl…" 23 seconds ago Exited (2) 22 seconds ago k8s_etcd_etcd-ubuntu_kube-system_94d759dceed198fa6db05be9ea52a98a_198
1c8c2f27bddb 41ef50a5f06a "kube-apiserver --ad…" 46 seconds ago Exited (2) 24 seconds ago k8s_kube-apiserver_kube-apiserver-ubuntu_kube-system_3e49883d5c321b4236e7bed14c988ccb_132
4b02dcf082bf f52d4c527ef2 "kube-scheduler --au…" About an hour ago Up About an hour k8s_kube-scheduler_kube-scheduler-ubuntu_kube-system_9c994ea62a2d8d6f1bb7498f10aa6fcf_0
dd9c5e31d7c0 da5fd66c4068 "kube-controller-man…" About an hour ago Up About an hour k8s_kube-controller-manager_kube-controller-manager-ubuntu_kube-system_8482ef84d3b4e5e90f4462818c76a7e9_0
2aa1c151d65b k8s.gcr.io/pause:3.1 "/pause" About an hour ago Up About an hour k8s_POD_kube-apiserver-ubuntu_kube-system_3e49883d5c321b4236e7bed14c988ccb_0
98ecfd1e9825 k8s.gcr.io/pause:3.1 "/pause" About an hour ago Up About an hour k8s_POD_etcd-ubuntu_kube-system_94d759dceed198fa6db05be9ea52a98a_0
284d3f50112a k8s.gcr.io/pause:3.1 "/pause" About an hour ago Up About an hour k8s_POD_kube-scheduler-ubuntu_kube-system_9c994ea62a2d8d6f1bb7498f10aa6fcf_0
56f57710e623 k8s.gcr.io/pause:3.1 "/pause" About an hour ago Up About an hour k8s_POD_kube-controller-manager-ubuntu_kube-system_8482ef84d3b4e5e90f4462818c76a7e9_0
mcajkovs@ubuntu:~$ kubectl get all -A
The connection to the server 192.168.195.130:6443 was refused - did you specify the right host or port?
mcajkovs@ubuntu:~$ journalctl -xeu kubelet
Feb 26 13:19:14 ubuntu kubelet[125589]: E0226 13:19:14.131947 125589 kubelet.go:2263] node "ubuntu" not found
Feb 26 13:19:21 ubuntu kubelet[125589]: E0226 13:19:21.802070 125589 kubelet_node_status.go:92] Unable to register node "ubuntu" with API server: Post https://192.168.195.130:6443/api/v1/nodes: net/http: TLS handshake timeout
Feb 26 13:19:28 ubuntu kubelet[125589]: E0226 13:19:28.684914 125589 eviction_manager.go:246] eviction manager: failed to get summary stats: failed to get node info: node "ubuntu" not found
Feb 26 13:19:31 ubuntu kubelet[125589]: E0226 13:19:31.546593 125589 reflector.go:153] k8s.io/kubernetes/pkg/kubelet/kubelet.go:458: Failed to list *v1.Node: Get https://192.168.195.130:6443/api/v1/nodes?fieldSelector=metadata.name%3Dubuntu&limit=500&resourceVersion=0: dial tcp 192.168.195.130:6443: connect: connection refused
I’ve also tried the following, with the same result:
sudo systemctl stop kubelet
docker ps -a | grep k8s_ | awk '{print $1}' | while read i; do docker rm -f "$i"; done
sudo systemctl start kubelet
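One thing that would help pin this down is the etcd container's own log; after a power loss it usually shows why the member refuses to start (for example a damaged WAL or snapshot file under /var/lib/etcd). Roughly:
# Container IDs taken from the `docker ps -a` output above.
docker logs --tail 50 1f91362f5c91   # etcd (exited)
docker logs --tail 50 1c8c2f27bddb   # kube-apiserver (exited)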
What you expected to happen?
k8s should start after the VM boots.
How to reproduce it (as minimally and precisely as possible)?
Shut down the VM uncleanly (e.g. kill the VM process, power off the host machine, etc.) and start the VM again.
Anything else we need to know?
I have NOT observed this problem when I shut down the VM correctly. But if the VM is shut down uncleanly (killed process, etc.) then this happens. If I do kubeadm reset and set up k8s again according to the steps above, then it works.
Content of the /etc/kubernetes/manifests files
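For reference, that recovery path boils down to roughly the following (it discards all cluster state, so it is a workaround rather than a fix):
sudo kubeadm reset -f
sudo kubeadm init --pod-network-cidr=10.244.0.0/16
# ...then repeat the kubeconfig, taint and Calico steps from the setup section above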
etcd.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: etcd
    tier: control-plane
  name: etcd
  namespace: kube-system
spec:
  containers:
  - command:
    - etcd
    - --advertise-client-urls=https://192.168.195.130:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --initial-advertise-peer-urls=https://192.168.195.130:2380
    - --initial-cluster=ubuntu=https://192.168.195.130:2380
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://192.168.195.130:2379
    - --listen-metrics-urls=http://127.0.0.1:2381
    - --listen-peer-urls=https://192.168.195.130:2380
    - --name=ubuntu
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    image: k8s.gcr.io/etcd:3.4.3-0
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /health
        port: 2381
        scheme: HTTP
      initialDelaySeconds: 15
      timeoutSeconds: 15
    name: etcd
    resources: {}
    volumeMounts:
    - mountPath: /var/lib/etcd
      name: etcd-data
    - mountPath: /etc/kubernetes/pki/etcd
      name: etcd-certs
  hostNetwork: true
  priorityClassName: system-cluster-critical
  volumes:
  - hostPath:
      path: /etc/kubernetes/pki/etcd
      type: DirectoryOrCreate
    name: etcd-certs
  - hostPath:
      path: /var/lib/etcd
      type: DirectoryOrCreate
    name: etcd-data
status: {}
kube-apiserver.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-apiserver
    tier: control-plane
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-apiserver
    - --advertise-address=192.168.195.130
    - --allow-privileged=true
    - --authorization-mode=Node,RBAC
    - --client-ca-file=/etc/kubernetes/pki/ca.crt
    - --enable-admission-plugins=NodeRestriction
    - --enable-bootstrap-token-auth=true
    - --etcd-cafile=/etc/kubernetes/pki/etcd/ca.crt
    - --etcd-certfile=/etc/kubernetes/pki/apiserver-etcd-client.crt
    - --etcd-keyfile=/etc/kubernetes/pki/apiserver-etcd-client.key
    - --etcd-servers=https://127.0.0.1:2379
    - --insecure-port=0
    - --kubelet-client-certificate=/etc/kubernetes/pki/apiserver-kubelet-client.crt
    - --kubelet-client-key=/etc/kubernetes/pki/apiserver-kubelet-client.key
    - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
    - --proxy-client-cert-file=/etc/kubernetes/pki/front-proxy-client.crt
    - --proxy-client-key-file=/etc/kubernetes/pki/front-proxy-client.key
    - --requestheader-allowed-names=front-proxy-client
    - --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt
    - --requestheader-extra-headers-prefix=X-Remote-Extra-
    - --requestheader-group-headers=X-Remote-Group
    - --requestheader-username-headers=X-Remote-User
    - --secure-port=6443
    - --service-account-key-file=/etc/kubernetes/pki/sa.pub
    - --service-cluster-ip-range=10.96.0.0/12
    - --tls-cert-file=/etc/kubernetes/pki/apiserver.crt
    - --tls-private-key-file=/etc/kubernetes/pki/apiserver.key
    image: k8s.gcr.io/kube-apiserver:v1.17.2
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 192.168.195.130
        path: /healthz
        port: 6443
        scheme: HTTPS
      initialDelaySeconds: 15
      timeoutSeconds: 15
    name: kube-apiserver
    resources:
      requests:
        cpu: 250m
    volumeMounts:
    - mountPath: /etc/ssl/certs
      name: ca-certs
      readOnly: true
    - mountPath: /etc/ca-certificates
      name: etc-ca-certificates
      readOnly: true
    - mountPath: /etc/pki
      name: etc-pki
      readOnly: true
    - mountPath: /etc/kubernetes/pki
      name: k8s-certs
      readOnly: true
    - mountPath: /usr/local/share/ca-certificates
      name: usr-local-share-ca-certificates
      readOnly: true
    - mountPath: /usr/share/ca-certificates
      name: usr-share-ca-certificates
      readOnly: true
  hostNetwork: true
  priorityClassName: system-cluster-critical
  volumes:
  - hostPath:
      path: /etc/ssl/certs
      type: DirectoryOrCreate
    name: ca-certs
  - hostPath:
      path: /etc/ca-certificates
      type: DirectoryOrCreate
    name: etc-ca-certificates
  - hostPath:
      path: /etc/pki
      type: DirectoryOrCreate
    name: etc-pki
  - hostPath:
      path: /etc/kubernetes/pki
      type: DirectoryOrCreate
    name: k8s-certs
  - hostPath:
      path: /usr/local/share/ca-certificates
      type: DirectoryOrCreate
    name: usr-local-share-ca-certificates
  - hostPath:
      path: /usr/share/ca-certificates
      type: DirectoryOrCreate
    name: usr-share-ca-certificates
status: {}
kube-controller-manager.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-controller-manager
    tier: control-plane
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-controller-manager
    - --allocate-node-cidrs=true
    - --authentication-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --authorization-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --bind-address=127.0.0.1
    - --client-ca-file=/etc/kubernetes/pki/ca.crt
    - --cluster-cidr=10.244.0.0/16
    - --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt
    - --cluster-signing-key-file=/etc/kubernetes/pki/ca.key
    - --controllers=*,bootstrapsigner,tokencleaner
    - --kubeconfig=/etc/kubernetes/controller-manager.conf
    - --leader-elect=true
    - --node-cidr-mask-size=24
    - --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt
    - --root-ca-file=/etc/kubernetes/pki/ca.crt
    - --service-account-private-key-file=/etc/kubernetes/pki/sa.key
    - --service-cluster-ip-range=10.96.0.0/12
    - --use-service-account-credentials=true
    image: k8s.gcr.io/kube-controller-manager:v1.17.2
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10257
        scheme: HTTPS
      initialDelaySeconds: 15
      timeoutSeconds: 15
    name: kube-controller-manager
    resources:
      requests:
        cpu: 200m
    volumeMounts:
    - mountPath: /etc/ssl/certs
      name: ca-certs
      readOnly: true
    - mountPath: /etc/ca-certificates
      name: etc-ca-certificates
      readOnly: true
    - mountPath: /etc/pki
      name: etc-pki
      readOnly: true
    - mountPath: /usr/libexec/kubernetes/kubelet-plugins/volume/exec
      name: flexvolume-dir
    - mountPath: /etc/kubernetes/pki
      name: k8s-certs
      readOnly: true
    - mountPath: /etc/kubernetes/controller-manager.conf
      name: kubeconfig
      readOnly: true
    - mountPath: /usr/local/share/ca-certificates
      name: usr-local-share-ca-certificates
      readOnly: true
    - mountPath: /usr/share/ca-certificates
      name: usr-share-ca-certificates
      readOnly: true
  hostNetwork: true
  priorityClassName: system-cluster-critical
  volumes:
  - hostPath:
      path: /etc/ssl/certs
      type: DirectoryOrCreate
    name: ca-certs
  - hostPath:
      path: /etc/ca-certificates
      type: DirectoryOrCreate
    name: etc-ca-certificates
  - hostPath:
      path: /etc/pki
      type: DirectoryOrCreate
    name: etc-pki
  - hostPath:
      path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec
      type: DirectoryOrCreate
    name: flexvolume-dir
  - hostPath:
      path: /etc/kubernetes/pki
      type: DirectoryOrCreate
    name: k8s-certs
  - hostPath:
      path: /etc/kubernetes/controller-manager.conf
      type: FileOrCreate
    name: kubeconfig
  - hostPath:
      path: /usr/local/share/ca-certificates
      type: DirectoryOrCreate
    name: usr-local-share-ca-certificates
  - hostPath:
      path: /usr/share/ca-certificates
      type: DirectoryOrCreate
    name: usr-share-ca-certificates
status: {}
kube-scheduler.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-scheduler
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=127.0.0.1
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
    image: k8s.gcr.io/kube-scheduler:v1.17.2
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 15
      timeoutSeconds: 15
    name: kube-scheduler
    resources:
      requests:
        cpu: 100m
    volumeMounts:
    - mountPath: /etc/kubernetes/scheduler.conf
      name: kubeconfig
      readOnly: true
  hostNetwork: true
  priorityClassName: system-cluster-critical
  volumes:
  - hostPath:
      path: /etc/kubernetes/scheduler.conf
      type: FileOrCreate
    name: kubeconfig
status: {}
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 6
- Comments: 51 (21 by maintainers)
Experiencing this as well after a power loss.
Why is this issue closed? No concrete solution has been found yet. I am facing the same issue today.
Same issue with CentOS 7.7.1908: etcd and the apiserver restart many times, with logs like the ones above. After my temporary fix I found the cluster really had been wiped clean; only the static pods were left.
Tip: take a backup while the cluster is healthy!
Is this what you want?
I ended up setting up a scheduled job as a workaround in the meantime, with something like this to take a backup every 6 hours and keep 30 days of snapshots:
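The original snippet is not included above; the following is only a rough sketch of such a job, assuming etcdctl is installed on the host and using the endpoint and certificate paths from the etcd.yaml manifest above (the script path and backup directory are arbitrary choices, not from the original):
#!/bin/bash
# /usr/local/bin/etcd-backup.sh -- sketch of a periodic etcd backup job
set -euo pipefail

BACKUP_DIR=/var/backups/etcd
mkdir -p "$BACKUP_DIR"

# Snapshot the running etcd member via its local client endpoint,
# using the certs generated by kubeadm (see etcd.yaml above).
ETCDCTL_API=3 etcdctl snapshot save "$BACKUP_DIR/snapshot-$(date +%F-%H%M).db" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Keep roughly 30 days of snapshots.
find "$BACKUP_DIR" -name 'snapshot-*.db' -mtime +30 -delete

with a cron entry along the lines of:
# /etc/cron.d/etcd-backup -- run every 6 hours as root
0 */6 * * * root /usr/local/bin/etcd-backup.sh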
Any updates on this? I have the same issue with a single-master kubernetes v1.19.0-rc.3 and etcd v3.4.9-1. I’ve tried renaming /var/lib/etcd/member to /var/lib/etcd/member.bak and then issuing sudo systemctl restart kubelet, but after those steps I get only one service running in the cluster.
If I understand correctly, the issue is due to a broken etcd database. Are there any best practices for periodically backing up the etcd database during cluster operation? Or is this type of issue expected when a cluster is shut down uncleanly? What is the best way to avoid it?
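For what it's worth, given a snapshot like the one from the backup job above, recovery on a single-node kubeadm cluster is roughly the following sketch (the member name and peer URL are taken from the etcd.yaml manifest above; the snapshot filename is a placeholder):
# Stop the kubelet so the static pods are not restarted mid-restore.
sudo systemctl stop kubelet
docker ps -q --filter name=k8s_etcd | xargs -r docker stop

# Move the (possibly corrupted) data directory aside.
sudo mv /var/lib/etcd /var/lib/etcd.broken

# Rebuild the data directory from the snapshot.
sudo ETCDCTL_API=3 etcdctl snapshot restore /var/backups/etcd/<snapshot-file>.db \
  --name=ubuntu \
  --initial-cluster=ubuntu=https://192.168.195.130:2380 \
  --initial-advertise-peer-urls=https://192.168.195.130:2380 \
  --data-dir=/var/lib/etcd

sudo systemctl start kubelet
Whether a given failure is recoverable without a snapshot depends on how badly the WAL/snap files were damaged, which is why the "back up while the cluster is healthy" advice above matters.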