kubernetes: etcd and kube-apiserver do not start after an unclean machine shutdown
I was told that this is the correct issue tracker for my problem. I previously posted this issue here
Environment
mcajkovs@ubuntu:~$ cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
mcajkovs@ubuntu:~$ uname -a
Linux ubuntu 4.15.0-88-generic #88-Ubuntu SMP Tue Feb 11 20:11:34 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
mcajkovs@ubuntu:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.3 LTS
Release: 18.04
Codename: bionic
mcajkovs@ubuntu:~$ kubectl version
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.2", GitCommit:"59603c6e503c87169aea6106f57b9f242f64df89", GitTreeState:"clean", BuildDate:"2020-01-18T23:30:10Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.0", GitCommit:"70132b0f130acc0bed193d9ba59dd186f0e634cf", GitTreeState:"clean", BuildDate:"2019-12-07T21:12:17Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}
Installation and setup
I’ve installed k8s on a virtual machine (VM) in VMware Workstation using the following steps:
sudo swapoff -a   # note: swap comes back after a reboot unless the swap entry in /etc/fstab is also removed/commented out
sudo apt-get update && sudo apt-get install -y apt-transport-https curl
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
cat <<EOF | sudo tee /etc/apt/sources.list.d/kubernetes.list
deb https://apt.kubernetes.io/ kubernetes-xenial main
EOF
sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
sudo kubeadm init --pod-network-cidr=10.244.0.0/16
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
kubectl taint nodes --all node-role.kubernetes.io/master-
curl https://docs.projectcalico.org/v3.10/manifests/calico.yaml -O
POD_CIDR="10.244.0.0/16"
sed -i -e "s?192.168.0.0/16?$POD_CIDR?g" calico.yaml
kubectl apply -f calico.yaml
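As a quick sanity check after applying the manifest (not part of the original steps), the node and the Calico/control-plane pods can be verified with:
kubectl get nodes
kubectl get pods -n kube-system -o wide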
cat << EOF | sudo tee -a /var/lib/kubelet/config.yaml
evictionHard:
  imagefs.available: 1%
  memory.available: 100Mi
  nodefs.available: 1%
  nodefs.inodesFree: 1%
EOF
sudo systemctl daemon-reload
sudo systemctl restart kubelet
cat << EOF | sudo tee /etc/docker/daemon.json
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "storage-driver": "overlay2"
}
EOF
sudo systemctl daemon-reload
sudo systemctl restart docker
docker info | grep -i driver
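The final docker info check should show roughly the following once the daemon has restarted with the new settings:
 Storage Driver: overlay2
 Logging Driver: json-file
 Cgroup Driver: systemd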
Problem
k8s does not start after boot. As a result I cannot use kubectl to communicate with k8s. I think the main problem is that the apiserver keeps restarting and etcd does not start:
mcajkovs@ubuntu:~$ docker ps -a | grep k8s
f0ba79d60407 41ef50a5f06a "kube-apiserver --ad…" 9 seconds ago Up 8 seconds k8s_kube-apiserver_kube-apiserver-ubuntu_kube-system_3e49883d5c321b4236e7bed14c988ccb_133
1f91362f5c91 303ce5db0e90 "etcd --advertise-cl…" 23 seconds ago Exited (2) 22 seconds ago k8s_etcd_etcd-ubuntu_kube-system_94d759dceed198fa6db05be9ea52a98a_198
1c8c2f27bddb 41ef50a5f06a "kube-apiserver --ad…" 46 seconds ago Exited (2) 24 seconds ago k8s_kube-apiserver_kube-apiserver-ubuntu_kube-system_3e49883d5c321b4236e7bed14c988ccb_132
4b02dcf082bf f52d4c527ef2 "kube-scheduler --au…" About an hour ago Up About an hour k8s_kube-scheduler_kube-scheduler-ubuntu_kube-system_9c994ea62a2d8d6f1bb7498f10aa6fcf_0
dd9c5e31d7c0 da5fd66c4068 "kube-controller-man…" About an hour ago Up About an hour k8s_kube-controller-manager_kube-controller-manager-ubuntu_kube-system_8482ef84d3b4e5e90f4462818c76a7e9_0
2aa1c151d65b k8s.gcr.io/pause:3.1 "/pause" About an hour ago Up About an hour k8s_POD_kube-apiserver-ubuntu_kube-system_3e49883d5c321b4236e7bed14c988ccb_0
98ecfd1e9825 k8s.gcr.io/pause:3.1 "/pause" About an hour ago Up About an hour k8s_POD_etcd-ubuntu_kube-system_94d759dceed198fa6db05be9ea52a98a_0
284d3f50112a k8s.gcr.io/pause:3.1 "/pause" About an hour ago Up About an hour k8s_POD_kube-scheduler-ubuntu_kube-system_9c994ea62a2d8d6f1bb7498f10aa6fcf_0
56f57710e623 k8s.gcr.io/pause:3.1 "/pause" About an hour ago Up About an hour k8s_POD_kube-controller-manager-ubuntu_kube-system_8482ef84d3b4e5e90f4462818c76a7e9_0
mcajkovs@ubuntu:~$ kubectl get all -A
The connection to the server 192.168.195.130:6443 was refused - did you specify the right host or port?
mcajkovs@ubuntu:~$ journalctl -xeu kubelet
Feb 26 13:19:14 ubuntu kubelet[125589]: E0226 13:19:14.131947 125589 kubelet.go:2263] node "ubuntu" not found
Feb 26 13:19:21 ubuntu kubelet[125589]: E0226 13:19:21.802070 125589 kubelet_node_status.go:92] Unable to register node "ubuntu" with API server: Post https://192.168.195.130:6443/api/v1/nodes: net/http: TLS handshake timeout
Feb 26 13:19:28 ubuntu kubelet[125589]: E0226 13:19:28.684914 125589 eviction_manager.go:246] eviction manager: failed to get summary stats: failed to get node info: node "ubuntu" not found
Feb 26 13:19:31 ubuntu kubelet[125589]: E0226 13:19:31.546593 125589 reflector.go:153] k8s.io/kubernetes/pkg/kubelet/kubelet.go:458: Failed to list *v1.Node: Get https://192.168.195.130:6443/api/v1/nodes?fieldSelector=metadata.name%3Dubuntu&limit=500&resourceVersion=0: dial tcp 192.168.195.130:6443: connect: connection refused
I’ve also tried the following, with the same result:
sudo systemctl stop kubelet
docker ps -a | grep k8s_ | awk '{print $1}' | while read i; do docker rm -f "$i"; done
sudo systemctl start kubelet
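One thing that would help pin this down is the etcd container's own log; after a power loss it usually shows why the member refuses to start (for example a damaged WAL or snapshot file under /var/lib/etcd). Roughly:
# Container IDs taken from the `docker ps -a` output above.
docker logs --tail 50 1f91362f5c91   # etcd (exited)
docker logs --tail 50 1c8c2f27bddb   # kube-apiserver (exited)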
What you expected to happen?
k8s should start after the VM boots.
How to reproduce it (as minimally and precisely as possible)?
Shut down the VM uncleanly (e.g. kill the VM process, power off the host machine, etc.) and start the VM again.
Anything else we need to know?
I have NOT observed this problem when I shut down the VM correctly. But if the VM is shut down uncleanly (killed process, etc.) then this happens. If I do kubeadm reset and set up k8s again according to the steps above, then it works.
Content of the /etc/kubernetes/manifests files
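For reference, that recovery path boils down to roughly the following (it discards all cluster state, so it is a workaround rather than a fix):
sudo kubeadm reset -f
sudo kubeadm init --pod-network-cidr=10.244.0.0/16
# ...then repeat the kubeconfig, taint and Calico steps from the setup section above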
etcd.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: etcd
    tier: control-plane
  name: etcd
  namespace: kube-system
spec:
  containers:
  - command:
    - etcd
    - --advertise-client-urls=https://192.168.195.130:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --initial-advertise-peer-urls=https://192.168.195.130:2380
    - --initial-cluster=ubuntu=https://192.168.195.130:2380
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://192.168.195.130:2379
    - --listen-metrics-urls=http://127.0.0.1:2381
    - --listen-peer-urls=https://192.168.195.130:2380
    - --name=ubuntu
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    image: k8s.gcr.io/etcd:3.4.3-0
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /health
        port: 2381
        scheme: HTTP
      initialDelaySeconds: 15
      timeoutSeconds: 15
    name: etcd
    resources: {}
    volumeMounts:
    - mountPath: /var/lib/etcd
      name: etcd-data
    - mountPath: /etc/kubernetes/pki/etcd
      name: etcd-certs
  hostNetwork: true
  priorityClassName: system-cluster-critical
  volumes:
  - hostPath:
      path: /etc/kubernetes/pki/etcd
      type: DirectoryOrCreate
    name: etcd-certs
  - hostPath:
      path: /var/lib/etcd
      type: DirectoryOrCreate
    name: etcd-data
status: {}
kube-apiserver.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-apiserver
    tier: control-plane
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-apiserver
    - --advertise-address=192.168.195.130
    - --allow-privileged=true
    - --authorization-mode=Node,RBAC
    - --client-ca-file=/etc/kubernetes/pki/ca.crt
    - --enable-admission-plugins=NodeRestriction
    - --enable-bootstrap-token-auth=true
    - --etcd-cafile=/etc/kubernetes/pki/etcd/ca.crt
    - --etcd-certfile=/etc/kubernetes/pki/apiserver-etcd-client.crt
    - --etcd-keyfile=/etc/kubernetes/pki/apiserver-etcd-client.key
    - --etcd-servers=https://127.0.0.1:2379
    - --insecure-port=0
    - --kubelet-client-certificate=/etc/kubernetes/pki/apiserver-kubelet-client.crt
    - --kubelet-client-key=/etc/kubernetes/pki/apiserver-kubelet-client.key
    - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
    - --proxy-client-cert-file=/etc/kubernetes/pki/front-proxy-client.crt
    - --proxy-client-key-file=/etc/kubernetes/pki/front-proxy-client.key
    - --requestheader-allowed-names=front-proxy-client
    - --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt
    - --requestheader-extra-headers-prefix=X-Remote-Extra-
    - --requestheader-group-headers=X-Remote-Group
    - --requestheader-username-headers=X-Remote-User
    - --secure-port=6443
    - --service-account-key-file=/etc/kubernetes/pki/sa.pub
    - --service-cluster-ip-range=10.96.0.0/12
    - --tls-cert-file=/etc/kubernetes/pki/apiserver.crt
    - --tls-private-key-file=/etc/kubernetes/pki/apiserver.key
    image: k8s.gcr.io/kube-apiserver:v1.17.2
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 192.168.195.130
        path: /healthz
        port: 6443
        scheme: HTTPS
      initialDelaySeconds: 15
      timeoutSeconds: 15
    name: kube-apiserver
    resources:
      requests:
        cpu: 250m
    volumeMounts:
    - mountPath: /etc/ssl/certs
      name: ca-certs
      readOnly: true
    - mountPath: /etc/ca-certificates
      name: etc-ca-certificates
      readOnly: true
    - mountPath: /etc/pki
      name: etc-pki
      readOnly: true
    - mountPath: /etc/kubernetes/pki
      name: k8s-certs
      readOnly: true
    - mountPath: /usr/local/share/ca-certificates
      name: usr-local-share-ca-certificates
      readOnly: true
    - mountPath: /usr/share/ca-certificates
      name: usr-share-ca-certificates
      readOnly: true
  hostNetwork: true
  priorityClassName: system-cluster-critical
  volumes:
  - hostPath:
      path: /etc/ssl/certs
      type: DirectoryOrCreate
    name: ca-certs
  - hostPath:
      path: /etc/ca-certificates
      type: DirectoryOrCreate
    name: etc-ca-certificates
  - hostPath:
      path: /etc/pki
      type: DirectoryOrCreate
    name: etc-pki
  - hostPath:
      path: /etc/kubernetes/pki
      type: DirectoryOrCreate
    name: k8s-certs
  - hostPath:
      path: /usr/local/share/ca-certificates
      type: DirectoryOrCreate
    name: usr-local-share-ca-certificates
  - hostPath:
      path: /usr/share/ca-certificates
      type: DirectoryOrCreate
    name: usr-share-ca-certificates
status: {}
kube-controller-manager.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-controller-manager
    tier: control-plane
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-controller-manager
    - --allocate-node-cidrs=true
    - --authentication-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --authorization-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --bind-address=127.0.0.1
    - --client-ca-file=/etc/kubernetes/pki/ca.crt
    - --cluster-cidr=10.244.0.0/16
    - --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt
    - --cluster-signing-key-file=/etc/kubernetes/pki/ca.key
    - --controllers=*,bootstrapsigner,tokencleaner
    - --kubeconfig=/etc/kubernetes/controller-manager.conf
    - --leader-elect=true
    - --node-cidr-mask-size=24
    - --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt
    - --root-ca-file=/etc/kubernetes/pki/ca.crt
    - --service-account-private-key-file=/etc/kubernetes/pki/sa.key
    - --service-cluster-ip-range=10.96.0.0/12
    - --use-service-account-credentials=true
    image: k8s.gcr.io/kube-controller-manager:v1.17.2
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10257
        scheme: HTTPS
      initialDelaySeconds: 15
      timeoutSeconds: 15
    name: kube-controller-manager
    resources:
      requests:
        cpu: 200m
    volumeMounts:
    - mountPath: /etc/ssl/certs
      name: ca-certs
      readOnly: true
    - mountPath: /etc/ca-certificates
      name: etc-ca-certificates
      readOnly: true
    - mountPath: /etc/pki
      name: etc-pki
      readOnly: true
    - mountPath: /usr/libexec/kubernetes/kubelet-plugins/volume/exec
      name: flexvolume-dir
    - mountPath: /etc/kubernetes/pki
      name: k8s-certs
      readOnly: true
    - mountPath: /etc/kubernetes/controller-manager.conf
      name: kubeconfig
      readOnly: true
    - mountPath: /usr/local/share/ca-certificates
      name: usr-local-share-ca-certificates
      readOnly: true
    - mountPath: /usr/share/ca-certificates
      name: usr-share-ca-certificates
      readOnly: true
  hostNetwork: true
  priorityClassName: system-cluster-critical
  volumes:
  - hostPath:
      path: /etc/ssl/certs
      type: DirectoryOrCreate
    name: ca-certs
  - hostPath:
      path: /etc/ca-certificates
      type: DirectoryOrCreate
    name: etc-ca-certificates
  - hostPath:
      path: /etc/pki
      type: DirectoryOrCreate
    name: etc-pki
  - hostPath:
      path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec
      type: DirectoryOrCreate
    name: flexvolume-dir
  - hostPath:
      path: /etc/kubernetes/pki
      type: DirectoryOrCreate
    name: k8s-certs
  - hostPath:
      path: /etc/kubernetes/controller-manager.conf
      type: FileOrCreate
    name: kubeconfig
  - hostPath:
      path: /usr/local/share/ca-certificates
      type: DirectoryOrCreate
    name: usr-local-share-ca-certificates
  - hostPath:
      path: /usr/share/ca-certificates
      type: DirectoryOrCreate
    name: usr-share-ca-certificates
status: {}
kube-scheduler.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-scheduler
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=127.0.0.1
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
    image: k8s.gcr.io/kube-scheduler:v1.17.2
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 15
      timeoutSeconds: 15
    name: kube-scheduler
    resources:
      requests:
        cpu: 100m
    volumeMounts:
    - mountPath: /etc/kubernetes/scheduler.conf
      name: kubeconfig
      readOnly: true
  hostNetwork: true
  priorityClassName: system-cluster-critical
  volumes:
  - hostPath:
      path: /etc/kubernetes/scheduler.conf
      type: FileOrCreate
    name: kubeconfig
status: {}
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 6
- Comments: 51 (21 by maintainers)
Experiencing this as well after a power loss.
Why is this issue closed? No concrete solution has been found yet. I am facing the same issue today.
Same issue with CentOS 7.7.1908: etcd and the apiserver restart many times, with logs like the ones above. After my temporary fix I found the cluster really had been wiped clean; only the static pods were left.
Tip: take a backup while the cluster is healthy!
Is this what you want?
I ended up setting up a scheduled job as a workaround in the meantime, with something like this to take a backup every 6 hours and keep 30 days of snapshots:
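The original snippet is not included above; the following is only a rough sketch of such a job, assuming etcdctl is installed on the host and using the endpoint and certificate paths from the etcd.yaml manifest above (the script path and backup directory are arbitrary choices, not from the original):
#!/bin/bash
# /usr/local/bin/etcd-backup.sh -- sketch of a periodic etcd backup job
set -euo pipefail

BACKUP_DIR=/var/backups/etcd
mkdir -p "$BACKUP_DIR"

# Snapshot the running etcd member via its local client endpoint,
# using the certs generated by kubeadm (see etcd.yaml above).
ETCDCTL_API=3 etcdctl snapshot save "$BACKUP_DIR/snapshot-$(date +%F-%H%M).db" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Keep roughly 30 days of snapshots.
find "$BACKUP_DIR" -name 'snapshot-*.db' -mtime +30 -delete

with a cron entry along the lines of:
# /etc/cron.d/etcd-backup -- run every 6 hours as root
0 */6 * * * root /usr/local/bin/etcd-backup.sh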
Any updates on this? I have the same issue with a single-master kubernetes v1.19.0-rc.3 and etcd v3.4.9-1. I’ve tried renaming /var/lib/etcd/member to /var/lib/etcd/member.bak and then issuing sudo systemctl restart kubelet, but after those steps I get only one service running in the cluster.
If I understand correctly, the issue is due to a broken etcd database. Are there any best practices for periodically backing up the etcd database during cluster operation? Or is this type of issue expected when a cluster is shut down uncleanly? What is the best way to avoid it?
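For what it's worth, given a snapshot like the one from the backup job above, recovery on a single-node kubeadm cluster is roughly the following sketch (the member name and peer URL are taken from the etcd.yaml manifest above; the snapshot filename is a placeholder):
# Stop the kubelet so the static pods are not restarted mid-restore.
sudo systemctl stop kubelet
docker ps -q --filter name=k8s_etcd | xargs -r docker stop

# Move the (possibly corrupted) data directory aside.
sudo mv /var/lib/etcd /var/lib/etcd.broken

# Rebuild the data directory from the snapshot.
sudo ETCDCTL_API=3 etcdctl snapshot restore /var/backups/etcd/<snapshot-file>.db \
  --name=ubuntu \
  --initial-cluster=ubuntu=https://192.168.195.130:2380 \
  --initial-advertise-peer-urls=https://192.168.195.130:2380 \
  --data-dir=/var/lib/etcd

sudo systemctl start kubelet
Whether a given failure is recoverable without a snapshot depends on how badly the WAL/snap files were damaged, which is why the "back up while the cluster is healthy" advice above matters.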