rook: Kubernetes node crash and power-down of nodes left OSDs in an unrecoverable state
Is this a bug report or feature request? Bug Report
Deviation from expected behavior: Rook OSDs are stuck in CrashLoopBackOff and never recover.
Expected behavior: OSDs return to Running once the nodes are back up.
How to reproduce it (minimal and precise):
Unknown. The likely cause is a forced power-down of the Kubernetes cluster.
File(s) to submit:
- Cluster CR (custom resource), typically called cluster.yaml, if necessary
- Operator's logs, if necessary
- Crashing pod(s) logs, if necessary
To get logs, use kubectl -n <namespace> logs <pod name>.
When pasting logs, always surround them with backticks or use the insert-code button in the GitHub UI.
Read the GitHub documentation if you need help.
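The commands below are a sketch of how the relevant logs can be pulled for this report, assuming the default rook-ceph namespace and the pod names shown in the output further down:

```sh
# Operator logs
kubectl -n rook-ceph logs deploy/rook-ceph-operator

# Logs from one of the crashing OSD pods; --previous shows the last crashed instance
kubectl -n rook-ceph logs rook-ceph-osd-1-8f8877d64-qzx74 -c osd --previous
```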
kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-kube-controllers-559d79c8fc-nld7x 1/1 Running 6 10h
kube-system calico-node-2r24j 1/1 Running 6 6d13h
kube-system calico-node-4bszt 1/1 Running 6 6d13h
kube-system calico-node-d4j8m 1/1 Running 6 6d13h
kube-system calico-node-wmlws 1/1 Running 6 6d13h
kube-system coredns-58687784f9-hzrpf 1/1 Running 5 6d13h
kube-system coredns-58687784f9-k2m98 1/1 Running 5 6d13h
kube-system dns-autoscaler-79599df498-lmlx6 1/1 Running 5 6d13h
kube-system kube-apiserver-nuc9034 1/1 Running 6 6d13h
kube-system kube-apiserver-nuc9037 1/1 Running 14 6d13h
kube-system kube-controller-manager-nuc9034 1/1 Running 20 6d13h
kube-system kube-controller-manager-nuc9037 1/1 Running 10 6d13h
kube-system kube-proxy-5hmzl 1/1 Running 5 6d13h
kube-system kube-proxy-9pd9h 1/1 Running 5 6d13h
kube-system kube-proxy-f2sn7 1/1 Running 5 6d13h
kube-system kube-proxy-xnlfm 1/1 Running 5 6d13h
kube-system kube-scheduler-nuc9034 1/1 Running 19 6d13h
kube-system kube-scheduler-nuc9037 1/1 Running 10 6d13h
kube-system metrics-server-57f8f5d4bd-ccnnv 1/1 Running 7 10h
kube-system nginx-proxy-nuc9035 1/1 Running 5 6d13h
kube-system nginx-proxy-nuc9036 1/1 Running 5 6d13h
kube-system nodelocaldns-78jmm 1/1 Running 6 6d13h
kube-system nodelocaldns-fn4wv 1/1 Running 5 6d13h
kube-system nodelocaldns-x9898 1/1 Running 6 6d13h
kube-system nodelocaldns-xztdh 1/1 Running 5 6d13h
kubernetes-dashboard dashboard-metrics-scraper-76585494d8-ccz56 1/1 Running 6 6d13h
kubernetes-dashboard kubernetes-dashboard-5996555fd8-krvwc 1/1 Running 11 6d13h
metallb-system controller-5c9894b5cd-n99gg 1/1 Running 5 10h
metallb-system speaker-27rft 1/1 Running 6 10h
metallb-system speaker-djkq2 1/1 Running 8 10h
metallb-system speaker-jvbtc 1/1 Running 6 10h
metallb-system speaker-v297f 1/1 Running 6 6h34m
monitoring grafana-1587683274-5b647966f7-dnb6v 1/1 Running 5 2d3h
monitoring prometheus-1587682577-alertmanager-8646447659-znz4k 0/2 Error 0 2d3h
monitoring prometheus-1587682577-kube-state-metrics-69fd5fbb6c-wd54b 1/1 Running 9 2d3h
monitoring prometheus-1587682577-node-exporter-7p875 1/1 Running 5 2d3h
monitoring prometheus-1587682577-node-exporter-dzqnb 1/1 Running 5 2d3h
monitoring prometheus-1587682577-node-exporter-hcj7q 1/1 Running 5 2d3h
monitoring prometheus-1587682577-node-exporter-q9bpg 1/1 Running 5 2d3h
monitoring prometheus-1587682577-pushgateway-64bfddc9cb-8c8lk 1/1 Running 5 10h
monitoring prometheus-1587682577-server-b99b579d4-628x4 0/2 ContainerCreating 0 10h
rook-ceph rook-ceph-crashcollector-nuc9034-59d66db5fc-zb7wb 1/1 Running 5 3d1h
rook-ceph rook-ceph-crashcollector-nuc9035-75c4dcb44-ktfrj 1/1 Running 4 10h
rook-ceph rook-ceph-crashcollector-nuc9036-6865445c45-brg4j 1/1 Running 5 3d1h
rook-ceph rook-ceph-crashcollector-nuc9037-78ddc7455d-pxnvw 1/1 Running 5 3d1h
rook-ceph rook-ceph-mgr-a-9c96fc695-h6lnq 0/1 Init:1/3 3 3d1h
rook-ceph rook-ceph-mon-a-7d5fbf7fc8-5w6c6 1/1 Running 5 3d1h
rook-ceph rook-ceph-mon-b-84f7f75666-mpb62 1/1 Running 5 3d1h
rook-ceph rook-ceph-mon-d-698d4c7b5-9fcd8 1/1 Running 5 10h
rook-ceph rook-ceph-operator-665fff7c74-6jgnd 0/1 CrashLoopBackOff 54 3h29m
rook-ceph rook-ceph-osd-0-6bb9f7fb4-mrtm7 0/1 CrashLoopBackOff 140 3d1h
rook-ceph rook-ceph-osd-1-8f8877d64-qzx74 0/1 CrashLoopBackOff 138 3d1h
rook-ceph rook-ceph-osd-2-6c57c5ccf9-gxzgq 0/1 CrashLoopBackOff 137 10h
rook-ceph rook-ceph-osd-3-799fcffc75-stmxj 0/1 CrashLoopBackOff 138 3d1h
rook-ceph rook-ceph-osd-prepare-nuc9034-jkljm 0/1 Completed 0 11h
rook-ceph rook-ceph-osd-prepare-nuc9036-bjhpt 0/1 Completed 0 11h
rook-ceph rook-ceph-osd-prepare-nuc9037-727rd 0/1 Completed 0 11h
rook-ceph rook-ceph-rgw-my-store-a-85bb598bbd-vrflv 0/1 CrashLoopBackOff 122 3d1h
kubectl describe pod -n rook-ceph rook-ceph-osd-1-8f8877d64-qzx74
Name: rook-ceph-osd-1-8f8877d64-qzx74
Namespace: rook-ceph
Priority: 0
Node: nuc9034/192.168.1.200
Start Time: Wed, 22 Apr 2020 21:17:45 -0400
Labels: app=rook-ceph-osd
ceph-osd-id=1
failure-domain=nuc9034
pod-template-hash=8f8877d64
portable=false
rook_cluster=rook-ceph
Annotations: <none>
Status: Running
IP: 10.233.110.115
IPs:
IP: 10.233.110.115
Controlled By: ReplicaSet/rook-ceph-osd-1-8f8877d64
Init Containers:
activate:
Container ID: docker://6ea02e93b880d5fe056823f131ee8e902c852fb873e138cbe659cadb71402fe6
Image: ceph/ceph:v14.2.9
Image ID: docker-pullable://ceph/ceph@sha256:e633820ab8372a967a70ed72675a2e489536eb9204994d6aecf92a94b260d2ee
Port: <none>
Host Port: <none>
Command:
/bin/bash
-c
set -ex
OSD_ID=1
OSD_UUID=9bd30af3-b5e8-4143-9368-d0f065818170
OSD_STORE_FLAG="--bluestore"
OSD_DATA_DIR=/var/lib/ceph/osd/ceph-"$OSD_ID"
CV_MODE=lvm
DEVICE=
METADATA_DEVICE="$ROOK_METADATA_DEVICE"
# active the osd with ceph-volume
if [[ "$CV_MODE" == "lvm" ]]; then
TMP_DIR=$(mktemp -d)
# activate osd
ceph-volume "$CV_MODE" activate --no-systemd "$OSD_STORE_FLAG" "$OSD_ID" "$OSD_UUID"
# copy the tmpfs directory to a temporary directory
# this is needed because when the init container exits, the tmpfs goes away and its content with it
# this will result in the emptydir to be empty when accessed by the main osd container
cp --verbose --no-dereference "$OSD_DATA_DIR"/* "$TMP_DIR"/
# unmount the tmpfs since we don't need it anymore
umount "$OSD_DATA_DIR"
# copy back the content of the tmpfs into the original osd directory
cp --verbose --no-dereference "$TMP_DIR"/* "$OSD_DATA_DIR"
# retain ownership of files to the ceph user/group
chown --verbose --recursive ceph:ceph "$OSD_DATA_DIR"
# remove the temporary directory
rm --recursive --force "$TMP_DIR"
else
ARGS=(--device ${DEVICE} --no-systemd --no-tmpfs)
if [ -n "$METADATA_DEVICE" ]; then
ARGS+=(--block.db ${METADATA_DEVICE})
fi
# ceph-volume raw mode only supports bluestore so we don't need to pass a store flag
ceph-volume "$CV_MODE" activate "${ARGS[@]}"
fi
State: Terminated
Reason: Completed
Exit Code: 0
Started: Sat, 25 Apr 2020 22:11:26 -0400
Finished: Sat, 25 Apr 2020 22:11:31 -0400
Ready: True
Restart Count: 0
Environment:
CEPH_VOLUME_DEBUG: 1
CEPH_VOLUME_SKIP_RESTORECON: 1
DM_DISABLE_UDEV: 1
ROOK_CEPH_MON_HOST: <set to the key 'mon_host' in secret 'rook-ceph-config'> Optional: false
CEPH_ARGS: -m $(ROOK_CEPH_MON_HOST)
Mounts:
/dev from devices (rw)
/etc/ceph from rook-config-override (ro)
/var/lib/ceph/osd/ceph-1 from activate-osd (rw)
/var/run/secrets/kubernetes.io/serviceaccount from rook-ceph-osd-token-6bgrb (ro)
chown-container-data-dir:
Container ID: docker://8131b2ae208b744980645321bc2385c6639db9e60f903f2c4439ce9b51e00a38
Image: ceph/ceph:v14.2.9
Image ID: docker-pullable://ceph/ceph@sha256:e633820ab8372a967a70ed72675a2e489536eb9204994d6aecf92a94b260d2ee
Port: <none>
Host Port: <none>
Command:
chown
Args:
--verbose
--recursive
ceph:ceph
/var/log/ceph
/var/lib/ceph/crash
State: Terminated
Reason: Completed
Exit Code: 0
Started: Sat, 25 Apr 2020 22:11:32 -0400
Finished: Sat, 25 Apr 2020 22:11:32 -0400
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/dev from devices (rw)
/etc/ceph from rook-config-override (ro)
/run/udev from run-udev (rw)
/var/lib/ceph/crash from rook-ceph-crash (rw)
/var/lib/ceph/osd/ceph-1 from activate-osd (rw)
/var/lib/rook from rook-data (rw)
/var/log/ceph from rook-ceph-log (rw)
/var/run/secrets/kubernetes.io/serviceaccount from rook-ceph-osd-token-6bgrb (ro)
Containers:
osd:
Container ID: docker://a90bb3c7e90389e2191038d8083254bd459fa57eb5c373e28b83f3ff1a8b37df
Image: ceph/ceph:v14.2.9
Image ID: docker-pullable://ceph/ceph@sha256:e633820ab8372a967a70ed72675a2e489536eb9204994d6aecf92a94b260d2ee
Port: <none>
Host Port: <none>
Command:
ceph-osd
Args:
--foreground
--id
1
--fsid
38a10ef8-3e5a-4dde-9602-7c82d57f9d6e
--setuser
ceph
--setgroup
ceph
--crush-location=root=default host=nuc9034
--default-log-to-file
false
--ms-learn-addr-from-peer=false
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 143
Started: Sat, 25 Apr 2020 22:30:07 -0400
Finished: Sat, 25 Apr 2020 22:30:37 -0400
Ready: False
Restart Count: 138
Liveness: exec [env -i sh -c ceph --admin-daemon /run/ceph/ceph-osd.1.asok status] delay=10s timeout=1s period=10s #success=1 #failure=3
Environment:
ROOK_NODE_NAME: nuc9034
ROOK_CLUSTER_ID: 48efa415-73d9-4ddf-8259-c1ec4d088bd1
ROOK_PRIVATE_IP: (v1:status.podIP)
ROOK_PUBLIC_IP: (v1:status.podIP)
ROOK_CLUSTER_NAME: rook-ceph
ROOK_MON_ENDPOINTS: <set to the key 'data' of config map 'rook-ceph-mon-endpoints'> Optional: false
ROOK_MON_SECRET: <set to the key 'mon-secret' in secret 'rook-ceph-mon'> Optional: false
ROOK_ADMIN_SECRET: <set to the key 'admin-secret' in secret 'rook-ceph-mon'> Optional: false
ROOK_CONFIG_DIR: /var/lib/rook
ROOK_CEPH_CONFIG_OVERRIDE: /etc/rook/config/override.conf
ROOK_FSID: <set to the key 'fsid' in secret 'rook-ceph-mon'> Optional: false
NODE_NAME: (v1:spec.nodeName)
ROOK_CRUSHMAP_HOSTNAME: nuc9034
CEPH_VOLUME_DEBUG: 1
CEPH_VOLUME_SKIP_RESTORECON: 1
DM_DISABLE_UDEV: 1
TINI_SUBREAPER:
CONTAINER_IMAGE: ceph/ceph:v14.2.9
POD_NAME: rook-ceph-osd-1-8f8877d64-qzx74 (v1:metadata.name)
POD_NAMESPACE: rook-ceph (v1:metadata.namespace)
NODE_NAME: (v1:spec.nodeName)
POD_MEMORY_LIMIT: node allocatable (limits.memory)
POD_MEMORY_REQUEST: 0 (requests.memory)
POD_CPU_LIMIT: node allocatable (limits.cpu)
POD_CPU_REQUEST: 0 (requests.cpu)
ROOK_OSD_UUID: 9bd30af3-b5e8-4143-9368-d0f065818170
ROOK_OSD_ID: 1
ROOK_OSD_STORE_TYPE: bluestore
ROOK_CEPH_MON_HOST: <set to the key 'mon_host' in secret 'rook-ceph-config'> Optional: false
CEPH_ARGS: -m $(ROOK_CEPH_MON_HOST)
Mounts:
/dev from devices (rw)
/etc/ceph from rook-config-override (ro)
/run/udev from run-udev (rw)
/var/lib/ceph/crash from rook-ceph-crash (rw)
/var/lib/ceph/osd/ceph-1 from activate-osd (rw)
/var/lib/rook from rook-data (rw)
/var/log/ceph from rook-ceph-log (rw)
/var/run/secrets/kubernetes.io/serviceaccount from rook-ceph-osd-token-6bgrb (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
rook-data:
Type: HostPath (bare host directory volume)
Path: /var/lib/rook
HostPathType:
rook-config-override:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: rook-config-override
Optional: false
rook-ceph-log:
Type: HostPath (bare host directory volume)
Path: /var/lib/rook/rook-ceph/log
HostPathType:
rook-ceph-crash:
Type: HostPath (bare host directory volume)
Path: /var/lib/rook/rook-ceph/crash
HostPathType:
devices:
Type: HostPath (bare host directory volume)
Path: /dev
HostPathType:
run-udev:
Type: HostPath (bare host directory volume)
Path: /run/udev
HostPathType:
activate-osd:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
rook-ceph-osd-token-6bgrb:
Type: Secret (a volume populated by a Secret)
SecretName: rook-ceph-osd-token-6bgrb
Optional: false
QoS Class: BestEffort
Node-Selectors: kubernetes.io/hostname=nuc9034
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 73m (x132 over 3h17m) kubelet, nuc9034 Liveness probe failed: admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
Normal Pulled 43m (x53 over 3h18m) kubelet, nuc9034 Container image "ceph/ceph:v14.2.9" already present on machine
Warning BackOff 28m (x672 over 3h16m) kubelet, nuc9034 Back-off restarting failed container
Warning FailedCreatePodSandBox 24m kubelet, nuc9034 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "ac4ae61c2e438d3dc815c62856b1a6b858086a31d8e837847cd0d7837b1364fa" network for pod "rook-ceph-osd-1-8f8877d64-qzx74": networkPlugin cni failed to set up pod "rook-ceph-osd-1-8f8877d64-qzx74_rook-ceph" network: Get https://[10.233.0.1]:443/api/v1/namespaces/rook-ceph: dial tcp 10.233.0.1:443: connect: connection refused
Normal SandboxChanged 23m (x2 over 24m) kubelet, nuc9034 Pod sandbox changed, it will be killed and re-created.
Normal Created 23m kubelet, nuc9034 Created container activate
Normal Pulled 23m kubelet, nuc9034 Container image "ceph/ceph:v14.2.9" already present on machine
Normal Started 23m kubelet, nuc9034 Started container activate
Normal Started 23m kubelet, nuc9034 Started container chown-container-data-dir
Normal Created 23m kubelet, nuc9034 Created container chown-container-data-dir
Normal Pulled 23m kubelet, nuc9034 Container image "ceph/ceph:v14.2.9" already present on machine
Normal Started 23m (x2 over 23m) kubelet, nuc9034 Started container osd
Normal Pulled 22m (x3 over 23m) kubelet, nuc9034 Container image "ceph/ceph:v14.2.9" already present on machine
Normal Created 22m (x3 over 23m) kubelet, nuc9034 Created container osd
Warning Unhealthy 22m (x6 over 23m) kubelet, nuc9034 Liveness probe failed: admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
Normal Killing 22m (x2 over 23m) kubelet, nuc9034 Container osd failed liveness probe, will be restarted
Warning BackOff 3m56s (x74 over 22m) kubelet, nuc9034 Back-off restarting failed container
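The events show the osd container being killed repeatedly because the liveness probe cannot reach the OSD admin socket. A rough way to confirm this by hand, assuming the container stays up long enough to exec into, is to re-run the same check the kubelet uses and look for the socket file:

```sh
# Re-run the check from the Liveness line above
kubectl -n rook-ceph exec rook-ceph-osd-1-8f8877d64-qzx74 -c osd -- \
  ceph --admin-daemon /run/ceph/ceph-osd.1.asok status

# See whether the admin socket exists at all inside the container
kubectl -n rook-ceph exec rook-ceph-osd-1-8f8877d64-qzx74 -c osd -- ls -l /run/ceph/
```

If the socket is missing, ceph-osd is dying before it finishes starting, and `kubectl -n rook-ceph logs rook-ceph-osd-1-8f8877d64-qzx74 -c osd --previous` is the place to look for the underlying error.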
Environment:
- OS (e.g. from /etc/os-release):
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
- Kernel (e.g. `uname -a`): Linux nuc9034 3.10.0-1062.el7.x86_64 #1 SMP Wed Aug 7 18:08:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
- Cloud provider or hardware configuration: 4 DC53427HYE NUCs
- Rook version (use `rook version` inside of a Rook Pod):
[root@rook-ceph-tools-7f96779fb9-n7tbk /]# rook version
rook: v1.3.0-beta.0.167.g8727807
go: go1.13.8
- Storage backend version (e.g. for ceph do `ceph -v`): ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable)
- Kubernetes version (use `kubectl version`):
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"clean", BuildDate:"2020-04-16T23:35:15Z", GoVersion:"go1.14.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.8", GitCommit:"ec6eb119b81be488b030e849b9e64fda4caaf33c", GitTreeState:"clean", BuildDate:"2020-03-12T20:52:22Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): DIY via Kubespray using Calico
- Storage backend status (e.g. for Ceph use `ceph health` in the Rook Ceph toolbox):
[root@rook-ceph-tools-7f96779fb9-n7tbk /]# ceph health
Never completes…
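Since ceph health hangs rather than returning an error, the toolbox apparently cannot reach the monitors at all, which fits the CNI/sandbox failure in the events above and the Calico problem mentioned at the end of the thread. A couple of hedged checks (the --connect-timeout option is a standard ceph CLI flag; the ConfigMap key is the one referenced by the OSD pod's ROOK_MON_ENDPOINTS variable):

```sh
# From inside the toolbox pod: fail fast instead of hanging indefinitely
ceph health --connect-timeout 10
ceph -s --connect-timeout 10

# Mon addresses the cluster is configured with
kubectl -n rook-ceph get configmap rook-ceph-mon-endpoints -o jsonpath='{.data.data}'
```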
About this issue
- State: closed
- Created 4 years ago
- Comments: 23 (6 by maintainers)
Thank you; at least it's been identified as a Calico issue. Very much appreciate your help on this.