rook: Liveness probe failed: no valid command found; 10 closest matches:
Is this a bug report or feature request?
- Bug Report
Deviation from expected behavior: Liveness probe seems to be attempting to execute something that doesn’t exist
Expected behavior: Liveness probe execution completes
How to reproduce it (minimal and precise):
Possibly related: put enough disk pressure on the nodes running the etcd containers to cause the HA cluster to round-robin etcd responsibilities between control-plane nodes. Alternatively, the node running the osd pod may simply have been under resource pressure (CPU, at least).
File(s) to submit:
- Cluster CR (custom resource), typically called cluster.yaml, if necessary
- Operator's logs, if necessary
- Crashing pod(s) logs, if necessary
Events from:
k -n rook-ceph describe pod rook-ceph-osd-0-6b48dcbdc5-hp2pr
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 14m kubelet Liveness probe failed: no valid command found; 10 closest matches:
0
1
2
abort
assert
bluefs debug_inject_read_zeros
bluefs files list
bluefs stats
bluestore allocator dump block
bluestore allocator fragmentation block
admin_socket: invalid command
Normal Pulled 11m (x3 over 15m) kubelet Container image "quay.io/ceph/ceph:v16.2.6" already present on machine
Normal Created 11m (x3 over 15m) kubelet Created container osd
Normal Started 11m (x3 over 15m) kubelet Started container osd
Normal Killing 10m (x4 over 15m) kubelet Container osd failed liveness probe, will be restarted
Warning Unhealthy 9m48s (x21 over 2d8h) kubelet Liveness probe failed:
Warning BackOff 25s (x28 over 6m57s) kubelet Back-off restarting failed container
Oddly, when I attempted to run the liveness probe myself, I did not get that particular error. Did I run a different liveness probe?
k -n rook-ceph exec rook-ceph-osd-0-6b48dcbdc5-hp2pr --container osd -it -- env -i sh -c ceph --admin-daemon /ron/ceph/ceph-osd.0.asok status
unable to get monitor info from DNS SRV with service name: ceph-mon
2021-12-01T08:12:56.464+0000 ffff7d3fa1e0 -1 failed for service _ceph-mon._tcp
2021-12-01T08:12:56.464+0000 ffff7d3fa1e0 -1 monclient: get_monmap_and_config cannot identify monitors to contact
[errno 2] RADOS object not found (error connecting to the cluster)
command terminated with exit code 1
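For comparison, the probe command the kubelet actually runs can be read straight from the pod spec (pod name taken from above; the jsonpath filter should work on any recent kubectl):
k -n rook-ceph get pod rook-ceph-osd-0-6b48dcbdc5-hp2pr -o jsonpath='{.spec.containers[?(@.name=="osd")].livenessProbe.exec.command}'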
I eventually removed the pods that were hammering the node and deleted this pod so Kubernetes would re-create it, since it was in "back-off" mode. The re-created pod came up without any issues, but I thought I'd at least report this.
Re-running the liveness probe on the newly created pod still gives this, though:
k -n rook-ceph exec rook-ceph-osd-0-6b48dcbdc5-lb9qm --container osd -it -- env -i sh -c ceph --admin-daemon /ron/ceph/ceph-osd.0.asok status
unable to get monitor info from DNS SRV with service name: ceph-mon
2021-12-01T08:51:06.503+0000 ffff9193e1e0 -1 failed for service _ceph-mon._tcp
2021-12-01T08:51:06.507+0000 ffff9193e1e0 -1 monclient: get_monmap_and_config cannot identify monitors to contact
[errno 2] RADOS object not found (error connecting to the cluster)
command terminated with exit code 1
Which is odd, since ceph status shows that everything is fine:
cluster:
id: 3d957ef3-6713-4032-9aa3-88958dd2cb5f
health: HEALTH_OK
services:
mon: 3 daemons, quorum p,q,r (age 33m)
mgr: a(active, since 2d)
mds: 1/1 daemons up, 1 hot standby
osd: 6 osds: 6 up (since 30m), 6 in (since 7d)
data:
volumes: 1/1 healthy
pools: 4 pools, 193 pgs
objects: 217.31k objects, 836 GiB
usage: 2.4 TiB used, 3.0 TiB / 5.5 TiB avail
pgs: 193 active+clean
io:
client: 1.2 KiB/s rd, 40 KiB/s wr, 2 op/s rd, 5 op/s wr
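One possible explanation for the manual exec failing while the cluster is healthy: without quoting, sh -c takes only ceph as its command string and drops the remaining words, so ceph runs with no --admin-daemon argument and, with the environment wiped by env -i, falls back to a DNS SRV lookup of the monitors, which matches the errors above. A quoted sketch, assuming the admin socket lives under /run/ceph/ as is typical for Rook OSD pods (worth verifying first), would look like:
# list the admin sockets to confirm the path (the /run/ceph location is assumed, not taken from the report)
k -n rook-ceph exec rook-ceph-osd-0-6b48dcbdc5-lb9qm --container osd -- ls /run/ceph
# quote the whole ceph invocation so sh -c receives it as a single command string
k -n rook-ceph exec rook-ceph-osd-0-6b48dcbdc5-lb9qm --container osd -it -- env -i sh -c 'ceph --admin-daemon /run/ceph/ceph-osd.0.asok status'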
Environment:
- OS (e.g. from /etc/os-release): Ubuntu 20.04.3 LTS
- Kernel (e.g. uname -a): Linux k8s-node-4 5.4.0-1046-raspi #50-Ubuntu SMP PREEMPT Thu Oct 28 05:32:10 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux
- Cloud provider or hardware configuration: Raspberry Pi 4 8GB for the node that was running this osd pod
- Rook version (use rook version inside of a Rook Pod):
  rook: v1.7.8
  go: go1.16.7
- Storage backend version (e.g. for ceph do ceph -v): ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
- Kubernetes version (use kubectl version):
  Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b695d79d4f967c403a96986f1750a35eb75e75f1", GitTreeState:"clean", BuildDate:"2021-11-17T15:41:42Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"darwin/arm64"}
  Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b695d79d4f967c403a96986f1750a35eb75e75f1", GitTreeState:"clean", BuildDate:"2021-11-17T15:42:41Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/arm64"}
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): kubeadm
- Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): HEALTH_OK
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 25 (8 by maintainers)
Ah, is that what this is? I was wondering why it was only some OSDs when I started to notice it. They eventually stopped restarting on my end.
Saw this on a prod cluster today. It seems the upgrade from Nautilus -> Octopus causes the OSD to fsck for a while, long enough that the liveness probe kicks in and shoots it before it's done recovering. Then it restarts and tries all over again for a while, eventually reaching a crash loop and never recovering. It may be a good idea to avoid liveness checks during upgrades.
@travisn Yes. Thanks for the quick response. Something was causing the OSD to take a long time to start up. I tried lengthening the time the startup probe allowed, but I didn't make it long enough. After disabling the probe, the cluster went back to normal, and the OSD seems to be starting up quickly again now.
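For reference, probe timing can be relaxed on the CephCluster CR rather than on the pod itself, assuming a Rook release whose spec.healthCheck accepts probe overrides; the cluster name rook-ceph and the values below are illustrative only:
# merge-patch the CephCluster CR; adjust the numbers to the observed OSD startup time
k -n rook-ceph patch cephcluster rook-ceph --type merge -p \
  '{"spec":{"healthCheck":{"livenessProbe":{"osd":{"probe":{"initialDelaySeconds":600,"timeoutSeconds":15}}}}}}'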
The error about the command not being found is certainly unexpected. You didn’t modify the liveness probe on the pod, right?
If you are seeing issues with the liveness probes, they can be disabled as described here, though that shouldn’t generally be necessary.
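For anyone who needs it, a minimal sketch of that disable switch, assuming the CephCluster object is named rook-ceph in the rook-ceph namespace (the same setting can be placed in cluster.yaml under spec.healthCheck):
# disable the OSD liveness probe; set disabled back to false once the OSDs are stable again
k -n rook-ceph patch cephcluster rook-ceph --type merge -p \
  '{"spec":{"healthCheck":{"livenessProbe":{"osd":{"disabled":true}}}}}'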