rook: Liveness probe failed: no valid command found; 10 closest matches:

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior: Liveness probe seems to be attempting to execute something that doesn’t exist

Expected behavior: Liveness probe execution completes

How to reproduce it (minimal and precise):

Possibly related: put enough disk pressure on the nodes running the etcd containers to make the HA cluster round-robin etcd responsibilities between the control-plane nodes. Alternatively, the node running the osd pod may simply have been under resource pressure (CPU at least).

File(s) to submit:

  • Cluster CR (custom resource), typically called cluster.yaml, if necessary
  • Operator’s logs, if necessary
  • Crashing pod(s) logs, if necessary

Relevant events from k -n rook-ceph describe pod rook-ceph-osd-0-6b48dcbdc5-hp2pr:
Events:
  Type     Reason     Age   From     Message
  ----     ------     ----  ----     -------
  Warning  Unhealthy  14m   kubelet  Liveness probe failed: no valid command found; 10 closest matches:
0
1
2
abort
assert
bluefs debug_inject_read_zeros
bluefs files list
bluefs stats
bluestore allocator dump block
bluestore allocator fragmentation block
admin_socket: invalid command
  Normal   Pulled     11m (x3 over 15m)      kubelet  Container image "quay.io/ceph/ceph:v16.2.6" already present on machine
  Normal   Created    11m (x3 over 15m)      kubelet  Created container osd
  Normal   Started    11m (x3 over 15m)      kubelet  Started container osd
  Normal   Killing    10m (x4 over 15m)      kubelet  Container osd failed liveness probe, will be restarted
  Warning  Unhealthy  9m48s (x21 over 2d8h)  kubelet  Liveness probe failed:
  Warning  BackOff    25s (x28 over 6m57s)   kubelet  Back-off restarting failed container

Oddly, when I attempted to run the liveness probe myself, I did not get that particular error. Did I run a different liveness probe?

k -n rook-ceph exec rook-ceph-osd-0-6b48dcbdc5-hp2pr --container osd -it -- env -i sh -c ceph --admin-daemon /ron/ceph/ceph-osd.0.asok status
unable to get monitor info from DNS SRV with service name: ceph-mon
2021-12-01T08:12:56.464+0000 ffff7d3fa1e0 -1 failed for service _ceph-mon._tcp
2021-12-01T08:12:56.464+0000 ffff7d3fa1e0 -1 monclient: get_monmap_and_config cannot identify monitors to contact
[errno 2] RADOS object not found (error connecting to the cluster)
command terminated with exit code 1
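
As a sanity check, the exact command the kubelet executes can be read back from the pod spec (a quick sketch; the container name osd and the pod name are taken from the events above):

# Dump the liveness probe command configured on the osd container
kubectl -n rook-ceph get pod rook-ceph-osd-0-6b48dcbdc5-hp2pr \
  -o jsonpath='{.spec.containers[?(@.name=="osd")].livenessProbe.exec.command}'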

I eventually removed the pods that were hammering the node and deleted this pod so that k8s would re-create it, since it was stuck in back-off. The re-created pod came up without any issues, but I thought I'd at least report this.

Re-running the liveness probe on the newly created pod still gives this, though:

k -n rook-ceph exec rook-ceph-osd-0-6b48dcbdc5-lb9qm --container osd -it -- env -i sh -c ceph --admin-daemon /ron/ceph/ceph-osd.0.asok status
unable to get monitor info from DNS SRV with service name: ceph-mon
2021-12-01T08:51:06.503+0000 ffff9193e1e0 -1 failed for service _ceph-mon._tcp
2021-12-01T08:51:06.507+0000 ffff9193e1e0 -1 monclient: get_monmap_and_config cannot identify monitors to contact
[errno 2] RADOS object not found (error connecting to the cluster)
command terminated with exit code 1
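
One likely explanation for the monitor-lookup errors above, independent of Rook itself: sh -c treats only its first argument as the command string, so in the unquoted exec above only ceph actually runs (the --admin-daemon arguments become ignored positional parameters), and a bare ceph then tries to find the monitors via DNS SRV. Quoting the whole command keeps it intact; a sketch (the socket path is written as /run/ceph/ on the assumption that /ron/ceph/ above is a typo; adjust it to whatever the configured probe uses):

# sh -c takes a single command string; unquoted, only the first word is executed:
#   sh -c echo hello world     -> runs echo with no arguments
#   sh -c 'echo hello world'   -> runs the full command
kubectl -n rook-ceph exec rook-ceph-osd-0-6b48dcbdc5-lb9qm --container osd -it -- \
  env -i sh -c 'ceph --admin-daemon /run/ceph/ceph-osd.0.asok status'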

This is odd, since ceph status shows that everything is fine:

  cluster:
    id:     3d957ef3-6713-4032-9aa3-88958dd2cb5f
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum p,q,r (age 33m)
    mgr: a(active, since 2d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up (since 30m), 6 in (since 7d)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 193 pgs
    objects: 217.31k objects, 836 GiB
    usage:   2.4 TiB used, 3.0 TiB / 5.5 TiB avail
    pgs:     193 active+clean

  io:
    client:   1.2 KiB/s rd, 40 KiB/s wr, 2 op/s rd, 5 op/s wr

Environment:

  • OS (e.g. from /etc/os-release): Ubuntu 20.04.3 LTS
  • Kernel (e.g. uname -a): Linux k8s-node-4 5.4.0-1046-raspi #50-Ubuntu SMP PREEMPT Thu Oct 28 05:32:10 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux
  • Cloud provider or hardware configuration: Raspberry PI 4 8GB for the node that was running this osd pod
  • Rook version (use rook version inside of a Rook Pod):
rook: v1.7.8
go: go1.16.7
  • Storage backend version (e.g. for ceph do ceph -v): ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b695d79d4f967c403a96986f1750a35eb75e75f1", GitTreeState:"clean", BuildDate:"2021-11-17T15:41:42Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b695d79d4f967c403a96986f1750a35eb75e75f1", GitTreeState:"clean", BuildDate:"2021-11-17T15:42:41Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/arm64"}
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): kubeadm
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): HEALTH_OK

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 25 (8 by maintainers)

Most upvoted comments

Saw this on a prod cluster today. It seems the upgrade from nautilus -> octopus causes the OSD to fsck for a while, long enough that the liveness probe kicks in and shoots it before it's done recovering. Then it restarts and tries all over again for a while, eventually reaching a crash loop and never recovering. It may be a good idea to avoid liveness checks during upgrades.

Ah, is that what this is? I was wondering why it was only some OSDs when I started to notice it. They eventually stopped restarting on my end:

rook-ceph-osd-4-5bc9b45689-spfxw                    1/1     Running                 1 (10m ago)     12m   10.0.5.35       store-1                <none>           <none>
rook-ceph-osd-5-688d4c587d-pwvqm                    1/1     Running                 3 (6m37s ago)   11m   10.0.5.163      store-1                <none>           <none>
rook-ceph-osd-6-789747d589-ls4pw                    1/1     Running                 1 (9m38s ago)   11m   10.0.5.59       store-1                <none>           <none>
rook-ceph-osd-7-74fd574d67-kkrrn                    1/1     Running                 3 (6m2s ago)    11m   10.0.5.58       store-1                <none>           <none>

@travisn Yes. Thanks for the quick response. Something was causing the OSD to take a long time to start up. I tried lengthening the time the startup probe had, but I didn't make it long enough. After disabling the probe, the cluster went back to normal. The OSD seems to be starting up quickly again now.
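
For anyone else hitting slow OSD startup (e.g. the post-upgrade fsck mentioned above), lengthening the startup window is an alternative to disabling probes outright. A rough sketch, assuming a Rook version whose CephCluster CRD exposes startup probe overrides under spec.healthCheck.startupProbe, and a cluster named rook-ceph:

# Allow roughly 10 minutes of startup time (failureThreshold x periodSeconds)
# before the kubelet treats the osd container as failed
kubectl -n rook-ceph patch cephcluster rook-ceph --type merge \
  -p '{"spec":{"healthCheck":{"startupProbe":{"osd":{"probe":{"failureThreshold":60,"periodSeconds":10}}}}}}'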

The error about the command not being found is certainly unexpected. You didn’t modify the liveness probe on the pod, right?

If you are seeing issues with the liveness probes, they can be disabled as described here, though that shouldn’t generally be necessary.
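
For reference, disabling the OSD liveness probe comes down to one field in the CephCluster CR; a minimal sketch, assuming the spec.healthCheck.livenessProbe.osd.disabled field and a cluster named rook-ceph (set it back to false to re-enable):

# Disable the OSD liveness probe cluster-wide via a merge patch
kubectl -n rook-ceph patch cephcluster rook-ceph --type merge \
  -p '{"spec":{"healthCheck":{"livenessProbe":{"osd":{"disabled":true}}}}}'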