rook: Overriding ceph cluster livenessProbe should not require all fields

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior: Rook operator ignores livenessProbe cluster settings.

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
...
  healthCheck:
    livenessProbe:
      mon:
        timeoutSeconds: 5
        failureThreshold: 5
      mgr:
        timeoutSeconds: 5
        failureThreshold: 5
      osd:
        timeoutSeconds: 5
        failureThreshold: 5
2021-01-14 20:34:33.331006 I | ceph-cluster-controller: CR has changed for "rook-ceph". diff=  v1.ClusterSpec{
  	... // 19 identical fields
  	RemoveOSDsIfOutAndSafeToRemove: false,
  	CleanupPolicy:                  {},
  	HealthCheck: v1.CephClusterHealthCheckSpec{
  		DaemonHealth:  {Status: {Interval: "60s"}, Monitor: {Interval: "45s"}, ObjectStorageDaemon: {Interval: "60s"}},
- 		LivenessProbe: nil,
+ 		LivenessProbe: map[v1.KeyType]*v1.ProbeSpec{"mgr": &{}, "mon": &{}, "osd": &{}},
  	},
  	Security:     {},
  	LogCollector: {},
  }

Expected behavior: I expect the Ceph cluster to respect my livenessProbe overrides and propagate them to the pods…

How to reproduce it (minimal and precise):

Simply add the cluster healthCheck settings shown above. Rook is deployed via the official Helm chart.

File(s) to submit:

  • Cluster CR (custom resource), typically called cluster.yaml, if necessary
  • Operator’s logs, if necessary
  • Crashing pod(s) logs, if necessary

To get logs, use kubectl -n <namespace> logs <pod name>. When pasting logs, always surround them with backticks or use the insert code button from the GitHub UI. Read the GitHub documentation if you need help.

Environment:

  • OS (e.g. from /etc/os-release): Ubuntu 20.04.1 LTS
  • Kernel (e.g. uname -a): 5.4.0-1036-azure
  • Cloud provider or hardware configuration: Hyper-V gen 2 vm
  • Rook version (use rook version inside of a Rook Pod): v1.5.4 and v1.5.5
  • Storage backend version (e.g. for ceph do ceph -v): v15.2.8
  • Kubernetes version (use kubectl version): v1.20.0+k3s2
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): k3s (1 master + 3 worker)
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): HEALTH_WARN: 2 daemons have recently crashed

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 20 (11 by maintainers)

Most upvoted comments

@2fst4u there was a problem with the syntax: the probe key was missing. I tried the following and it works as expected:

healthCheck:
    # Change pod liveness probe; it works for all mon, mgr, and osd daemons
    livenessProbe:
      mon:
        probe:
          initialDelaySeconds: 8
          timeoutSeconds: 3
          periodSeconds: 0
          successThreshold: 0
          failureThreshold: 5
      mgr:
        disabled: false
      osd:
        disabled: false

Liveness:       exec [env -i sh -c ceph --admin-daemon /run/ceph/ceph-mon.a.asok mon_status] delay=8s timeout=3s period=10s #success=1 #failure=5
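
For context, here is a hedged Go sketch of the relevant CR types (field names approximate; see Rook's ceph.rook.io/v1 API for the authoritative definitions). The probe overrides live one level deeper than the original YAML assumed: fields placed directly under mon/mgr/osd do not map to anything, which is why the operator diff above shows empty ProbeSpec values ("&{}").

// Sketch only, not Rook's actual source; field names are approximations
// grounded in the working YAML above (disabled:, probe:).
package v1sketch

import corev1 "k8s.io/api/core/v1"

// KeyType selects the daemon ("mon", "mgr", "osd").
type KeyType string

// ProbeSpec is the per-daemon liveness probe override.
type ProbeSpec struct {
	// disabled: true turns the daemon's liveness probe off entirely.
	Disabled bool `json:"disabled,omitempty"`
	// probe: holds the standard Kubernetes probe fields (timeoutSeconds,
	// failureThreshold, ...) that override Rook's generated probe.
	Probe *corev1.Probe `json:"probe,omitempty"`
}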

I’m thinking the liveness probe spec in the CR wouldn’t change. After Rook deserializes the Probe struct, if any fields are empty or 0, Rook could set its defaults on those individual fields.

It would be nice to allow overriding the seconds and thresholds only.

Right, that should be very doable. @leseb?
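
For illustration, a minimal Go sketch of the per-field defaulting suggested above. The helper name applyProbeDefaults is hypothetical and this is not Rook's actual implementation: zero-valued fields in the CR's probe fall back to Rook's generated defaults, so a partial override (e.g. only timeoutSeconds and failureThreshold) is preserved.

package probesketch

import corev1 "k8s.io/api/core/v1"

// applyProbeDefaults is a hypothetical helper: "generated" is the probe Rook
// would normally build for the daemon (exec command, default timings) and
// "override" comes from the CephCluster CR. Only non-zero numeric fields in
// the override replace the generated values, so users can set just the
// seconds and thresholds they care about.
func applyProbeDefaults(generated, override *corev1.Probe) *corev1.Probe {
	if override == nil {
		return generated
	}
	out := generated.DeepCopy() // keep Rook's exec handler untouched
	if override.InitialDelaySeconds != 0 {
		out.InitialDelaySeconds = override.InitialDelaySeconds
	}
	if override.TimeoutSeconds != 0 {
		out.TimeoutSeconds = override.TimeoutSeconds
	}
	if override.PeriodSeconds != 0 {
		out.PeriodSeconds = override.PeriodSeconds
	}
	if override.SuccessThreshold != 0 {
		out.SuccessThreshold = override.SuccessThreshold
	}
	if override.FailureThreshold != 0 {
		out.FailureThreshold = override.FailureThreshold
	}
	return out
}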