rook: MGR Liveness Probe fails

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior: the mgr pod gets constantly restarted

Expected behavior: no pod restart

How to reproduce it (minimal and precise): install Rook on an OpenShift cluster by applying common.yaml, operator-openshift.yaml, and cluster.yaml
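
For reference, a minimal sketch of those steps, assuming the manifests come from the Rook release's Ceph example directory and are applied from within it:

# apply the example manifests in order (paths are an assumption; adjust to your checkout)
oc create -f common.yaml
oc create -f operator-openshift.yaml
oc create -f cluster.yaml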

Environment:

  • OS (e.g. from /etc/os-release): CentOS 7.6
  • Kernel (e.g. uname -a): 3.10.0-957.21.3.el7.x86_64
  • Cloud provider or hardware configuration: VMware VMs
  • Rook version (use rook version inside of a Rook Pod): rook: v1.0.0-154.g004f795
  • Kubernetes version (use kubectl version): Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-20T04:49:16Z", GoVersion:"go1.12.6", Compiler:"gc", Platform:"darwin/amd64"} Server Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.0+d4cacc0", GitCommit:"d4cacc0", GitTreeState:"clean", BuildDate:"2019-06-20T16:29:27Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): OpenShift 3.11
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): HEALTH_WARN no active mgr

Additional mgr pod description:

Name:               rook-ceph-mgr-a-6fb587d789-hwwdj
Namespace:          rook-ceph
Priority:           0
PriorityClassName:  <none>
Node:               node-2/10.152.140.15
Start Time:         Fri, 28 Jun 2019 08:08:56 +0200
Labels:             app=rook-ceph-mgr
                    ceph_daemon_id=a
                    instance=a
                    mgr=a
                    pod-template-hash=2961438345
                    rook_cluster=rook-ceph
Annotations:        openshift.io/scc=rook-ceph
Status:             Running
IP:                 10.152.140.15
Controlled By:      ReplicaSet/rook-ceph-mgr-a-6fb587d789
Containers:
  mgr:
    Container ID:  docker://079b57807846354ce0e9235a0ef5964499ff22e309c4a15cf6f190aa937fc843
    Image:         ceph/ceph:v14.2.1-20190430
    Image ID:      docker-pullable://docker.io/ceph/ceph@sha256:0d870d99a67ebc9a38c4855172f16e7f27a1b5d67945f056a88dce3bb99b2a29
    Ports:         6800/TCP, 9283/TCP, 8443/TCP
    Host Ports:    6800/TCP, 9283/TCP, 8443/TCP
    Command:
      ceph-mgr
    Args:
      --fsid=16d88549-ca72-4881-8509-f40b26e82fd4
      --keyring=/etc/ceph/keyring-store/keyring
      --log-to-stderr=true
      --err-to-stderr=true
      --mon-cluster-log-to-stderr=true
      --log-stderr-prefix=debug
      --default-log-to-file=false
      --default-mon-cluster-log-to-file=false
      --mon-host=$(ROOK_CEPH_MON_HOST)
      --mon-initial-members=$(ROOK_CEPH_MON_INITIAL_MEMBERS)
      --id=a
      --foreground
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 28 Jun 2019 08:22:03 +0200
      Finished:     Fri, 28 Jun 2019 08:23:32 +0200
    Ready:          False
    Restart Count:  7
    Limits:
      cpu:     500m
      memory:  1Gi
    Requests:
      cpu:     500m
      memory:  1Gi
    Liveness:  http-get http://:9283/ delay=60s timeout=1s period=10s #success=1 #failure=3
    Environment:
      CONTAINER_IMAGE:                ceph/ceph:v14.2.1-20190430
      POD_NAME:                       rook-ceph-mgr-a-6fb587d789-hwwdj (v1:metadata.name)
      POD_NAMESPACE:                  rook-ceph (v1:metadata.namespace)
      NODE_NAME:                       (v1:spec.nodeName)
      POD_MEMORY_LIMIT:               1073741824 (limits.memory)
      POD_MEMORY_REQUEST:             1073741824 (requests.memory)
      POD_CPU_LIMIT:                  1 (limits.cpu)
      POD_CPU_REQUEST:                1 (requests.cpu)
      ROOK_CEPH_MON_HOST:             <set to the key 'mon_host' in secret 'rook-ceph-config'>             Optional: false
      ROOK_CEPH_MON_INITIAL_MEMBERS:  <set to the key 'mon_initial_members' in secret 'rook-ceph-config'>  Optional: false
      ROOK_OPERATOR_NAMESPACE:        rook-ceph
      ROOK_CEPH_CLUSTER_CRD_VERSION:  v1
      ROOK_VERSION:                   v1.0.0-154.g004f795
      ROOK_CEPH_CLUSTER_CRD_NAME:     rook-ceph
    Mounts:
      /etc/ceph from rook-ceph-config (ro)
      /etc/ceph/keyring-store/ from rook-ceph-mgr-a-keyring (ro)
      /var/lib/ceph/mgr/ceph-a from ceph-daemon-data (rw)
      /var/log/ceph from rook-ceph-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from rook-ceph-mgr-token-8zw5z (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  rook-ceph-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      rook-ceph-config
    Optional:  false
  rook-ceph-mgr-a-keyring:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  rook-ceph-mgr-a-keyring
    Optional:    false
  rook-ceph-log:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/rook/rook-ceph/log
    HostPathType:
  ceph-daemon-data:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  rook-ceph-mgr-token-8zw5z:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  rook-ceph-mgr-token-8zw5z
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  node-role.kubernetes.io/compute=true
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
Events:
  Type     Reason     Age                From                               Message
  ----     ------     ----               ----                               -------
  Normal   Scheduled  18m                default-scheduler                  Successfully assigned rook-ceph/rook-ceph-mgr-a-6fb587d789-hwwdj to node-2
  Normal   Pulled     14m (x4 over 18m)  kubelet, node-2  Container image "ceph/ceph:v14.2.1-20190430" already present on machine
  Normal   Created    14m (x4 over 18m)  kubelet, node-2  Created container
  Normal   Killing    14m (x3 over 17m)  kubelet, node-2  Killing container with id docker://mgr:Container failed liveness probe.. Container will be killed and recreated.
  Normal   Started    14m (x4 over 18m)  kubelet, node-2  Started container
  Warning  Unhealthy  8m (x19 over 17m)  kubelet, node-2  Liveness probe failed: HTTP probe failed with statuscode: 403
  Warning  BackOff    3m (x17 over 8m)   kubelet, node-2  Back-off restarting failed container

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 32 (12 by maintainers)

Most upvoted comments

I have the same issue. I use rook-ceph (with mostly default settings) on k8s and I had one mgr pod. Since it crashed, it's not able to come up again, and it's not even possible to run "ceph status" on the operator.

In my case it seems to be related to the small default ARP table size in Linux. With a high number of drives per server, and therefore a high number of containers per server, the ARP table can become exhausted, causing liveness probe failures.

After applying the following sysctl changes, the issue no longer occurred.

net.ipv4.neigh.default.gc_thresh1 = 8192
net.ipv4.neigh.default.gc_thresh2 = 32768
net.ipv4.neigh.default.gc_thresh3 = 65536
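
A sketch of how one might apply these on each node (as root); the /etc/sysctl.d/ file name below is just an example:

# apply immediately
sysctl -w net.ipv4.neigh.default.gc_thresh1=8192
sysctl -w net.ipv4.neigh.default.gc_thresh2=32768
sysctl -w net.ipv4.neigh.default.gc_thresh3=65536
# persist across reboots (example file name)
cat <<'EOF' > /etc/sysctl.d/99-neigh-gc-thresh.conf
net.ipv4.neigh.default.gc_thresh1 = 8192
net.ipv4.neigh.default.gc_thresh2 = 32768
net.ipv4.neigh.default.gc_thresh3 = 65536
EOF
sysctl --system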

Do check the ping response/latency to the mgr. You may also want to increase the mgr liveness probe timeout: edit deploy/rook-ceph-mgr-a and set livenessProbe: timeoutSeconds: 9 (a sketch follows below).
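
A non-interactive sketch of that edit, assuming the mgr is the first container in the deployment (as in the pod description above); note that the Rook operator may later reconcile the change away:

# bump the liveness probe timeout on the mgr deployment to 9 seconds
oc -n rook-ceph patch deployment rook-ceph-mgr-a --type=json \
  -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 9}]'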

afaik not

I also have a problem accessing the mgr (after disabling the liveness check) from my browser via oc port-forward:

oc port-forward svc/rook-ceph-mgr 9283
Forwarding from 127.0.0.1:9283 -> 9283
Forwarding from [::1]:9283 -> 9283
Handling connection for 9283

The mgr liveness probe is being updated in #8721, so it should also be independent of the network config and thus be more reliable.

Hope I’m not writing this too soon…

I switched from Weave Net to the Calico CNI, and rook-ceph came up much quicker and appears to be much more stable. Not sure if I just had Weave Net configured incorrectly, or if Calico is simply more stable.

Previously I had noticed mon pods coming up very slowly, the mgr pod failing (as shown above), and other instabilities. So far it has been much more stable.

Here are my mgr logs from a failing pod, which reproduce this bug:

debug 2019-11-09 08:08:11.913 7f0335a80b80  0 set uid:gid to 167:167 (ceph:ceph)
debug 2019-11-09 08:08:11.914 7f0335a80b80  0 ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable), process ceph-mgr, pid 1
debug 2019-11-09 08:08:11.914 7f0335a80b80  0 pidfile_write: ignore empty --pid-file
debug 2019-11-09 08:08:11.996 7f0335a80b80  1 mgr[py] Loading python module 'ansible'
debug 2019-11-09 08:08:12.165 7f0335a80b80  1 mgr[py] Loading python module 'balancer'
debug 2019-11-09 08:08:12.185 7f0335a80b80  1 mgr[py] Loading python module 'crash'
debug 2019-11-09 08:08:12.200 7f0335a80b80  1 mgr[py] Loading python module 'dashboard'
debug 2019-11-09 08:08:12.426 7f0335a80b80  1 mgr[py] Loading python module 'deepsea'
debug 2019-11-09 08:08:12.595 7f0335a80b80  1 mgr[py] Loading python module 'devicehealth'
debug 2019-11-09 08:08:12.612 7f0335a80b80  1 mgr[py] Loading python module 'diskprediction_local'
debug 2019-11-09 08:08:12.629 7f0335a80b80  1 mgr[py] Loading python module 'influx'
debug 2019-11-09 08:08:12.646 7f0335a80b80  1 mgr[py] Loading python module 'insights'
debug 2019-11-09 08:08:12.661 7f0335a80b80  1 mgr[py] Loading python module 'iostat'
debug 2019-11-09 08:08:12.677 7f0335a80b80  1 mgr[py] Loading python module 'localpool'
debug 2019-11-09 08:08:12.692 7f0335a80b80  1 mgr[py] Loading python module 'orchestrator_cli'
debug 2019-11-09 08:08:12.752 7f0335a80b80  1 mgr[py] Loading python module 'pg_autoscaler'
debug 2019-11-09 08:08:12.858 7f0335a80b80  1 mgr[py] Loading python module 'progress'
debug 2019-11-09 08:08:12.918 7f0335a80b80  1 mgr[py] Loading python module 'prometheus'
debug 2019-11-09 08:08:13.133 7f0335a80b80  1 mgr[py] Loading python module 'rbd_support'
debug 2019-11-09 08:08:13.196 7f0335a80b80  1 mgr[py] Loading python module 'restful'
debug 2019-11-09 08:08:13.443 7f0335a80b80  1 mgr[py] Loading python module 'rook'
debug 2019-11-09 08:08:13.920 7f0335a80b80  1 mgr[py] Loading python module 'selftest'
debug 2019-11-09 08:08:13.939 7f0335a80b80  1 mgr[py] Loading python module 'status'
debug 2019-11-09 08:08:13.969 7f0335a80b80  1 mgr[py] Loading python module 'telegraf'
debug 2019-11-09 08:08:13.998 7f0335a80b80  1 mgr[py] Loading python module 'telemetry'
debug 2019-11-09 08:08:14.233 7f0335a80b80  1 mgr[py] Loading python module 'test_orchestrator'
debug 2019-11-09 08:08:14.341 7f0335a80b80  1 mgr[py] Loading python module 'volumes'
debug 2019-11-09 08:08:14.434 7f0335a80b80  1 mgr[py] Loading python module 'zabbix'
debug 2019-11-09 08:08:14.468 7f0321031700  0 ms_deliver_dispatch: unhandled message 0x558a348aaa00 mon_map magic: 0 v1 from mon.1 v2:10.110.206.237:3300/0
debug 2019-11-09 08:08:15.231 7f0321031700  1 mgr handle_mgr_map Activating!
debug 2019-11-09 08:08:15.232 7f0321031700  1 mgr handle_mgr_map I am now activating
debug 2019-11-09 08:08:15.341 7f030f796700  0 ms_deliver_dispatch: unhandled message 0x558a3476e300 mgrreport(osd.61 +62-0 packed 822 daemon_metrics=2) v7 from osd.61 v2:10.32.0.39:6800/104860
debug 2019-11-09 08:08:15.360 7f030f796700  0 ms_deliver_dispatch: unhandled message 0x558a3476e600 mgrreport(osd.128 +62-0 packed 822 daemon_metrics=2) v7 from osd.128 v2:10.32.0.49:6800/176699
debug 2019-11-09 08:08:15.363 7f030f796700  0 ms_deliver_dispatch: unhandled message 0x558a3476e900 mgrreport(osd.128 +62-0 packed 822) v7 from osd.128 v2:10.32.0.49:6800/176699
debug 2019-11-09 08:08:15.389 7f030f796700  0 ms_deliver_dispatch: unhandled message 0x558a3476e300 mgrreport(osd.16 +62-0 packed 822 daemon_metrics=2) v7 from osd.16 v2:10.38.0.9:6800/102052
debug 2019-11-09 08:08:15.511 7f030ff97700  1 mgr load Constructed class from module: balancer
debug 2019-11-09 08:08:15.511 7f030ff97700  1 mgr load Constructed class from module: crash
debug 2019-11-09 08:08:15.514 7f030ff97700  1 mgr load Constructed class from module: dashboard
debug 2019-11-09 08:08:15.515 7f030ff97700  1 mgr load Constructed class from module: devicehealth
debug 2019-11-09 08:08:15.520 7f030ff97700  1 mgr load Constructed class from module: iostat
debug 2019-11-09 08:08:15.522 7f030ff97700  1 mgr load Constructed class from module: orchestrator_cli
debug 2019-11-09 08:08:15.523 7f030ff97700  1 mgr load Constructed class from module: pg_autoscaler
debug 2019-11-09 08:08:15.526 7f030ff97700  1 mgr load Constructed class from module: progress
debug 2019-11-09 08:08:15.528 7f030ff97700  1 mgr load Constructed class from module: prometheus
[09/Nov/2019:08:08:15] ENGINE Bus STARTING
CherryPy Checker:
The Application mounted at '' has an empty config.

[09/Nov/2019:08:08:15] ENGINE Started monitor thread '_TimeoutMonitor'.
debug 2019-11-09 08:08:15.768 7f030ff97700  1 mgr load Constructed class from module: rbd_support
debug 2019-11-09 08:08:15.769 7f030ff97700  1 mgr load Constructed class from module: restful
debug 2019-11-09 08:08:15.770 7f030ff97700  1 mgr load Constructed class from module: rook
debug 2019-11-09 08:08:15.771 7f030ff97700  1 mgr load Constructed class from module: status
debug 2019-11-09 08:08:15.774 7f02fd572700  1 mgr[restful] server not running: no certificate configured
debug 2019-11-09 08:08:15.782 7f030ff97700  1 mgr load Constructed class from module: volumes
[09/Nov/2019:08:08:15] ENGINE Serving on 0.0.0.0:9283
[09/Nov/2019:08:08:15] ENGINE Bus STARTED
debug 2019-11-09 08:08:15.880 7f02fad6d700 -1 client.0 error registering admin socket command: (17) File exists
debug 2019-11-09 08:08:15.880 7f02fad6d700 -1 client.0 error registering admin socket command: (17) File exists
debug 2019-11-09 08:08:15.880 7f02fad6d700 -1 client.0 error registering admin socket command: (17) File exists
debug 2019-11-09 08:08:15.880 7f02fad6d700 -1 client.0 error registering admin socket command: (17) File exists
debug 2019-11-09 08:08:15.880 7f02fad6d700 -1 client.0 error registering admin socket command: (17) File exists
[09/Nov/2019:08:08:16] ENGINE Bus STARTING
[09/Nov/2019:08:08:16] ENGINE Started monitor thread '_TimeoutMonitor'.
[09/Nov/2019:08:08:16] ENGINE Serving on 0.0.0.0:8443
[09/Nov/2019:08:08:16] ENGINE Bus STARTED
[09/Nov/2019:08:08:17] ENGINE Error in HTTPServer.tick
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cherrypy/wsgiserver/wsgiserver2.py", line 1837, in start
    self.tick()
  File "/usr/lib/python2.7/site-packages/cherrypy/wsgiserver/wsgiserver2.py", line 1902, in tick
    s, ssl_env = self.ssl_adapter.wrap(s)
  File "/usr/lib/python2.7/site-packages/cherrypy/wsgiserver/ssl_builtin.py", line 52, in wrap
    keyfile=self.private_key, ssl_version=ssl.PROTOCOL_SSLv23)
  File "/usr/lib64/python2.7/ssl.py", line 934, in wrap_socket
    ciphers=ciphers)
  File "/usr/lib64/python2.7/ssl.py", line 609, in __init__
    self.do_handshake()
  File "/usr/lib64/python2.7/ssl.py", line 831, in do_handshake
    self._sslobj.do_handshake()
SSLError: [SSL: SSLV3_ALERT_CERTIFICATE_UNKNOWN] sslv3 alert certificate unknown (_ssl.c:618)

I think the SSL error can be ignored.

Another observation: the bug does not occur with a small number of OSDs (probably < 10) but becomes prominent with more than 100.