rook: MGR Liveness Probe fails
Is this a bug report or feature request?
- Bug Report
Deviation from expected behavior: the mgr pod gets constantly restarted
Expected behavior: no pod restarts
How to reproduce it (minimal and precise): install Rook on an OpenShift cluster using common.yaml, operator-openshift.yaml and cluster.yaml (see the sketch below)
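A minimal sketch of those steps, assuming the example manifests shipped with the Rook release (the oc commands and apply order are an assumption, not quoted from the report):

```sh
# Apply the Rook Ceph example manifests on an OpenShift cluster
oc create -f common.yaml
oc create -f operator-openshift.yaml
oc create -f cluster.yaml
```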
Environment:
- OS (e.g. from /etc/os-release): CentOS 7.6
- Kernel (e.g. uname -a): 3.10.0-957.21.3.el7.x86_64
- Cloud provider or hardware configuration: VMware VMs
- Rook version (use rook version inside of a Rook Pod): rook: v1.0.0-154.g004f795
- Kubernetes version (use kubectl version): Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-20T04:49:16Z", GoVersion:"go1.12.6", Compiler:"gc", Platform:"darwin/amd64"} Server Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.0+d4cacc0", GitCommit:"d4cacc0", GitTreeState:"clean", BuildDate:"2019-06-20T16:29:27Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): OpenShift 3.11
- Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): HEALTH_WARN no active mgr
Additional mgr pod description:
Name: rook-ceph-mgr-a-6fb587d789-hwwdj
Namespace: rook-ceph
Priority: 0
PriorityClassName: <none>
Node: node-2/10.152.140.15
Start Time: Fri, 28 Jun 2019 08:08:56 +0200
Labels: app=rook-ceph-mgr
ceph_daemon_id=a
instance=a
mgr=a
pod-template-hash=2961438345
rook_cluster=rook-ceph
Annotations: openshift.io/scc=rook-ceph
Status: Running
IP: 10.152.140.15
Controlled By: ReplicaSet/rook-ceph-mgr-a-6fb587d789
Containers:
mgr:
Container ID: docker://079b57807846354ce0e9235a0ef5964499ff22e309c4a15cf6f190aa937fc843
Image: ceph/ceph:v14.2.1-20190430
Image ID: docker-pullable://docker.io/ceph/ceph@sha256:0d870d99a67ebc9a38c4855172f16e7f27a1b5d67945f056a88dce3bb99b2a29
Ports: 6800/TCP, 9283/TCP, 8443/TCP
Host Ports: 6800/TCP, 9283/TCP, 8443/TCP
Command:
ceph-mgr
Args:
--fsid=16d88549-ca72-4881-8509-f40b26e82fd4
--keyring=/etc/ceph/keyring-store/keyring
--log-to-stderr=true
--err-to-stderr=true
--mon-cluster-log-to-stderr=true
--log-stderr-prefix=debug
--default-log-to-file=false
--default-mon-cluster-log-to-file=false
--mon-host=$(ROOK_CEPH_MON_HOST)
--mon-initial-members=$(ROOK_CEPH_MON_INITIAL_MEMBERS)
--id=a
--foreground
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 28 Jun 2019 08:22:03 +0200
Finished: Fri, 28 Jun 2019 08:23:32 +0200
Ready: False
Restart Count: 7
Limits:
cpu: 500m
memory: 1Gi
Requests:
cpu: 500m
memory: 1Gi
Liveness: http-get http://:9283/ delay=60s timeout=1s period=10s #success=1 #failure=3
Environment:
CONTAINER_IMAGE: ceph/ceph:v14.2.1-20190430
POD_NAME: rook-ceph-mgr-a-6fb587d789-hwwdj (v1:metadata.name)
POD_NAMESPACE: rook-ceph (v1:metadata.namespace)
NODE_NAME: (v1:spec.nodeName)
POD_MEMORY_LIMIT: 1073741824 (limits.memory)
POD_MEMORY_REQUEST: 1073741824 (requests.memory)
POD_CPU_LIMIT: 1 (limits.cpu)
POD_CPU_REQUEST: 1 (requests.cpu)
ROOK_CEPH_MON_HOST: <set to the key 'mon_host' in secret 'rook-ceph-config'> Optional: false
ROOK_CEPH_MON_INITIAL_MEMBERS: <set to the key 'mon_initial_members' in secret 'rook-ceph-config'> Optional: false
ROOK_OPERATOR_NAMESPACE: rook-ceph
ROOK_CEPH_CLUSTER_CRD_VERSION: v1
ROOK_VERSION: v1.0.0-154.g004f795
ROOK_CEPH_CLUSTER_CRD_NAME: rook-ceph
Mounts:
/etc/ceph from rook-ceph-config (ro)
/etc/ceph/keyring-store/ from rook-ceph-mgr-a-keyring (ro)
/var/lib/ceph/mgr/ceph-a from ceph-daemon-data (rw)
/var/log/ceph from rook-ceph-log (rw)
/var/run/secrets/kubernetes.io/serviceaccount from rook-ceph-mgr-token-8zw5z (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
rook-ceph-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: rook-ceph-config
Optional: false
rook-ceph-mgr-a-keyring:
Type: Secret (a volume populated by a Secret)
SecretName: rook-ceph-mgr-a-keyring
Optional: false
rook-ceph-log:
Type: HostPath (bare host directory volume)
Path: /var/lib/rook/rook-ceph/log
HostPathType:
ceph-daemon-data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
rook-ceph-mgr-token-8zw5z:
Type: Secret (a volume populated by a Secret)
SecretName: rook-ceph-mgr-token-8zw5z
Optional: false
QoS Class: Guaranteed
Node-Selectors: node-role.kubernetes.io/compute=true
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 18m default-scheduler Successfully assigned rook-ceph/rook-ceph-mgr-a-6fb587d789-hwwdj to node-2
Normal Pulled 14m (x4 over 18m) kubelet, node-2 Container image "ceph/ceph:v14.2.1-20190430" already present on machine
Normal Created 14m (x4 over 18m) kubelet, node-2 Created container
Normal Killing 14m (x3 over 17m) kubelet, node-2 Killing container with id docker://mgr:Container failed liveness probe.. Container will be killed and recreated.
Normal Started 14m (x4 over 18m) kubelet, node-2 Started container
Warning Unhealthy 8m (x19 over 17m) kubelet, node-2 Liveness probe failed: HTTP probe failed with statuscode: 403
Warning BackOff 3m (x17 over 8m) kubelet, node-2 Back-off restarting failed container
I have the same issue. I use rook-ceph (pretty much as-is settings) on k8s and I had one mgr pod. Since it crashed, it has not been able to come up again, and it is not even possible to run "ceph status" on the operator.
In my case it seems to be related to the small default ARP table size in Linux. When there is a high number of drives per server, which leads to a high number of containers per server, the ARP table can be exhausted, causing liveness probe failures.
After applying the following sysctl changes, the issue did not happen again.
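The exact values used in that comment were not included in the report; the following is a minimal sketch of the kind of sysctl change that raises the kernel's neighbor (ARP) table limits, with illustrative values only:

```sh
# Raise the kernel neighbor (ARP) table thresholds
# (illustrative values, not the ones from the original comment)
sysctl -w net.ipv4.neigh.default.gc_thresh1=4096
sysctl -w net.ipv4.neigh.default.gc_thresh2=8192
sysctl -w net.ipv4.neigh.default.gc_thresh3=16384
```

To persist across reboots, the same keys can be added to /etc/sysctl.conf or a drop-in file under /etc/sysctl.d/.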
Do check the ping response/latency for the mgr, and you may also want to increase the mgr liveness timeout: edit deploy/rook-ceph-mgr-a and set livenessProbe: timeoutSeconds: 9 (see the sketch below).
afaik not
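A minimal sketch of that edit as a one-off patch, assuming the mgr container is the first (and only) container in the rook-ceph-mgr-a deployment; on OpenShift, oc can be used in place of kubectl:

```sh
# Bump the mgr liveness probe timeout from 1s to 9s
kubectl -n rook-ceph patch deployment rook-ceph-mgr-a --type=json \
  -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 9}]'
```

Note that the Rook operator may revert a manual change like this the next time it reconciles the deployment.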
I also have the problem that I cannot access the mgr (after disabling the liveness check) from my browser via oc port-forward svc/rook-ceph-mgr 9283:
Forwarding from 127.0.0.1:9283 -> 9283
Forwarding from [::1]:9283 -> 9283
Handling connection for 9283
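With that port-forward in place, the same endpoint the liveness probe hits (http://:9283/) can be checked by hand to see whether it returns the 403 the kubelet reports; this is a diagnostic sketch, not something from the original thread:

```sh
# Query the mgr port used by the liveness probe and show the response status
curl -i http://127.0.0.1:9283/
```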
The mgr liveness probe is being updated in #8721, so it should also be independent of the network config and thus be more reliable.
Hope I’m not writing this too soon…
I switched from Weave Net to the Calico CNI, and rook-ceph came up much quicker and appears to be much more stable. Not sure if I just had Weave Net configured incorrectly, or if Calico really is more stable.
Previously I had noticed mon pods coming up very slowly, the mgr pod failing (as shown above), and other instabilities. So far it is much more stable.
Here are the logs from my failing mgr pod, which reproduce this bug.
I think the SSL error can be ignored.
Another observation: the bug does not occur with a small number of OSDs (probably < 10) but becomes prominent above 100.