rook: Manager Pod Dies in Cluster with 10 OSDs per node
Is this a bug report or feature request?
- Bug Report

Deviation from expected behavior:
In my cluster the manager pod stops working but remains in the Running state, after which Ceph becomes unhealthy until I manually restart the manager pod.

How to reproduce it (minimal and precise):
I have a cluster of 15 nodes with 2 NVMe devices each, which I split into 10 OSDs per node (5 per device), using bluestore with the journal on the OS disk. Maybe I need to increase the resource allocation for the manager pod because of the number of OSDs, or something like that? I haven't found anything in the logs yet.
The mons also seem to have issues in a similar way. Any idea why? Maybe resources need to scale with the number of OSDs per node?
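If this does turn out to be resource pressure, one option is to set explicit requests/limits for the mgr and mon daemons on the CephCluster CR. A minimal sketch with `kubectl patch`, assuming the default `rook-ceph` CR name and namespace and that this Rook version exposes `spec.resources` for mgr/mon; the sizes are placeholders, not recommendations:

```sh
# Sketch only: raise mgr/mon requests and limits on the CephCluster CR.
# Assumes the CR is named "rook-ceph" in the "rook-ceph" namespace and that
# spec.resources is supported in this Rook version; values are illustrative.
kubectl -n rook-ceph patch cephcluster rook-ceph --type merge -p '
{
  "spec": {
    "resources": {
      "mgr": {
        "requests": {"cpu": "500m", "memory": "512Mi"},
        "limits":   {"memory": "1Gi"}
      },
      "mon": {
        "requests": {"cpu": "500m", "memory": "1Gi"},
        "limits":   {"memory": "2Gi"}
      }
    }
  }
}'
```

Whether the operator applies this live or the mgr/mon deployments need a restart to pick it up may depend on the Rook version.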
Environment:
- OS (e.g. from /etc/os-release): Ubuntu 18.04
- Kernel (e.g. `uname -a`): 4.18.0-1013-azure
- Cloud provider or hardware configuration: Azure Standard_L16s_v2
- Rook version (use `rook version` inside of a Rook Pod): rook: v0.9.2
- Kubernetes version (use `kubectl version`): 1.14.0
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): Kubespray in Azure
- Storage backend status (e.g. for Ceph use `ceph health` in the Rook Ceph toolbox):
  [root@minion-12 /]# ceph health
  HEALTH_WARN 25 slow ops, oldest one blocked for 37503 sec, daemons [mon.et,mon.gr,mon.gu] have slow ops.
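To dig into the slow mon ops and to check whether the mgr/mon pods are hitting resource pressure, commands along these lines may help (a sketch; the `rook-ceph` namespace, the `app=` labels, and the availability of `kubectl top` via metrics-server are assumptions):

```sh
# From the Rook toolbox: more detail on what is unhealthy and which ops are slow.
ceph health detail
ceph status
ceph mon stat

# From a machine with kubectl access: look for restarts, OOMKilled containers,
# and resource usage of the mgr/mon pods.
kubectl -n rook-ceph get pods -o wide
kubectl -n rook-ceph describe pods -l app=rook-ceph-mgr | grep -A5 "Last State"
kubectl -n rook-ceph top pods 2>/dev/null
```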
About this issue
- State: closed
- Created 5 years ago
- Comments: 17 (11 by maintainers)
@travisn If the mgr is down we won't have any PG tracking. We need the logs of the pods dying in this case. @sharkymcdongles can you please provide that?
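For reference, something like the following should gather what is being asked for here (a sketch assuming the default `rook-ceph` namespace and the standard `app=rook-ceph-mgr` / `app=rook-ceph-mon` labels):

```sh
# Current and previous (pre-restart) mgr logs.
kubectl -n rook-ceph logs -l app=rook-ceph-mgr --tail=-1 > mgr.log
kubectl -n rook-ceph logs -l app=rook-ceph-mgr --previous > mgr-previous.log 2>/dev/null

# Mon logs plus pod events (look for OOMKilled, failed liveness probes, evictions).
kubectl -n rook-ceph logs -l app=rook-ceph-mon --tail=-1 > mons.log
kubectl -n rook-ceph describe pods -l app=rook-ceph-mon > mons-describe.txt
kubectl -n rook-ceph get events --sort-by=.lastTimestamp > events.txt
```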