rook: Manager Pod Dies in Cluster with 10 OSDs per node

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior: The manager pod stops working but remains in the Running state. Ceph then becomes unhealthy until I manually restart the manager pod.

How to reproduce it (minimal and precise): My cluster has 15 nodes with 2 NVMe devices each, which I split into 10 OSDs per node (5 per device), using Bluestore with the journal on the OS disk. Maybe I need to increase the resource allocation for the manager pod because of the number of OSDs? I haven't found anything in the logs yet.
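Since the pod stays in the Running state, one way to dig further (a rough sketch, assuming the default rook-ceph namespace and the app=rook-ceph-mgr label that Rook applies to the mgr deployment) would be to check restart counts, the last terminated state, and actual resource usage:

    # Restart count and last terminated state (e.g. OOMKilled) of the mgr container
    kubectl -n rook-ceph get pods -l app=rook-ceph-mgr
    kubectl -n rook-ceph describe pods -l app=rook-ceph-mgr | grep -A 5 "Last State"

    # Actual CPU/memory usage (requires metrics-server)
    kubectl -n rook-ceph top pod -l app=rook-ceph-mgr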

The mons also seem to have similar issues. Any idea why? Maybe the resource requests need to scale with the number of OSDs per node?
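If it does turn out to be resource pressure, the CephCluster CR has a resources section for the individual daemons. A minimal sketch of raising the mgr and mon requests and limits, assuming the cluster CR is named rook-ceph in the rook-ceph namespace, that this Rook version honors spec.resources for mgr and mon, and that the values shown are placeholders to size for your hardware:

    # Merge-patch the CephCluster CR; the operator should then reconcile the daemons
    kubectl -n rook-ceph patch cephcluster rook-ceph --type merge -p '
    {"spec": {"resources": {
      "mgr": {"requests": {"cpu": "500m", "memory": "1Gi"}, "limits": {"memory": "2Gi"}},
      "mon": {"requests": {"cpu": "500m", "memory": "1Gi"}, "limits": {"memory": "2Gi"}}
    }}}'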

Environment:

  • OS (e.g. from /etc/os-release): Ubuntu 18.04
  • Kernel (e.g. uname -a): 4.18.0-1013-azure
  • Cloud provider or hardware configuration: Azure Standard_L16s_v2
  • Rook version (use rook version inside of a Rook Pod): rook: v0.9.2
  • Kubernetes version (use kubectl version): 1.14.0
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): Kubespray in Azure
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):

        [root@minion-12 /]# ceph health
        HEALTH_WARN 25 slow ops, oldest one blocked for 37503 sec, daemons [mon.et,mon.gr,mon.gu] have slow ops.

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 17 (11 by maintainers)

Most upvoted comments

@travisn If the mgr is down we won’t have any PG tracking. We need the logs of the dying pods in this case. @sharkymcdongles can you please provide them?
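For reference, logs like those requested above can usually be gathered along these lines (a sketch, assuming the default rook-ceph and rook-ceph-system namespaces and the standard Rook labels; <mgr-pod-name> is whatever pod name the first command shows):

    # Current and previous (pre-restart) mgr logs
    kubectl -n rook-ceph get pods -l app=rook-ceph-mgr
    kubectl -n rook-ceph logs <mgr-pod-name> > mgr-current.log
    kubectl -n rook-ceph logs <mgr-pod-name> --previous > mgr-previous.log

    # Operator log and recent events, which often show probe failures or OOM kills
    kubectl -n rook-ceph-system logs -l app=rook-ceph-operator > operator.log
    kubectl -n rook-ceph get events --sort-by=.lastTimestamp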