rook: mgr pod in CrashLoop in 0.8.x

Is this a bug report or feature request? Bug Report

Deviation from expected behavior: After rescheduling the mgr pod, it goes into a CrashLoop with the following:

2018-09-30 19:26:48.956022 I | ceph-mgr: 2018-09-30 19:26:48.955809 7f0adebf4700  1 mgr send_beacon active
2018-09-30 19:26:50.970860 I | ceph-mgr: 2018-09-30 19:26:50.970649 7f0adebf4700  1 mgr send_beacon active
2018-09-30 19:26:52.985827 I | ceph-mgr: 2018-09-30 19:26:52.985611 7f0adebf4700  1 mgr send_beacon active
2018-09-30 19:26:54.004538 I | ceph-mgr: [30/Sep/2018:19:26:47] ENGINE Bus STARTING
2018-09-30 19:26:54.004566 I | ceph-mgr: CherryPy Checker:
2018-09-30 19:26:54.004575 I | ceph-mgr: The Application mounted at '' has an empty config.
2018-09-30 19:26:54.004581 I | ceph-mgr: 
2018-09-30 19:26:54.004588 I | ceph-mgr: [30/Sep/2018:19:26:47] ENGINE Started monitor thread '_TimeoutMonitor'.
2018-09-30 19:26:54.004594 I | ceph-mgr: [30/Sep/2018:19:26:47] ENGINE Bus STARTING
2018-09-30 19:26:54.004600 I | ceph-mgr: [30/Sep/2018:19:26:47] ENGINE Started monitor thread '_TimeoutMonitor'.
2018-09-30 19:26:54.004606 I | ceph-mgr: [30/Sep/2018:19:26:47] ENGINE Serving on :::7000
2018-09-30 19:26:54.004611 I | ceph-mgr: [30/Sep/2018:19:26:47] ENGINE Bus STARTED
2018-09-30 19:26:54.004624 I | ceph-mgr: [30/Sep/2018:19:26:47] ENGINE Serving on :::9283
2018-09-30 19:26:54.004630 I | ceph-mgr: [30/Sep/2018:19:26:47] ENGINE Bus STARTED
2018-09-30 19:26:54.004636 I | ceph-mgr: terminate called after throwing an instance of 'std::out_of_range'
2018-09-30 19:26:54.004644 I | ceph-mgr:   what():  map::at
failed to run mgr. failed to start mgr: Failed to complete 'ceph-mgr': signal: aborted (core dumped).

Expected behavior: No crash loop 😉

How to reproduce it (minimal and precise): Personally I’ve experienced it in several test clusters, but haven’t had time to dig into it until tonight. Another user mentioned this a week ago on Slack, and @galexrt pointed to this bug in Ceph that seems to be related: https://tracker.ceph.com/issues/24982

In that issue, people mention multiple RGWs, and I’m running RGWs as a DaemonSet in these clusters. So I tried scaling the number of RGWs down to 1 using NodeAffinity, and now the mgr was able to start up. Once it’s up, I can scale the RGWs back up to the full count (5 on the testing cluster) in one go and the mgr stays up. Without knowing this in depth, it seems to me the RGWs build up a history of metrics to deliver to the mgr while they can’t reach it. When the mgr starts again, these historic metrics overwhelm it and it gets startled, not to be confused with started.
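
A rough sketch of the approach, using a plain nodeSelector here for brevity rather than full node affinity; the node name, label key, and DaemonSet name are placeholders from my setup, so adjust them for yours:

# Label one node to host the single RGW instance (node name is an example)
kubectl label node worker-1 rgw-pin=true

# Restrict the RGW DaemonSet to that node (the DaemonSet name depends on the object store name)
kubectl -n rook-ceph patch daemonset rook-ceph-rgw-my-store --type merge \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"rgw-pin":"true"}}}}}'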

Environment:

  • OS (e.g. from /etc/os-release): CoreOS
  • Kernel (e.g. uname -a): Something new
  • Cloud provider or hardware configuration: Bare metal
  • Rook version (use rook version inside of a Rook Pod): v0.8.1 - 99% sure I saw it on 0.8.2 as well, but I had to move off that version due to a different bug.
  • Kubernetes version (use kubectl version): 1.11.3
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): Kubespray
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): Seems happy, but obviously no dashboard or metrics.

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 4
  • Comments: 52 (36 by maintainers)

Most upvoted comments

We made some progress in our deployment over the past few weeks. We’re currently running v0.8.3 (Ceph 12.2.7), and running any more than 3 RGWs in our cluster would cause our MGR to go into CrashLoopBackOff. Today we were able to keep our MGR stable with more RGWs by downgrading the MGR Deployment image to v0.8.0 (Ceph 12.2.4?) and leaving the rest of the cluster at v0.8.3 (Ceph 12.2.7). You can accomplish this with the following:

kubectl -n rook-ceph set image deploy/rook-ceph-mgr-a rook-ceph-mgr-a=rook/ceph:v0.8.0
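
You can watch the MGR come back with the downgraded image, and revert later, with something like this (the app=rook-ceph-mgr label and the deployment name above follow the standard Rook naming; adjust if yours differ):

# Watch the mgr pod restart with the downgraded image
kubectl -n rook-ceph get pods -l app=rook-ceph-mgr -w

# Revert once a fixed image is available
kubectl -n rook-ceph set image deploy/rook-ceph-mgr-a rook-ceph-mgr-a=rook/ceph:v0.8.3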

Still an issue in rook 0.9, rgw & mgr v13.2.2-20181023

@travisn It’s also working on my cluster when I scale the rgw to 1

Got it working with 1 RGW

Fixed in Mimic 13.2.5, so this can be closed. The container finally made it to Docker Hub.
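
If you’re on Rook 0.9+, where the Ceph version is set on the cluster CR, picking up the fix should be a matter of bumping the image, roughly like this (assuming the default rook-ceph cluster name and namespace):

kubectl -n rook-ceph patch cephcluster rook-ceph --type merge \
  -p '{"spec":{"cephVersion":{"image":"ceph/ceph:v13.2.5"}}}'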

Any progress here?

@travisn Hmm… I modified the s3 objectstore operator to scale RGW to one, and it deleted the deployment and never recreated it. Is there a way to create one again?

Is there a way to disable metrics for S3 and only have blockdev working, like it worked before enabling S3?
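
Would disabling the mgr’s prometheus module from the toolbox be enough? Something like the following (just a guess on my side; I haven’t tried it, and the operator may simply re-enable the module):

# Run inside the rook-ceph toolbox; disables all mgr prometheus metrics, not just RGW
ceph mgr module disable prometheus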