rook: Mon pod gets into Init:CrashLoopBackOff after reboot

Hi there, I deployed a Ceph cluster several days ago and it worked well. Yesterday I rebooted the instance and found that the mon pod had gone into the Init:CrashLoopBackOff state. I edited the operator as described in https://github.com/rook/rook/issues/3485, but that did not work. Any suggestions?

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:

Expected behavior: mon pods turn to the Running state

How to reproduce it (minimal and precise):

File(s) to submit:

  • Cluster CR (custom resource), typically called cluster.yaml, if necessary

Logs to submit:

  • Operator’s logs, if necessary
...
2023-07-26 04:40:07.459289 I | op-mon: mon a is not yet running
2023-07-26 04:40:07.459348 I | op-mon: mons running: []
2023-07-26 04:40:07.459394 D | exec: Running command: ceph quorum_status --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2023-07-26 04:40:07.788130 D | ceph-object-controller: object store "rook-ceph/ceph-objectstore" status updated to "Progressing"
2023-07-26 04:40:07.788232 D | ceph-spec: "ceph-object-controller": CephCluster resource "rook-ceph" found in namespace "rook-ceph"
2023-07-26 04:40:07.788261 I | ceph-spec: ceph-object-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2023-07-25T09:52:09Z LastChanged:2023-07-25T08:27:57Z PreviousHealth:HEALTH_WARN Capacity:{TotalBytes:214748364800 UsedBytes:24040493056 AvailableBytes:190707871744 LastUpdated:2023-07-25T08:23:25Z} Versions:<nil> FSID:}
2023-07-26 04:40:07.788272 D | ceph-spec: "ceph-object-controller": CephCluster "rook-ceph" initial reconcile is not complete yet...
2023-07-26 04:40:07.788296 D | ceph-object-controller: successfully configured CephObjectStore "rook-ceph/ceph-objectstore"
2023-07-26 04:40:11.853829 D | clusterdisruption-controller: ceph "rook-ceph" cluster failed to check cluster health. failed to get status. . timed out: exit status 1
2023-07-26 04:40:11.853942 D | clusterdisruption-controller: reconciling "rook-ceph/"
2023-07-26 04:40:11.854123 D | clusterdisruption-controller: could not match failure domain. defaulting to "host"
2023-07-26 04:40:11.854204 D | exec: Running command: ceph status --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2023-07-26 04:40:16.996399 D | ceph-spec: "ceph-block-pool-controller": CephCluster resource "rook-ceph" found in namespace "rook-ceph"
2023-07-26 04:40:16.996456 I | ceph-spec: ceph-block-pool-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2023-07-25T09:52:09Z LastChanged:2023-07-25T08:27:57Z PreviousHealth:HEALTH_WARN Capacity:{TotalBytes:214748364800 UsedBytes:24040493056 AvailableBytes:190707871744 LastUpdated:2023-07-25T08:23:25Z} Versions:<nil> FSID:}
2023-07-26 04:40:16.996464 D | ceph-spec: "ceph-block-pool-controller": CephCluster "rook-ceph" initial reconcile is not complete yet...
2023-07-26 04:40:16.996478 D | ceph-block-pool-controller: successfully configured CephBlockPool "rook-ceph/ceph-blockpool"
2023-07-26 04:40:17.800489 D | ceph-object-controller: object store "rook-ceph/ceph-objectstore" status updated to "Progressing"
2023-07-26 04:40:17.800615 D | ceph-spec: "ceph-object-controller": CephCluster resource "rook-ceph" found in namespace "rook-ceph"
2023-07-26 04:40:17.800676 I | ceph-spec: ceph-object-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2023-07-25T09:52:09Z LastChanged:2023-07-25T08:27:57Z PreviousHealth:HEALTH_WARN Capacity:{TotalBytes:214748364800 UsedBytes:24040493056 AvailableBytes:190707871744 LastUpdated:2023-07-25T08:23:25Z} Versions:<nil> FSID:}
2023-07-26 04:40:17.800690 D | ceph-spec: "ceph-object-controller": CephCluster "rook-ceph" initial reconcile is not complete yet...
2023-07-26 04:40:17.800739 D | ceph-object-controller: successfully configured CephObjectStore "rook-ceph/ceph-objectstore"
2023-07-26 04:40:23.455699 D | op-mon: failed to get quorum_status. mon quorum status failed: exit status 1
2023-07-26 04:40:26.997288 D | ceph-spec: "ceph-block-pool-controller": CephCluster resource "rook-ceph" found in namespace "rook-ceph"
2023-07-26 04:40:26.997344 I | ceph-spec: ceph-block-pool-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2023-07-25T09:52:09Z LastChanged:2023-07-25T08:27:57Z PreviousHealth:HEALTH_WARN Capacity:{TotalBytes:214748364800 UsedBytes:24040493056 AvailableBytes:190707871744 LastUpdated:2023-07-25T08:23:25Z} Versions:<nil> FSID:}
2023-07-26 04:40:26.997352 D | ceph-spec: "ceph-block-pool-controller": CephCluster "rook-ceph" initial reconcile is not complete yet...
2023-07-26 04:40:26.997365 D | ceph-block-pool-controller: successfully configured CephBlockPool "rook-ceph/ceph-blockpool"
2023-07-26 04:40:27.777168 D | clusterdisruption-controller: ceph "rook-ceph" cluster failed to check cluster health. failed to get status. . timed out: exit status 1
2023-07-26 04:40:27.777274 D | clusterdisruption-controller: reconciling "rook-ceph/rook-ceph"
2023-07-26 04:40:27.777505 D | clusterdisruption-controller: could not match failure domain. defaulting to "host"
2023-07-26 04:40:27.777594 D | exec: Running command: ceph status --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2023-07-26 04:40:27.817158 D | ceph-object-controller: object store "rook-ceph/ceph-objectstore" status updated to "Progressing"
2023-07-26 04:40:27.817272 D | ceph-spec: "ceph-object-controller": CephCluster resource "rook-ceph" found in namespace "rook-ceph"
2023-07-26 04:40:27.817309 I | ceph-spec: ceph-object-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2023-07-25T09:52:09Z LastChanged:2023-07-25T08:27:57Z PreviousHealth:HEALTH_WARN Capacity:{TotalBytes:214748364800 UsedBytes:24040493056 AvailableBytes:190707871744 LastUpdated:2023-07-25T08:23:25Z} Versions:<nil> FSID:}
2023-07-26 04:40:27.817321 D | ceph-spec: "ceph-object-controller": CephCluster "rook-ceph" initial reconcile is not complete yet...
2023-07-26 04:40:27.817344 D | ceph-object-controller: successfully configured CephObjectStore "rook-ceph/ceph-objectstore"
2023-07-26 04:40:28.468172 I | op-mon: mon a is not yet running
2023-07-26 04:40:28.468219 I | op-mon: mons running: []
2023-07-26 04:40:28.468254 D | exec: Running command: ceph quorum_status --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2023-07-26 04:40:36.998260 D | ceph-spec: "ceph-block-pool-controller": CephCluster resource "rook-ceph" found in namespace "rook-ceph"
2023-07-26 04:40:36.998318 I | ceph-spec: ceph-block-pool-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2023-07-25T09:52:09Z LastChanged:2023-07-25T08:27:57Z PreviousHealth:HEALTH_WARN Capacity:{TotalBytes:214748364800 UsedBytes:24040493056 AvailableBytes:190707871744 LastUpdated:2023-07-25T08:23:25Z} Versions:<nil> FSID:}
2023-07-26 04:40:36.998327 D | ceph-spec: "ceph-block-pool-controller": CephCluster "rook-ceph" initial reconcile is not complete yet...
2023-07-26 04:40:36.998340 D | ceph-block-pool-controller: successfully configured CephBlockPool "rook-ceph/ceph-blockpool"
2023-07-26 04:40:37.833455 D | ceph-object-controller: object store "rook-ceph/ceph-objectstore" status updated to "Progressing"
2023-07-26 04:40:37.833594 D | ceph-spec: "ceph-object-controller": CephCluster resource "rook-ceph" found in namespace "rook-ceph"
2023-07-26 04:40:37.833635 I | ceph-spec: ceph-object-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2023-07-25T09:52:09Z LastChanged:2023-07-25T08:27:57Z PreviousHealth:HEALTH_WARN Capacity:{TotalBytes:214748364800 UsedBytes:24040493056 AvailableBytes:190707871744 LastUpdated:2023-07-25T08:23:25Z} Versions:<nil> FSID:}
2023-07-26 04:40:37.833647 D | ceph-spec: "ceph-object-controller": CephCluster "rook-ceph" initial reconcile is not complete yet...
2023-07-26 04:40:37.833670 D | ceph-object-controller: successfully configured CephObjectStore "rook-ceph/ceph-objectstore"
...
  • Crashing pod(s) logs, if necessary
# kubectl logs -f -n rook-ceph rook-ceph-mon-a-86b549b854-p8lnl chown-container-data-dir
ownership of '/var/log/ceph/ceph-client.rgw.ceph.objectstore.a.log.7.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-mon.a.log' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-osd.0.log.6.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-mgr.a.log.4.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-client.rgw.ceph.objectstore.a.log.1.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-client.rgw.ceph.objectstore.a.log.5.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-osd.0.log.1.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-osd.0.log' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-mon.a.log.4.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-osd.0.log.2.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-mon.a.log.2.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-client.rgw.ceph.objectstore.a.log.3.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-mgr.a.log' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-mgr.a.log.5.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-client.rgw.ceph.objectstore.a.log.2.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-mon.a.log.3.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-mon.a.log.6.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-osd.0.log.5.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-mon.a.log.5.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-client.rgw.ceph.objectstore.a.log.4.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-mgr.a.log.2.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-client.rgw.ceph.objectstore.a.log.6.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-volume.log' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-osd.0.log.3.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-mgr.a.log.6.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-mgr.a.log.1.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-osd.0.log.4.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-mgr.a.log.3.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-mon.a.log.1.gz' retained as ceph:ceph
ownership of '/var/log/ceph' retained as ceph:ceph
ownership of '/var/lib/ceph/crash/posted' retained as ceph:ceph
ownership of '/var/lib/ceph/crash' retained as ceph:ceph
ownership of '/run/ceph/ceph-osd.0.asok' retained as ceph:ceph
ownership of '/run/ceph/ceph-mgr.a.asok' retained as ceph:ceph
ownership of '/run/ceph' retained as ceph:ceph
ownership of '/var/lib/ceph/mon/ceph-a/external_log_to' retained as ceph:ceph
ownership of '/var/lib/ceph/mon/ceph-a/kv_backend' retained as ceph:ceph
ownership of '/var/lib/ceph/mon/ceph-a/keyring' retained as ceph:ceph
ownership of '/var/lib/ceph/mon/ceph-a/min_mon_release' retained as ceph:ceph
ownership of '/var/lib/ceph/mon/ceph-a/store.db/CURRENT' retained as ceph:ceph
ownership of '/var/lib/ceph/mon/ceph-a/store.db/015298.log' retained as ceph:ceph
ownership of '/var/lib/ceph/mon/ceph-a/store.db/MANIFEST-000009' retained as ceph:ceph
ownership of '/var/lib/ceph/mon/ceph-a/store.db/OPTIONS-000012' retained as ceph:ceph
ownership of '/var/lib/ceph/mon/ceph-a/store.db/015300.sst' retained as ceph:ceph
ownership of '/var/lib/ceph/mon/ceph-a/store.db/OPTIONS-000006' retained as ceph:ceph
ownership of '/var/lib/ceph/mon/ceph-a/store.db/IDENTITY' retained as ceph:ceph
ownership of '/var/lib/ceph/mon/ceph-a/store.db/LOCK' retained as ceph:ceph
ownership of '/var/lib/ceph/mon/ceph-a/store.db' retained as ceph:ceph
ownership of '/var/lib/ceph/mon/ceph-a' retained as ceph:ceph
# kubectl logs -f -n rook-ceph rook-ceph-mon-a-86b549b854-p8lnl init-mon-fs
debug 2023-07-26T04:37:11.189+0000 2b0188522c40  0 set uid:gid to 167:167 (ceph:ceph)
debug 2023-07-26T04:37:11.189+0000 2b0188522c40 -1 stat(/var/lib/ceph/mon/ceph-a) (13) Permission denied
debug 2023-07-26T04:37:11.189+0000 2b0188522c40 -1 error opening '/var/lib/ceph/mon/ceph-a': (13) Permission denied
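
The init-mon-fs failure above is a plain permission error on the mon data directory, even though the chown init container completed, so checking that directory's ownership and SELinux label directly on the host is a reasonable next step. A minimal sketch, assuming the default dataDirHostPath of /var/lib/rook and the usual mon-a/data layout (both are assumptions, not confirmed in this report):

# ls -ldZ /var/lib/rook /var/lib/rook/mon-a /var/lib/rook/mon-a/data
# stat -c '%U:%G %a %n' /var/lib/rook/mon-a/data

The directory should be owned by UID/GID 167 (the ceph:ceph user the mon switches to, per the log above); an unexpected owner or an SELinux relabel after the reboot would produce exactly this kind of permission-denied error.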

To get logs, use kubectl -n <namespace> logs <pod name>. When pasting logs, always surround them with backticks or use the insert code button from the GitHub UI. Read the GitHub documentation if you need help.
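
For a crashing init container specifically, the container name is passed with -c, and --previous shows the last failed attempt. A hedged example using the mon pod name from the listing below:

# kubectl -n rook-ceph logs rook-ceph-mon-a-86b549b854-p8lnl -c init-mon-fs --previous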

Cluster Status to submit:

# kubectl get pods -n rook-ceph
NAME                                                        READY   STATUS                  RESTARTS          AGE
csi-rbdplugin-9mrdc                                         2/2     Running                 3 (20h ago)       6d20h
csi-rbdplugin-provisioner-54f7bdb897-vfknm                  5/5     Running                 7 (20h ago)       6d20h
rook-ceph-crashcollector-test101-78747cd48c-26zf8   1/1     Running                 0                 38s
rook-ceph-mon-a-86b549b854-p8lnl                            0/2     Init:CrashLoopBackOff   2 (18s ago)       38s
rook-ceph-operator-b95c97c54-l9hsh                          1/1     Running                 0                 68s
rook-ceph-osd-prepare-test101--1-46pd6              0/1     Completed               0                 6d20h
rook-ceph-rgw-ceph-objectstore-a-587fbfcc4b-nhjtq           1/2     Running                 214 (3m25s ago)   6d20h
rook-ceph-tools-6f9d969fbf-j6ldw                            1/1     Running                 1 (20h ago)       6d20h
  • Output of krew commands, if necessary

    To get the health of the cluster, use kubectl rook-ceph health. To get the status of the cluster, use kubectl rook-ceph ceph status. For more details, see the Rook Krew Plugin.
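
For completeness, a minimal sketch of installing the plugin via krew and running the two commands mentioned above (assuming krew itself is already installed):

# kubectl krew install rook-ceph
# kubectl rook-ceph health
# kubectl rook-ceph ceph status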

Environment:

  • OS (e.g. from /etc/os-release): rhel-server-7.9
  • Kernel (e.g. uname -a): 3.10.0-1160
  • Cloud provider or hardware configuration:
  • Rook version (use rook version inside of a Rook Pod): 1.11.4
  • Storage backend version (e.g. for ceph do ceph -v): 17.6
  • Kubernetes version (use kubectl version): 1.22.6
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): RKE2
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 21 (10 by maintainers)

Most upvoted comments

Ok, so the mon pods just can't access the data; I'm not clear on why that would be. Have you seen this issue beyond this cluster? Or do you have another cluster where you could test if it repros?

I have deployed Ceph clusters using Rook more than 100 times over the past two months, but this is the first time I have run into this problem. I will change the path from /var/lib/rook to another location and try again later.
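
For reference, the host path in question is controlled by the dataDirHostPath field of the CephCluster CR; before moving it, the value the running cluster uses can be confirmed with something like the following (the resource and namespace names are taken from the logs above):

# kubectl -n rook-ceph get cephcluster rook-ceph -o jsonpath='{.spec.dataDirHostPath}'

Note that dataDirHostPath is normally fixed at cluster creation, so changing it in practice means redeploying, as the comment above describes.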