rook: Mon pod gets into Init:CrashLoopBackOff after reboot
Hi there, I deployed a Ceph cluster several days ago and it worked well. Yesterday I rebooted the instance and found the mon pod stuck in the Init:CrashLoopBackOff state. I edited the operator as described in https://github.com/rook/rook/issues/3485, but it did not work. Any suggestions?
Is this a bug report or feature request?
- Bug Report
Deviation from expected behavior: after rebooting the node, the mon pod stays in Init:CrashLoopBackOff.
Expected behavior: the mon pod returns to the Running state after the reboot.
How to reproduce it (minimal and precise): reboot the instance that hosts the Rook Ceph mon pod.
File(s) to submit:
- Cluster CR (custom resource), typically called cluster.yaml, if necessary
Logs to submit:
- Operator’s logs, if necessary
...
2023-07-26 04:40:07.459289 I | op-mon: mon a is not yet running
2023-07-26 04:40:07.459348 I | op-mon: mons running: []
2023-07-26 04:40:07.459394 D | exec: Running command: ceph quorum_status --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2023-07-26 04:40:07.788130 D | ceph-object-controller: object store "rook-ceph/ceph-objectstore" status updated to "Progressing"
2023-07-26 04:40:07.788232 D | ceph-spec: "ceph-object-controller": CephCluster resource "rook-ceph" found in namespace "rook-ceph"
2023-07-26 04:40:07.788261 I | ceph-spec: ceph-object-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2023-07-25T09:52:09Z LastChanged:2023-07-25T08:27:57Z PreviousHealth:HEALTH_WARN Capacity:{TotalBytes:214748364800 UsedBytes:24040493056 AvailableBytes:190707871744 LastUpdated:2023-07-25T08:23:25Z} Versions:<nil> FSID:}
2023-07-26 04:40:07.788272 D | ceph-spec: "ceph-object-controller": CephCluster "rook-ceph" initial reconcile is not complete yet...
2023-07-26 04:40:07.788296 D | ceph-object-controller: successfully configured CephObjectStore "rook-ceph/ceph-objectstore"
2023-07-26 04:40:11.853829 D | clusterdisruption-controller: ceph "rook-ceph" cluster failed to check cluster health. failed to get status. . timed out: exit status 1
2023-07-26 04:40:11.853942 D | clusterdisruption-controller: reconciling "rook-ceph/"
2023-07-26 04:40:11.854123 D | clusterdisruption-controller: could not match failure domain. defaulting to "host"
2023-07-26 04:40:11.854204 D | exec: Running command: ceph status --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2023-07-26 04:40:16.996399 D | ceph-spec: "ceph-block-pool-controller": CephCluster resource "rook-ceph" found in namespace "rook-ceph"
2023-07-26 04:40:16.996456 I | ceph-spec: ceph-block-pool-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2023-07-25T09:52:09Z LastChanged:2023-07-25T08:27:57Z PreviousHealth:HEALTH_WARN Capacity:{TotalBytes:214748364800 UsedBytes:24040493056 AvailableBytes:190707871744 LastUpdated:2023-07-25T08:23:25Z} Versions:<nil> FSID:}
2023-07-26 04:40:16.996464 D | ceph-spec: "ceph-block-pool-controller": CephCluster "rook-ceph" initial reconcile is not complete yet...
2023-07-26 04:40:16.996478 D | ceph-block-pool-controller: successfully configured CephBlockPool "rook-ceph/ceph-blockpool"
2023-07-26 04:40:17.800489 D | ceph-object-controller: object store "rook-ceph/ceph-objectstore" status updated to "Progressing"
2023-07-26 04:40:17.800615 D | ceph-spec: "ceph-object-controller": CephCluster resource "rook-ceph" found in namespace "rook-ceph"
2023-07-26 04:40:17.800676 I | ceph-spec: ceph-object-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2023-07-25T09:52:09Z LastChanged:2023-07-25T08:27:57Z PreviousHealth:HEALTH_WARN Capacity:{TotalBytes:214748364800 UsedBytes:24040493056 AvailableBytes:190707871744 LastUpdated:2023-07-25T08:23:25Z} Versions:<nil> FSID:}
2023-07-26 04:40:17.800690 D | ceph-spec: "ceph-object-controller": CephCluster "rook-ceph" initial reconcile is not complete yet...
2023-07-26 04:40:17.800739 D | ceph-object-controller: successfully configured CephObjectStore "rook-ceph/ceph-objectstore"
2023-07-26 04:40:23.455699 D | op-mon: failed to get quorum_status. mon quorum status failed: exit status 1
2023-07-26 04:40:26.997288 D | ceph-spec: "ceph-block-pool-controller": CephCluster resource "rook-ceph" found in namespace "rook-ceph"
2023-07-26 04:40:26.997344 I | ceph-spec: ceph-block-pool-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2023-07-25T09:52:09Z LastChanged:2023-07-25T08:27:57Z PreviousHealth:HEALTH_WARN Capacity:{TotalBytes:214748364800 UsedBytes:24040493056 AvailableBytes:190707871744 LastUpdated:2023-07-25T08:23:25Z} Versions:<nil> FSID:}
2023-07-26 04:40:26.997352 D | ceph-spec: "ceph-block-pool-controller": CephCluster "rook-ceph" initial reconcile is not complete yet...
2023-07-26 04:40:26.997365 D | ceph-block-pool-controller: successfully configured CephBlockPool "rook-ceph/ceph-blockpool"
2023-07-26 04:40:27.777168 D | clusterdisruption-controller: ceph "rook-ceph" cluster failed to check cluster health. failed to get status. . timed out: exit status 1
2023-07-26 04:40:27.777274 D | clusterdisruption-controller: reconciling "rook-ceph/rook-ceph"
2023-07-26 04:40:27.777505 D | clusterdisruption-controller: could not match failure domain. defaulting to "host"
2023-07-26 04:40:27.777594 D | exec: Running command: ceph status --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2023-07-26 04:40:27.817158 D | ceph-object-controller: object store "rook-ceph/ceph-objectstore" status updated to "Progressing"
2023-07-26 04:40:27.817272 D | ceph-spec: "ceph-object-controller": CephCluster resource "rook-ceph" found in namespace "rook-ceph"
2023-07-26 04:40:27.817309 I | ceph-spec: ceph-object-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2023-07-25T09:52:09Z LastChanged:2023-07-25T08:27:57Z PreviousHealth:HEALTH_WARN Capacity:{TotalBytes:214748364800 UsedBytes:24040493056 AvailableBytes:190707871744 LastUpdated:2023-07-25T08:23:25Z} Versions:<nil> FSID:}
2023-07-26 04:40:27.817321 D | ceph-spec: "ceph-object-controller": CephCluster "rook-ceph" initial reconcile is not complete yet...
2023-07-26 04:40:27.817344 D | ceph-object-controller: successfully configured CephObjectStore "rook-ceph/ceph-objectstore"
2023-07-26 04:40:28.468172 I | op-mon: mon a is not yet running
2023-07-26 04:40:28.468219 I | op-mon: mons running: []
2023-07-26 04:40:28.468254 D | exec: Running command: ceph quorum_status --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2023-07-26 04:40:36.998260 D | ceph-spec: "ceph-block-pool-controller": CephCluster resource "rook-ceph" found in namespace "rook-ceph"
2023-07-26 04:40:36.998318 I | ceph-spec: ceph-block-pool-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2023-07-25T09:52:09Z LastChanged:2023-07-25T08:27:57Z PreviousHealth:HEALTH_WARN Capacity:{TotalBytes:214748364800 UsedBytes:24040493056 AvailableBytes:190707871744 LastUpdated:2023-07-25T08:23:25Z} Versions:<nil> FSID:}
2023-07-26 04:40:36.998327 D | ceph-spec: "ceph-block-pool-controller": CephCluster "rook-ceph" initial reconcile is not complete yet...
2023-07-26 04:40:36.998340 D | ceph-block-pool-controller: successfully configured CephBlockPool "rook-ceph/ceph-blockpool"
2023-07-26 04:40:37.833455 D | ceph-object-controller: object store "rook-ceph/ceph-objectstore" status updated to "Progressing"
2023-07-26 04:40:37.833594 D | ceph-spec: "ceph-object-controller": CephCluster resource "rook-ceph" found in namespace "rook-ceph"
2023-07-26 04:40:37.833635 I | ceph-spec: ceph-object-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2023-07-25T09:52:09Z LastChanged:2023-07-25T08:27:57Z PreviousHealth:HEALTH_WARN Capacity:{TotalBytes:214748364800 UsedBytes:24040493056 AvailableBytes:190707871744 LastUpdated:2023-07-25T08:23:25Z} Versions:<nil> FSID:}
2023-07-26 04:40:37.833647 D | ceph-spec: "ceph-object-controller": CephCluster "rook-ceph" initial reconcile is not complete yet...
2023-07-26 04:40:37.833670 D | ceph-object-controller: successfully configured CephObjectStore "rook-ceph/ceph-objectstore"
...
- Crashing pod(s) logs, if necessary
# kubectl logs -f -n rook-ceph rook-ceph-mon-a-86b549b854-p8lnl chown-container-data-dir
ownership of '/var/log/ceph/ceph-client.rgw.ceph.objectstore.a.log.7.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-mon.a.log' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-osd.0.log.6.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-mgr.a.log.4.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-client.rgw.ceph.objectstore.a.log.1.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-client.rgw.ceph.objectstore.a.log.5.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-osd.0.log.1.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-osd.0.log' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-mon.a.log.4.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-osd.0.log.2.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-mon.a.log.2.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-client.rgw.ceph.objectstore.a.log.3.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-mgr.a.log' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-mgr.a.log.5.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-client.rgw.ceph.objectstore.a.log.2.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-mon.a.log.3.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-mon.a.log.6.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-osd.0.log.5.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-mon.a.log.5.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-client.rgw.ceph.objectstore.a.log.4.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-mgr.a.log.2.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-client.rgw.ceph.objectstore.a.log.6.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-volume.log' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-osd.0.log.3.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-mgr.a.log.6.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-mgr.a.log.1.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-osd.0.log.4.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-mgr.a.log.3.gz' retained as ceph:ceph
ownership of '/var/log/ceph/ceph-mon.a.log.1.gz' retained as ceph:ceph
ownership of '/var/log/ceph' retained as ceph:ceph
ownership of '/var/lib/ceph/crash/posted' retained as ceph:ceph
ownership of '/var/lib/ceph/crash' retained as ceph:ceph
ownership of '/run/ceph/ceph-osd.0.asok' retained as ceph:ceph
ownership of '/run/ceph/ceph-mgr.a.asok' retained as ceph:ceph
ownership of '/run/ceph' retained as ceph:ceph
ownership of '/var/lib/ceph/mon/ceph-a/external_log_to' retained as ceph:ceph
ownership of '/var/lib/ceph/mon/ceph-a/kv_backend' retained as ceph:ceph
ownership of '/var/lib/ceph/mon/ceph-a/keyring' retained as ceph:ceph
ownership of '/var/lib/ceph/mon/ceph-a/min_mon_release' retained as ceph:ceph
ownership of '/var/lib/ceph/mon/ceph-a/store.db/CURRENT' retained as ceph:ceph
ownership of '/var/lib/ceph/mon/ceph-a/store.db/015298.log' retained as ceph:ceph
ownership of '/var/lib/ceph/mon/ceph-a/store.db/MANIFEST-000009' retained as ceph:ceph
ownership of '/var/lib/ceph/mon/ceph-a/store.db/OPTIONS-000012' retained as ceph:ceph
ownership of '/var/lib/ceph/mon/ceph-a/store.db/015300.sst' retained as ceph:ceph
ownership of '/var/lib/ceph/mon/ceph-a/store.db/OPTIONS-000006' retained as ceph:ceph
ownership of '/var/lib/ceph/mon/ceph-a/store.db/IDENTITY' retained as ceph:ceph
ownership of '/var/lib/ceph/mon/ceph-a/store.db/LOCK' retained as ceph:ceph
ownership of '/var/lib/ceph/mon/ceph-a/store.db' retained as ceph:ceph
ownership of '/var/lib/ceph/mon/ceph-a' retained as ceph:ceph
# kubectl logs -f -n rook-ceph rook-ceph-mon-a-86b549b854-p8lnl init-mon-fs
debug 2023-07-26T04:37:11.189+0000 2b0188522c40 0 set uid:gid to 167:167 (ceph:ceph)
debug 2023-07-26T04:37:11.189+0000 2b0188522c40 -1 stat(/var/lib/ceph/mon/ceph-a) (13) Permission denied
debug 2023-07-26T04:37:11.189+0000 2b0188522c40 -1 error opening '/var/lib/ceph/mon/ceph-a': (13) Permission denied
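The init-mon-fs failure above suggests the mon's host data directory is no longer accessible to the ceph user (uid:gid 167:167) after the reboot; since chown-container-data-dir reports ownership retained as ceph:ceph, the SELinux labels or the permissions on a parent directory are likely suspects on RHEL. A minimal sketch of host-side checks, assuming the default dataDirHostPath of /var/lib/rook and a mon-a subdirectory (the exact paths may differ):
# getenforce
# ls -ldn /var/lib/rook /var/lib/rook/mon-a /var/lib/rook/mon-a/data
# ls -ldZ /var/lib/rook /var/lib/rook/mon-a /var/lib/rook/mon-a/data
# ausearch -m avc -ts recent | grep ceph
If the audit log shows AVC denials, restorecon -Rv /var/lib/rook is one way to restore the labels; if only the ownership or mode of a parent directory is off, chown -R 167:167 (or adjusting the directory mode) is another option. These are generic checks for this kind of init failure, not a confirmed fix for this cluster.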
Cluster Status to submit:
# kubectl get pods -n rook-ceph
NAME                                                READY   STATUS                  RESTARTS          AGE
csi-rbdplugin-9mrdc                                 2/2     Running                 3 (20h ago)       6d20h
csi-rbdplugin-provisioner-54f7bdb897-vfknm          5/5     Running                 7 (20h ago)       6d20h
rook-ceph-crashcollector-test101-78747cd48c-26zf8   1/1     Running                 0                 38s
rook-ceph-mon-a-86b549b854-p8lnl                    0/2     Init:CrashLoopBackOff   2 (18s ago)       38s
rook-ceph-operator-b95c97c54-l9hsh                  1/1     Running                 0                 68s
rook-ceph-osd-prepare-test101--1-46pd6              0/1     Completed               0                 6d20h
rook-ceph-rgw-ceph-objectstore-a-587fbfcc4b-nhjtq   1/2     Running                 214 (3m25s ago)   6d20h
rook-ceph-tools-6f9d969fbf-j6ldw                    1/1     Running                 1 (20h ago)       6d20h
- Output of krew commands, if necessary
Environment:
- OS (e.g. from /etc/os-release): rhel-server-7.9
- Kernel (e.g. uname -a): 3.10.0-1160
- Cloud provider or hardware configuration:
- Rook version (use rook version inside of a Rook Pod): 1.11.4
- Storage backend version (e.g. for ceph do ceph -v): 17.6
- Kubernetes version (use kubectl version): 1.22.6
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): RKE2
- Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 21 (10 by maintainers)
I have deployed Ceph clusters with Rook more than 100 times in the last two months, but this is the first time I have hit this problem. I will redeploy later with the path changed from /var/lib/rook to another location.
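For reference, that path is controlled by the dataDirHostPath field in the CephCluster CR (cluster.yaml). A minimal fragment, with /data/rook as a purely illustrative replacement path:
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  # Host directory where Rook keeps mon and config data; defaults to /var/lib/rook.
  dataDirHostPath: /data/rook
As far as I know, changing dataDirHostPath on an already-running cluster is not a supported in-place change, so this would normally be done together with tearing down and redeploying the cluster.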