rook: Ceph cluster lost and no recovery solution has worked. All OSDs are `in` but all PGs are `unknown`.
We’ve had a serious problem with our production cluster & we need urgent help. Thanks in advance.
Is this a bug report or feature request?
- Bug Report
Deviation from expected behavior: After updating Kubernetes and rebooting our servers, our OSDs stopped working. We tried our best to restore them, using every method we could find and checking related issues, documents, and configurations. We restored our Kubernetes snapshot and the etcd backup. Currently, all PGs show as `unknown`.
Expected behavior: OSDs should return to a normal state after an upgrade.
How to reproduce it (minimal and precise):
- Update Kubernetes from `v1.17` to `v1.18` and change all certificates; Rook and Ceph then become unstable!
- Restore etcd to the last stable backup of `v1.17` using `rke etcd snapshot-restore ...` and run `rke up` again (see the sketch after this list).
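For completeness, the restore flow looked roughly like this. It is only a sketch; the snapshot name and the `cluster.yml` path are placeholders, not our real values:

```sh
# Restore etcd from the last snapshot taken while the cluster was still on v1.17
# (the snapshot name below is a placeholder)
rke etcd snapshot-restore --config cluster.yml --name <v1.17-snapshot-name>

# Re-provision the cluster from the restored state
rke up --config cluster.yml
```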
Best solution we applied to restore the old Ceph cluster
- Start a new, clean Rook Ceph cluster with the old `CephCluster`, `CephBlockPool`, `CephFilesystem`, `CephNFS`, and `CephObjectStore` resources.
- Shut the new cluster down when it has been created successfully.
- Replace ceph-mon data with that of the old cluster.
- Replace the `fsid` in `secrets/rook-ceph-mon` with that of the old cluster.
- Fix the monmap in the ceph-mon db.
- Fix ceph mon auth key.
- Disable auth.
- Start the new cluster and watch it resurrect (a rough sketch of the `fsid` and monmap fixes follows below). Reference: https://rook.github.io/docs/rook/v1.6/ceph-disaster-recovery.html
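For reference, the `fsid` and monmap steps above looked roughly like this in our case. This is only a sketch following the linked guide: the mon id `a`, the removed mon names, the mon data path, and the `OLD_FSID` variable are illustrative, not exact values from our cluster.

```sh
# Put the old cluster's fsid into the rook-ceph-mon secret
# (OLD_FSID is a placeholder for the fsid recovered from the old cluster)
kubectl -n rook-ceph patch secret rook-ceph-mon \
  -p '{"stringData": {"fsid": "'"${OLD_FSID}"'"}}'

# Inside the mon pod (or a debug pod with the mon data mounted),
# rewrite the monmap so that it only contains the surviving mon "a"
ceph-mon -i a --mon-data /var/lib/ceph/mon/ceph-a --extract-monmap /tmp/monmap
monmaptool --print /tmp/monmap           # inspect which mons are recorded
monmaptool --rm b --rm c /tmp/monmap     # drop mons that no longer exist (example names)
ceph-mon -i a --mon-data /var/lib/ceph/mon/ceph-a --inject-monmap /tmp/monmap
```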
Current state after all recovery attempts
- Rook operator, manager, monitors, OSD pods, and all agents are ready, without fundamental errors.
- All OSDs are `in` but `down`.
- Pools are found, but all PGs are `unknown`!
`ceph -s`:

```
  cluster:
    id:     .....
    health: HEALTH_WARN
            nodown,noout,norebalance flag(s) set
            Reduced data availability: 64 pgs inactive
            33 slow ops, oldest one blocked for 79746 sec, mon.a has slow ops

  services:
    mon: 1 daemons, quorum a (age 22h)
    mgr: a(active, since 22h)
    osd: 33 osds: 0 up, 33 in (since 22h)
         flags nodown,noout,norebalance

  data:
    pools:   2 pools, 64 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             64 unknown
```
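For anyone debugging a similar state, the checks below are generic Ceph commands, not anything specific to our cluster; clearing the recovery flags is only safe once the OSD daemons are actually expected to report in again.

```sh
# See how the OSDs are registered in the CRUSH map and confirm that none report up
ceph osd tree

# PGs stay unknown until some OSD reports them; list the stuck/inactive ones
ceph pg dump_stuck inactive

# nodown/noout/norebalance were set during recovery; clear them when ready
ceph osd unset nodown
ceph osd unset noout
ceph osd unset norebalance
```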
Environment:
- OS: RancherOS v1.5.8
- Kernel: v4.14.138-rancher
- Cloud provider: Bare-metal and installed with RKE
- Kubernetes: v1.17
- Ceph: v14.2.9, updated to v16.2.4 during the recovery process
- Rook: v1.2.7, updated to v1.6.7 during the recovery process
Hi @AliMD,
I am having the same problem too… did you manage to recover your Ceph cluster?