rook: Ceph cluster lost and no recovery solution has worked. All OSDs are `in` but all PGs are `unknown`.
We’ve had a serious problem with our production cluster & we need urgent help. Thanks in advance.
Is this a bug report or feature request?
- Bug Report
Deviation from expected behavior: After updating Kubernetes and rebooting our servers, our OSDs stopped working. We tried our best to restore them, using every method we could find and checking related issues, documents, and configurations. We restored our Kubernetes snapshot and the etcd backup. Currently, all PGs show as `unknown`.
Expected behavior: OSDs should return to a normal state after an upgrade.
How to reproduce it (minimal and precise):
- Update Kubernetes from `v1.17` to `v1.18` and change all certificates; Rook and Ceph then become unstable!
- Restore etcd to the last stable backup of `v1.17` using `rke etcd snapshot-restore ...` and run `rke up` again (see the sketch after this list).
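For completeness, the restore flow looked roughly like this. It is only a sketch; the snapshot name and the `cluster.yml` path are placeholders, not our real values:

```sh
# Restore etcd from the last snapshot taken while the cluster was still on v1.17
# (the snapshot name below is a placeholder)
rke etcd snapshot-restore --config cluster.yml --name <v1.17-snapshot-name>

# Re-provision the cluster from the restored state
rke up --config cluster.yml
```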
Best solution we applied to restore the old Ceph cluster
- Start a new, clean Rook Ceph cluster with the old `CephCluster`, `CephBlockPool`, `CephFilesystem`, `CephNFS`, and `CephObjectStore` resources.
- Shut the new cluster down when it has been created successfully.
- Replace ceph-mon data with that of the old cluster.
- Replace the `fsid` in `secrets/rook-ceph-mon` with that of the old cluster.
- Fix the monmap in the ceph-mon db.
- Fix ceph mon auth key.
- Disable auth.
- Start the new cluster and watch it resurrect (a rough sketch of the `fsid` and monmap fixes follows below). Reference: https://rook.github.io/docs/rook/v1.6/ceph-disaster-recovery.html
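For reference, the `fsid` and monmap steps above looked roughly like this in our case. This is only a sketch following the linked guide: the mon id `a`, the removed mon names, the mon data path, and the `OLD_FSID` variable are illustrative, not exact values from our cluster.

```sh
# Put the old cluster's fsid into the rook-ceph-mon secret
# (OLD_FSID is a placeholder for the fsid recovered from the old cluster)
kubectl -n rook-ceph patch secret rook-ceph-mon \
  -p '{"stringData": {"fsid": "'"${OLD_FSID}"'"}}'

# Inside the mon pod (or a debug pod with the mon data mounted),
# rewrite the monmap so that it only contains the surviving mon "a"
ceph-mon -i a --mon-data /var/lib/ceph/mon/ceph-a --extract-monmap /tmp/monmap
monmaptool --print /tmp/monmap           # inspect which mons are recorded
monmaptool --rm b --rm c /tmp/monmap     # drop mons that no longer exist (example names)
ceph-mon -i a --mon-data /var/lib/ceph/mon/ceph-a --inject-monmap /tmp/monmap
```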
Current state after all recovery attempts
- Rook operator, manager, monitors, OSD pods, and all agents are ready, without fundamental errors.
- All OSDs are `in` but `down`.
- Pools are found, but all PGs are `unknown`!
`ceph -s`:

```
  cluster:
    id:     .....
    health: HEALTH_WARN
            nodown,noout,norebalance flag(s) set
            Reduced data availability: 64 pgs inactive
            33 slow ops, oldest one blocked for 79746 sec, mon.a has slow ops

  services:
    mon: 1 daemons, quorum a (age 22h)
    mgr: a(active, since 22h)
    osd: 33 osds: 0 up, 33 in (since 22h)
         flags nodown,noout,norebalance

  data:
    pools:   2 pools, 64 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             64 unknown
```
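For anyone debugging a similar state, the checks below are generic Ceph commands, not anything specific to our cluster; clearing the recovery flags is only safe once the OSD daemons are actually expected to report in again.

```sh
# See how the OSDs are registered in the CRUSH map and confirm that none report up
ceph osd tree

# PGs stay unknown until some OSD reports them; list the stuck/inactive ones
ceph pg dump_stuck inactive

# nodown/noout/norebalance were set during recovery; clear them when ready
ceph osd unset nodown
ceph osd unset noout
ceph osd unset norebalance
```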
Environment:
- OS: RancherOS v1.5.8
- Kernel: v4.14.138-rancher
- Cloud provider: Bare-metal and installed with RKE
- Kubernetes: v1.17
- Ceph: v14.2.9, updated to v16.2.4 during the recovery process
- Rook: v1.2.7, updated to v1.6.7 during the recovery process
Hi @AliMD,
I am having the same problem too… did you manage to recover your Ceph cluster?