rook: manually repair an OSD that stays down after a k8s node restart

I am facing a problem where an OSD goes down after any k8s node is restarted. The OSD is supposed to come back up on its own, but that is not happening for me. The problem is described in more detail here: https://github.com/rook/rook/issues/1278

Here I want to discuss possible ways to bring the OSD back up manually.

ceph osd tree
ID  CLASS WEIGHT  TYPE NAME           STATUS REWEIGHT PRI-AFF
 -1       0.46196 root default
 -2       0.09239     host 10-1-29-31
  1   hdd 0.09239         osd.1           up  1.00000 1.00000
-11       0.09239     host 10-1-29-32
  4   hdd 0.09239         osd.4           up  1.00000 1.00000
 -3       0.09239     host 10-1-29-33
  0   hdd 0.09239         osd.0           up  1.00000 1.00000
 -9       0.09239     host 10-1-29-34
  2   hdd 0.09239         osd.2         down        0 1.00000
 -4       0.09239     host 10-1-29-35
  3   hdd 0.09239         osd.3           up  1.00000 1.00000
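
In the tree above, osd.2 on host 10-1-29-34 is down and its reweight is 0, i.e. it has already been marked out automatically. Before attempting any repair it can help to check how much data is affected while it is out; a minimal sketch, run from the rook toolbox pod or any shell with the admin keyring:

# overall cluster state; degraded/undersized PGs mean data is waiting on osd.2
ceph -s
# per-PG detail for anything that is not active+clean
ceph health detail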

I have tried the following methods found in the documentation:

ceph osd repair osd.2
Error EAGAIN: osd.2 is not up
ceph osd up osd.2
no valid command found; 10 closest matches:
osd count-metadata <property>
osd versions
osd find <osdname (id|osd.id)>
osd metadata {<osdname (id|osd.id)>}
osd getmaxosd
osd ls-tree {<int[0-]>} {<name>}
osd getmap {<int[0-]>}
osd getcrushmap {<int[0-]>}
osd tree {<int[0-]>} {up|down|in|out|destroyed [up|down|in|out|destroyed...]}
osd ls {<int[0-]>}
Error EINVAL: invalid command
ceph-volume simple activate osd.2
RuntimeError: Expected JSON config path not found: /etc/ceph/osd/osd.2-None.json

All of them lead to errors and do not give the expected result. Maybe I'm moving in the wrong direction?
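
Since rook runs each OSD inside its own pod rather than as a host-level service, running ceph-volume outside that pod will generally not find the JSON config it expects under /etc/ceph/osd, which is likely why the last command fails. It is usually more useful to look at the OSD pod itself first; a minimal sketch, assuming the default rook-ceph namespace and the standard app=rook-ceph-osd label (both may differ in your deployment):

# list OSD pods and check whether the one backing osd.2 is running at all
kubectl -n rook-ceph get pods -l app=rook-ceph-osd -o wide
# inspect its logs for the reason it stays down (pod name is deployment-specific)
kubectl -n rook-ceph logs <osd-2-pod-name>
# ask ceph where osd.2 was last seen (this command appears in the matches above)
ceph osd find 2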

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 22 (16 by maintainers)

Most upvoted comments

We think we may have found a manual recovery method.

  1. Spin up a fresh 3-OSD cluster
  2. Down a k8s node long enough for the related OSD to auto-out
  3. Bring the node back up and wait for it to reschedule all pods
  4. ceph osd rm the offending OSD
  5. Kill the pod which was running the offending OSD

The pod will be rescheduled and almost immediately the OSD will be UP and IN. Cluster remains fully RW to radosgw traffic during this entire process. This has only been tested with Replicated data pools so far, testing EC pools next.
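
In command form, steps 4 and 5 of the list above come down to something like the following; a sketch assuming osd.2 is the auto-outed OSD and the default rook-ceph namespace (the pod name will differ per deployment):

# step 4: remove the down+out OSD from the osdmap so it can re-register cleanly
ceph osd rm osd.2
# step 5: delete the pod that was running osd.2; rook reschedules it and the OSD rejoins
kubectl -n rook-ceph delete pod <rook-ceph-osd-2-pod-name>
# verify it comes back up and in
ceph osd tree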

This is less than ideal, as it requires manual intervention each time an OSD auto-outs. We would definitely like to work with the team to share any information we’ve found to help fix this.

Since manual OSD recovery is now possible, this issue can be closed.