rook: manually repair OSD after rook cluster fails after k8s node restart
I'm running into a problem where an OSD goes down after any k8s node restart. The OSD is supposed to come back up on its own, but for me it does not. The problem is described in more detail here: https://github.com/rook/rook/issues/1278
Here I'd like to discuss possible ways to bring the OSD back up manually.
ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.46196 root default
-2 0.09239 host 10-1-29-31
1 hdd 0.09239 osd.1 up 1.00000 1.00000
-11 0.09239 host 10-1-29-32
4 hdd 0.09239 osd.4 up 1.00000 1.00000
-3 0.09239 host 10-1-29-33
0 hdd 0.09239 osd.0 up 1.00000 1.00000
-9 0.09239 host 10-1-29-34
2 hdd 0.09239 osd.2 down 0 1.00000
-4 0.09239 host 10-1-29-35
3 hdd 0.09239 osd.3 up 1.00000 1.00000
I have tried the following methods found in the documentation:
ceph osd repair osd.2
Error EAGAIN: osd.2 is not up
ceph osd up osd.2
no valid command found; 10 closest matches:
osd count-metadata <property>
osd versions
osd find <osdname (id|osd.id)>
osd metadata {<osdname (id|osd.id)>}
osd getmaxosd
osd ls-tree {<int[0-]>} {<name>}
osd getmap {<int[0-]>}
osd getcrushmap {<int[0-]>}
osd tree {<int[0-]>} {up|down|in|out|destroyed [up|down|in|out|destroyed...]}
osd ls {<int[0-]>}
Error EINVAL: invalid command
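There is no `ceph osd up` command: an OSD is only reported up while its daemon is actually running, so the real fix is to get the daemon (in rook, the OSD pod) running again. Before touching anything, the read-only commands listed in the help output above can at least narrow down where the down OSD lives. A minimal sketch, assuming osd.2 is the down OSD from the tree above; each call is skipped if the ceph CLI is not on PATH:

```shell
OSD_ID=2    # the down OSD from `ceph osd tree`

if command -v ceph >/dev/null; then
    ceph osd tree down            # show only the OSDs that are currently down
    ceph osd find "$OSD_ID"       # host and CRUSH location holding osd.2
    ceph osd metadata "$OSD_ID"   # daemon metadata (may be empty while down)
fi
```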
ceph-volume simple activate osd.2
--> RuntimeError: Expected JSON config path not found: /etc/ceph/osd/osd.2-None.json
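The error points at a missing JSON file: `ceph-volume simple activate` reads a JSON description of the OSD that `ceph-volume simple scan` is supposed to have written to /etc/ceph/osd/ first (the `None` in the filename suggests the OSD fsid was never captured). A sketch of running the scan first, assuming the OSD data directory is at the default layout (adjust the path for your cluster); skipped when ceph-volume is not on PATH:

```shell
OSD_ID=2
OSD_DIR="/var/lib/ceph/osd/ceph-${OSD_ID}"   # assumption: default data-dir layout

if command -v ceph-volume >/dev/null; then
    # writes /etc/ceph/osd/<id>-<fsid>.json describing the OSD
    ceph-volume simple scan "$OSD_DIR"
    # activate from the generated file rather than guessing id/fsid by hand
    ceph-volume simple activate --file /etc/ceph/osd/${OSD_ID}-*.json
fi
```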
All of them lead to errors and don't give the expected result. Maybe I'm moving in the wrong direction?
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 22 (16 by maintainers)
We think we may have found a manual recovery method.
The pod will be rescheduled and almost immediately the OSD will be UP and IN. Cluster remains fully RW to radosgw traffic during this entire process. This has only been tested with Replicated data pools so far, testing EC pools next.
This is less than ideal, as it requires manual intervention each time an OSD auto-outs. We would definitely like to work with the team to share any information we’ve found to help fix this.
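The "delete the pod and let it be rescheduled" step described above can be sketched as follows. The namespace and label names are assumptions (verify yours with `kubectl get pods --show-labels`); the calls are skipped when kubectl is not on PATH:

```shell
NS=rook-ceph   # assumption: default rook namespace
OSD_ID=2

if command -v kubectl >/dev/null; then
    # delete the OSD pod; the operator reschedules it
    kubectl -n "$NS" delete pod -l "app=rook-ceph-osd,ceph-osd-id=${OSD_ID}"
    # confirm the replacement pod came back
    kubectl -n "$NS" get pods -l app=rook-ceph-osd
fi
```

Once the replacement pod is Running, `ceph osd tree` should show the OSD back up and in.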
Since manual OSD recovery is possible, this issue can be closed.