rook: Two OSDs on same node have their IDs mixed and cannot start
Is this a bug report or feature request?
- Bug Report
Deviation from expected behavior: I found 2 OSDs in CrashLoopBackOff on our Rook 1.6.0 / Ceph 15.2.10 cluster: each one apparently thinks it has the other's OSD ID. Both OSDs are on the same node.
❯ k logs rook-ceph-osd-13-5fd7c5f97-hmhct -p
debug 2021-05-16T15:37:22.171+0000 7fd734fa3f40 0 set uid:gid to 167:167 (ceph:ceph)
debug 2021-05-16T15:37:22.171+0000 7fd734fa3f40 0 ceph version 15.2.10 (27917a557cca91e4da407489bbaa64ad4352cc02) octopus (stable), process ceph-osd, pid 1
debug 2021-05-16T15:37:22.171+0000 7fd734fa3f40 0 pidfile_write: ignore empty --pid-file
debug 2021-05-16T15:37:22.188+0000 7fd734fa3f40 -1 OSD id 7 != my id 13
❯ k logs rook-ceph-osd-7-654b8cdd4-t8b9q
debug 2021-05-16T15:35:22.827+0000 7f93da0f0f40 0 set uid:gid to 167:167 (ceph:ceph)
debug 2021-05-16T15:35:22.827+0000 7f93da0f0f40 0 ceph version 15.2.10 (27917a557cca91e4da407489bbaa64ad4352cc02) octopus (stable), process ceph-osd, pid 1
debug 2021-05-16T15:35:22.827+0000 7f93da0f0f40 0 pidfile_write: ignore empty --pid-file
debug 2021-05-16T15:35:22.850+0000 7f93da0f0f40 -1 OSD id 13 != my id 7
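One way to confirm which device each pod was actually handed is to inspect the OSD deployments directly (a minimal sketch, not from the original report; the rook-ceph namespace and the assumption that the device path appears somewhere in the deployment spec may differ by Rook version):
# Dump each OSD deployment and look for the block device path Rook passed in.
kubectl -n rook-ceph get deployment rook-ceph-osd-7 -o yaml | grep -n '/dev/sd'
kubectl -n rook-ceph get deployment rook-ceph-osd-13 -o yaml | grep -n '/dev/sd'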
I had a look at the mounts configured on both pods, and they seem to be the wrong way around: the OSD 7 pod mounts /dev/sdc and the OSD 13 pod mounts /dev/sdd on the same node. But the ceph-osd-prepare job which ran on that node shows:
2021-05-17 02:41:22.992328 D | cephosd: {
"13": {
"ceph_fsid": "3ed7d3a5-6498-43cf-bca3-456fa670d83b",
"device": "/dev/sdc",
"osd_id": 13,
"osd_uuid": "b4027cff-0da6-464f-91c6-c06d6a569e5a",
"type": "bluestore"
},
"7": {
"ceph_fsid": "3ed7d3a5-6498-43cf-bca3-456fa670d83b",
"device": "/dev/sdd",
"osd_id": 7,
"osd_uuid": "285e698f-651c-4a2c-9314-bf7c23d128cc",
"type": "bluestore"
}
}
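To double-check which physical disk actually carries which OSD, independent of what the pods mount, the BlueStore label can be read straight off the device (a hedged sketch; it assumes ceph-bluestore-tool is reachable on the node or inside a privileged container with the host's /dev visible):
# Read the on-disk BlueStore label; the "whoami" and "osd_uuid" fields identify
# the OSD that was prepared on this device, regardless of what the pod mounts.
ceph-bluestore-tool show-label --dev /dev/sdc
ceph-bluestore-tool show-label --dev /dev/sdd
Comparing those labels against the prepare-job output above shows which /dev/sdX name currently belongs to which OSD.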
Both OSDs used to work just fine but then the mounts got twisted.
Expected behavior: OSDs work as normal
Environment:
- OS (e.g. from /etc/os-release): flatcar 2605.5.0 (Oklo)
- Kernel (e.g. uname -a): 5.4.66-flatcar
- Cloud provider or hardware configuration: bare metal
- Rook version (use rook version inside of a Rook Pod): 1.6.0
- Storage backend version (e.g. for ceph do ceph -v): 15.2.10
- Kubernetes version (use kubectl version): 1.15.11
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): vanilla
- Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): HEALTH_OK
About this issue
- State: closed
- Created 3 years ago
- Comments: 17 (13 by maintainers)
Commits related to this issue
- ceph: scan raw OSDs on nodes in OSD init container The 'ceph-volume raw list' command used to list OSDs on nodes by the rook-ceph-osd-prepare- jobs return user-friendly device names (e.g., /dev/sda) ... — committed to BlaineEXE/rook by BlaineEXE 3 years ago
- The same commit also appears in rook/rook, henryzhangsta/rook, and subhamkrai/rook (all committed by BlaineEXE 3 years ago).
Hey, I’m still seeing this issue with the latest version of Rook (v1.10.3).
The Kubernetes cluster was offline for a bit and has not been able to come back up cleanly.
Some OSDs came online, but others did not and still won't, even after restarting the nodes a few times.
The devices are explicitly requested by /dev/disk/by-id/scsi-*, but I understand Rook does not really care about this and translates it to an unstable /dev/sdX: https://github.com/uhthomas/automata/blob/ce6ae68d71b90388d4b651911f7f2c0ee6858ca2/k8s/pillowtalk/rook_ceph/ceph_cluster_list.cue#L245-L272
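To illustrate the translation, a by-id path can be resolved to whatever kernel name it points at on the current boot (a minimal sketch; the by-id name below is a hypothetical placeholder, not one from this cluster):
# Resolve a stable by-id path to the kernel name it maps to right now.
# The /dev/sdX result can change across reboots, which is exactly the problem.
readlink -f /dev/disk/by-id/scsi-360000000000000000e00000000000001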
Frustratingly, the devices are all there. I can manually verify they exist, and the OSD pods do find devices, but those devices belong to different OSDs.
It seems ceph-volume raw list is not returning the right info sometimes, which means the code implemented to fix this does not work. See for example the log output for OSD pods 0 and 2:
OSD 0
OSD 2
So they are finding each other's OSDs, and then the attempt to find the actual device elsewhere returns irrelevant data.
Any help would be greatly appreciated.
Planning to include this in v1.6.3 later today…
I see, the raw mode change does explain the fundamental issue, thanks.
One part of the issue is the recent change to update OSDs in parallel. Before we updated OSDs in parallel, any time the osd-prepare job ran, it would update the OSD deployments based on the latest prepare-job output, which would fix any issues caused by disks having different names after a reboot.
However, that still would not have prevented the OSDs from failing for a time after the node was rebooted. I wonder if ceph-volume's raw mode behavior differs from the lvm mode behavior; I am currently investigating the differences in lvm mode output.
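As a hedged workaround sketch (not taken from this thread), forcing the operator to reconcile makes the osd-prepare jobs run again; whether their output is then applied to existing OSD deployments depends on the Rook version and the parallel-update behavior described above:
# Restart the operator to trigger a reconcile and re-run the prepare jobs.
# Labels are the standard Rook ones; the rook-ceph namespace is an assumption.
kubectl -n rook-ceph delete pod -l app=rook-ceph-operator
kubectl -n rook-ceph get jobs -l app=rook-ceph-osd-prepare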
[Update] ceph-volume in lvm mode outputs disks as /dev/mapper/MbEfX7-3xJF-R4Jm-Csz2-wfQ3-yGeq-VoXTwz devices, which according to this page are consistent.
The root cause of this problem is that ceph-volume raw list does not output consistent device names (rather the user-friendly names). This shows up because we switched to using raw mode in v1.6. It is further exacerbated because Rook cannot automatically recover from these types of failures, due to the way we changed the OSD update process to update in parallel.
The tools container I executed ceph-volume in was running on the node the OSDs are on and was privileged. On the node itself I do not have the ceph-volume binary available. Since we're using a very basic image for the nodes (Flatcar), I was not able to run ceph-volume directly on the node. I tried to use the rook image as https://github.com/kinvolk/toolbox, but that results in the same empty result for "ceph-volume raw list", giving "{}".
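For anyone hitting the same mismatch, one hedged way to map the user-friendly names back to stable identifiers from the node itself, without ceph-volume (standard util-linux tools; device names taken from this report):
# Correlate the unstable kernel names with stable hardware identifiers, so the
# prepare-job output above can be matched to the right physical disk even after
# /dev/sdX assignments shuffle across a reboot.
lsblk -o NAME,SERIAL,WWN,SIZE,MODEL /dev/sdc /dev/sdd
ls -l /dev/disk/by-id/ | grep -E 'sdc|sdd'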
Thanks, I attached logs and manifest here: rook.zip.