rook: Two OSDs on same node have their IDs mixed and cannot start
Is this a bug report or feature request?
- Bug Report
Deviation from expected behavior: I found 2 OSDs in CrashLoopBackOff on our Rook 1.6.0 / Ceph 15.2.10 cluster: each one apparently thinks it has the other's OSD ID. Both OSDs are on the same node.
❯ k logs rook-ceph-osd-13-5fd7c5f97-hmhct -p
debug 2021-05-16T15:37:22.171+0000 7fd734fa3f40 0 set uid:gid to 167:167 (ceph:ceph)
debug 2021-05-16T15:37:22.171+0000 7fd734fa3f40 0 ceph version 15.2.10 (27917a557cca91e4da407489bbaa64ad4352cc02) octopus (stable), process ceph-osd, pid 1
debug 2021-05-16T15:37:22.171+0000 7fd734fa3f40 0 pidfile_write: ignore empty --pid-file
debug 2021-05-16T15:37:22.188+0000 7fd734fa3f40 -1 OSD id 7 != my id 13
❯ k logs rook-ceph-osd-7-654b8cdd4-t8b9q
debug 2021-05-16T15:35:22.827+0000 7f93da0f0f40 0 set uid:gid to 167:167 (ceph:ceph)
debug 2021-05-16T15:35:22.827+0000 7f93da0f0f40 0 ceph version 15.2.10 (27917a557cca91e4da407489bbaa64ad4352cc02) octopus (stable), process ceph-osd, pid 1
debug 2021-05-16T15:35:22.827+0000 7f93da0f0f40 0 pidfile_write: ignore empty --pid-file
debug 2021-05-16T15:35:22.850+0000 7f93da0f0f40 -1 OSD id 13 != my id 7
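One way to confirm which device each pod was actually handed is to inspect the OSD deployments directly (a minimal sketch, not from the original report; the rook-ceph namespace and the assumption that the device path appears somewhere in the deployment spec may differ by Rook version):
# Dump each OSD deployment and look for the block device path Rook passed in.
kubectl -n rook-ceph get deployment rook-ceph-osd-7 -o yaml | grep -n '/dev/sd'
kubectl -n rook-ceph get deployment rook-ceph-osd-13 -o yaml | grep -n '/dev/sd'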
I had a look at the mounts configured on both pods, and they seem to be the wrong way around: the OSD 7 pod mounts /dev/sdc and the OSD 13 pod mounts /dev/sdd on the same node. But the ceph-osd-prepare job which ran on that node shows:
2021-05-17 02:41:22.992328 D | cephosd: {
"13": {
"ceph_fsid": "3ed7d3a5-6498-43cf-bca3-456fa670d83b",
"device": "/dev/sdc",
"osd_id": 13,
"osd_uuid": "b4027cff-0da6-464f-91c6-c06d6a569e5a",
"type": "bluestore"
},
"7": {
"ceph_fsid": "3ed7d3a5-6498-43cf-bca3-456fa670d83b",
"device": "/dev/sdd",
"osd_id": 7,
"osd_uuid": "285e698f-651c-4a2c-9314-bf7c23d128cc",
"type": "bluestore"
}
}
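To double-check which physical disk actually carries which OSD, independent of what the pods mount, the BlueStore label can be read straight off the device (a hedged sketch; it assumes ceph-bluestore-tool is reachable on the node or inside a privileged container with the host's /dev visible):
# Read the on-disk BlueStore label; the "whoami" and "osd_uuid" fields identify
# the OSD that was prepared on this device, regardless of what the pod mounts.
ceph-bluestore-tool show-label --dev /dev/sdc
ceph-bluestore-tool show-label --dev /dev/sdd
Comparing those labels against the prepare-job output above shows which /dev/sdX name currently belongs to which OSD.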
Both OSDs used to work just fine but then the mounts got twisted.
Expected behavior: OSDs work as normal
Environment:
- OS (e.g. from /etc/os-release): flatcar 2605.5.0 (Oklo)
- Kernel (e.g. uname -a): 5.4.66-flatcar
- Cloud provider or hardware configuration: bare metal
- Rook version (use rook version inside of a Rook Pod): 1.6.0
- Storage backend version (e.g. for ceph do ceph -v): 15.2.10
- Kubernetes version (use kubectl version): 1.15.11
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): vanilla
- Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): HEALTH_OK
About this issue
- State: closed
- Created 3 years ago
- Comments: 17 (13 by maintainers)
Commits related to this issue
- ceph: scan raw OSDs on nodes in OSD init container The 'ceph-volume raw list' command used to list OSDs on nodes by the rook-ceph-osd-prepare- jobs return user-friendly device names (e.g., /dev/sda) ... — committed to BlaineEXE/rook by BlaineEXE 3 years ago
- The same commit also appears in rook/rook, henryzhangsta/rook, and subhamkrai/rook (all committed by BlaineEXE 3 years ago).
Hey, I’m still seeing this issue with the latest version of Rook (v1.10.3).
The Kubernetes cluster was offline for a bit and has not been able to come back up cleanly.
Some OSDs came online, but others did not and still won't, even after restarting the nodes a few times.
The devices are explicitly requested by /dev/disk/by-id/scsi-*, but I understand Rook does not really care about this and translates it to an unstable /dev/sdX: https://github.com/uhthomas/automata/blob/ce6ae68d71b90388d4b651911f7f2c0ee6858ca2/k8s/pillowtalk/rook_ceph/ceph_cluster_list.cue#L245-L272
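To illustrate the translation, a by-id path can be resolved to whatever kernel name it points at on the current boot (a minimal sketch; the by-id name below is a hypothetical placeholder, not one from this cluster):
# Resolve a stable by-id path to the kernel name it maps to right now.
# The /dev/sdX result can change across reboots, which is exactly the problem.
readlink -f /dev/disk/by-id/scsi-360000000000000000e00000000000001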
Frustratingly, the devices are all there. I can manually verify they exist, and the OSD pods do find devices, but those devices belong to different OSDs.
It seems ceph-volume raw list is not returning the right info sometimes, which means the code implemented to fix this does not work. See for example the log output for OSD pods 0 and 2:
OSD 0
OSD 2
So they are finding each other's OSDs, and then the attempt to find the actual device elsewhere returns irrelevant data.
Any help would be greatly appreciated.
Planning to include this in v1.6.3 later today…
I see, the raw mode change does explain the fundamental issue, thanks.
One part of the issue is the recent change to update OSDs in parallel. Before we updated OSDs in parallel, any time the osd-prepare job ran, it would update the OSD deployments based on the latest prepare-job output, which would fix any issues caused by disks having different names after a reboot.
However, that still would not have prevented the OSDs from failing for a time after the node was rebooted. I wonder if ceph-volume's raw mode behavior differs from the lvm mode behavior; I am currently investigating the differences in lvm mode output.
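As a hedged workaround sketch (not taken from this thread), forcing the operator to reconcile makes the osd-prepare jobs run again; whether their output is then applied to existing OSD deployments depends on the Rook version and the parallel-update behavior described above:
# Restart the operator to trigger a reconcile and re-run the prepare jobs.
# Labels are the standard Rook ones; the rook-ceph namespace is an assumption.
kubectl -n rook-ceph delete pod -l app=rook-ceph-operator
kubectl -n rook-ceph get jobs -l app=rook-ceph-osd-prepare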
[Update] ceph-volume in lvm mode outputs disks as /dev/mapper/MbEfX7-3xJF-R4Jm-Csz2-wfQ3-yGeq-VoXTwz devices, which according to this page are consistent.
The root cause of this problem is that ceph-volume raw list does not output consistent device names (rather the user-friendly names). This shows up because we switched to using raw mode in v1.6. It is further exacerbated because Rook cannot automatically recover from these types of failures, due to the way we changed the OSD update process to update in parallel.
The tools container I executed ceph-volume in was running on the node the OSDs are on and was privileged. On the node itself I do not have the ceph-volume binary available. Since we're using a very basic image for the nodes (Flatcar), I was not able to run ceph-volume directly on the node. I tried to use the rook image as https://github.com/kinvolk/toolbox, but that results in the same empty result for "ceph-volume raw list", giving "{}".
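For anyone hitting the same mismatch, one hedged way to map the user-friendly names back to stable identifiers from the node itself, without ceph-volume (standard util-linux tools; device names taken from this report):
# Correlate the unstable kernel names with stable hardware identifiers, so the
# prepare-job output above can be matched to the right physical disk even after
# /dev/sdX assignments shuffle across a reboot.
lsblk -o NAME,SERIAL,WWN,SIZE,MODEL /dev/sdc /dev/sdd
ls -l /dev/disk/by-id/ | grep -E 'sdc|sdd'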
Thanks, I attached logs and manifest here: rook.zip.