rook: Two OSDs on the same node have their IDs mixed up and cannot start

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior: I found 2 OSDs in CrashLoopBackOff on our Rook 1.6.0 / Ceph 15.2.10 cluster: each apparently thinks it has the OSD ID of the other. Both OSDs are on the same node.

❯ k logs rook-ceph-osd-13-5fd7c5f97-hmhct -p
debug 2021-05-16T15:37:22.171+0000 7fd734fa3f40  0 set uid:gid to 167:167 (ceph:ceph)
debug 2021-05-16T15:37:22.171+0000 7fd734fa3f40  0 ceph version 15.2.10 (27917a557cca91e4da407489bbaa64ad4352cc02) octopus (stable), process ceph-osd, pid 1
debug 2021-05-16T15:37:22.171+0000 7fd734fa3f40  0 pidfile_write: ignore empty --pid-file
debug 2021-05-16T15:37:22.188+0000 7fd734fa3f40 -1 OSD id 7 != my id 13
❯ k logs rook-ceph-osd-7-654b8cdd4-t8b9q
debug 2021-05-16T15:35:22.827+0000 7f93da0f0f40  0 set uid:gid to 167:167 (ceph:ceph)
debug 2021-05-16T15:35:22.827+0000 7f93da0f0f40  0 ceph version 15.2.10 (27917a557cca91e4da407489bbaa64ad4352cc02) octopus (stable), process ceph-osd, pid 1
debug 2021-05-16T15:35:22.827+0000 7f93da0f0f40  0 pidfile_write: ignore empty --pid-file
debug 2021-05-16T15:35:22.850+0000 7f93da0f0f40 -1 OSD id 13 != my id 7

I had a look at the mounts configured on both pods and they seem to be the wrong way around: the OSD 7 pod mounts /dev/sdc and the OSD 13 pod mounts /dev/sdd on the same node. But the ceph-osd-prepare job which ran on that node shows:

2021-05-17 02:41:22.992328 D | cephosd: {
    "13": {
        "ceph_fsid": "3ed7d3a5-6498-43cf-bca3-456fa670d83b",
        "device": "/dev/sdc",
        "osd_id": 13,
        "osd_uuid": "b4027cff-0da6-464f-91c6-c06d6a569e5a",
        "type": "bluestore"
    },
    "7": {
        "ceph_fsid": "3ed7d3a5-6498-43cf-bca3-456fa670d83b",
        "device": "/dev/sdd",
        "osd_id": 7,
        "osd_uuid": "285e698f-651c-4a2c-9314-bf7c23d128cc",
        "type": "bluestore"
    }
}

Both OSDs used to work just fine, but at some point the mounts got swapped around.
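For anyone hitting the same symptom, a quick way to confirm the swap is to compare what the Ceph cluster records for each OSD with what the kernel currently calls the disks. A minimal sketch (run in the rook-ceph toolbox and on the affected node; the OSD IDs and device names are taken from this report, and the exact metadata field names can vary between Ceph releases):

# What does the cluster think osd.13 and osd.7 are backed by?
ceph osd metadata 13 | grep -E '"devices"|"device_ids"'
ceph osd metadata 7  | grep -E '"devices"|"device_ids"'

# What does the node currently call those physical disks?
lsblk -o NAME,SERIAL,MODEL /dev/sdc /dev/sdd

If the serial recorded for osd.13 now belongs to the disk the osd.7 pod mounts (and vice versa), the kernel simply renamed the devices across a reboot and the OSD deployments are pointing at stale names.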

Expected behavior: OSDs work as normal

Environment:

  • OS (e.g. from /etc/os-release): flatcar 2605.5.0 (Oklo)
  • Kernel (e.g. uname -a): 5.4.66-flatcar
  • Cloud provider or hardware configuration: bare metal
  • Rook version (use rook version inside of a Rook Pod): 1.6.0
  • Storage backend version (e.g. for ceph do ceph -v): 15.2.10
  • Kubernetes version (use kubectl version): 1.15.11
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): vanilla
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): HEALTH_OK

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 17 (13 by maintainers)

Most upvoted comments

Hey, I’m still seeing this issue with the latest version of Rook (v1.10.3).

The Kubernetes cluster was offline for a bit and is now unable to come back up fully.

❯ k -n rook-ceph exec -it rook-ceph-tools-c649479d9-vr77k -- /bin/bash
bash-4.4$ ceph osd tree
ID  CLASS  WEIGHT    TYPE NAME            STATUS  REWEIGHT  PRI-AFF
-1         27.29013  root default
-5          9.09668      host 13a0a37008
 1    hdd   5.45799          osd.1            up   1.00000  1.00000
 3    hdd   3.63869          osd.3          down         0  1.00000
-7          9.09677      host 38aab880f5
 4    hdd   3.63869          osd.4            up   1.00000  1.00000
 5    hdd   3.63869          osd.5            up   1.00000  1.00000
 6    hdd   1.81940          osd.6          down         0  1.00000
-3          9.09668      host 6151c4656b
 0    hdd   3.63869          osd.0          down         0  1.00000
 2    hdd   5.45799          osd.2          down   1.00000  1.00000

Some OSDs came online, but others did not and still won’t, even after restarting the nodes a few times.

The devices are explicitly requested by /dev/disk/by-id/scsi-*, but I understand Rook does not really preserve this and translates it to an unstable /dev/sdX name.

https://github.com/uhthomas/automata/blob/ce6ae68d71b90388d4b651911f7f2c0ee6858ca2/k8s/pillowtalk/rook_ceph/ceph_cluster_list.cue#L245-L272

Frustratingly, the devices are all there. I can manually verify they exist, and the OSD pods do find valid devices, but the devices they find belong to different OSDs.
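To make the instability concrete: the by-id symlinks themselves are stable, but the /dev/sdX node they resolve to can change from boot to boot. A quick check (the scsi ID below is hypothetical; substitute one from the cluster spec):

ls -l /dev/disk/by-id/ | grep -E 'sdb|sdc'
readlink -f /dev/disk/by-id/scsi-SOME_DISK_ID   # may resolve to /dev/sdb after one boot and /dev/sdc after the next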

It seems ceph-volume raw list is not returning the right info sometimes, which means the code implemented to fix this does not work.

See for example the log output for OSD pods 0 and 2:

OSD 0
❯ k -n rook-ceph logs rook-ceph-osd-0-8484c77d6-mnk82 --all-containers
+ OSD_ID=0
+ OSD_UUID=a284defb-ef8d-4e31-a692-4112a55cab6c
+ OSD_STORE_FLAG=--bluestore
+ OSD_DATA_DIR=/var/lib/ceph/osd/ceph-0
+ CV_MODE=raw
+ DEVICE=/dev/sdc
+ [[ raw == \l\v\m ]]
++ mktemp
+ OSD_LIST=/tmp/tmp.e4ww3g51Tw
+ ceph-volume raw list /dev/sdc
+ cat /tmp/tmp.e4ww3g51Tw
{
    "2": {
        "ceph_fsid": "08c9912c-1b8d-4660-bb65-9ee764562910",
        "device": "/dev/sdc",
        "osd_id": 2,
        "osd_uuid": "90c99c56-1034-414a-aa69-f886e958b69e",
        "type": "bluestore"
    }
}
+ find_device
+ python3 -c '
import sys, json
for _, info in json.load(sys.stdin).items():
	if info['\''osd_id'\''] == 0:
		print(info['\''device'\''], end='\'''\'')
		print('\''found device: '\'' + info['\''device'\''], file=sys.stderr) # log the disk we found to stderr
		sys.exit(0)  # don'\''t keep processing once the disk is found
sys.exit('\''no disk found with OSD ID 0'\'')
'
no disk found with OSD ID 0
+ ceph-volume raw list
+ cat /tmp/tmp.e4ww3g51Tw
{
    "12": {
        "ceph_fsid": "bf5f1066-faff-459d-a259-6f17d63c1925",
        "device": "/dev/sdb2",
        "osd_id": 12,
        "osd_uuid": "e906f5b4-20cd-4170-939a-681f95ec4679",
        "type": "bluestore"
    },
    "8": {
        "ceph_fsid": "bf5f1066-faff-459d-a259-6f17d63c1925",
        "device": "/dev/sdc2",
        "osd_id": 8,
        "osd_uuid": "8c5ac8b3-4d0c-4340-85ab-86080780b5a5",
        "type": "bluestore"
    }
}
++ find_device
++ python3 -c '
import sys, json
for _, info in json.load(sys.stdin).items():
	if info['\''osd_id'\''] == 0:
		print(info['\''device'\''], end='\'''\'')
		print('\''found device: '\'' + info['\''device'\''], file=sys.stderr) # log the disk we found to stderr
		sys.exit(0)  # don'\''t keep processing once the disk is found
sys.exit('\''no disk found with OSD ID 0'\'')
'
no disk found with OSD ID 0
+ DEVICE=
+ OSD_ID=0
+ OSD_UUID=a284defb-ef8d-4e31-a692-4112a55cab6c
+ OSD_STORE_FLAG=--bluestore
+ OSD_DATA_DIR=/var/lib/ceph/osd/ceph-0
+ CV_MODE=raw
+ DEVICE=/dev/sdc
+ [[ raw == \l\v\m ]]
++ mktemp
+ OSD_LIST=/tmp/tmp.zLAk8ZYhJT
+ ceph-volume raw list /dev/sdc
+ cat /tmp/tmp.zLAk8ZYhJT
{
    "2": {
        "ceph_fsid": "08c9912c-1b8d-4660-bb65-9ee764562910",
        "device": "/dev/sdc",
        "osd_id": 2,
        "osd_uuid": "90c99c56-1034-414a-aa69-f886e958b69e",
        "type": "bluestore"
    }
}
+ find_device
+ python3 -c '
import sys, json
for _, info in json.load(sys.stdin).items():
	if info['\''osd_id'\''] == 0:
		print(info['\''device'\''], end='\'''\'')
		print('\''found device: '\'' + info['\''device'\''], file=sys.stderr) # log the disk we found to stderr
		sys.exit(0)  # don'\''t keep processing once the disk is found
sys.exit('\''no disk found with OSD ID 0'\'')
'
no disk found with OSD ID 0
+ ceph-volume raw list
+ cat /tmp/tmp.zLAk8ZYhJT
{
    "12": {
        "ceph_fsid": "bf5f1066-faff-459d-a259-6f17d63c1925",
        "device": "/dev/sdb2",
        "osd_id": 12,
        "osd_uuid": "e906f5b4-20cd-4170-939a-681f95ec4679",
        "type": "bluestore"
    },
    "8": {
        "ceph_fsid": "bf5f1066-faff-459d-a259-6f17d63c1925",
        "device": "/dev/sdc2",
        "osd_id": 8,
        "osd_uuid": "8c5ac8b3-4d0c-4340-85ab-86080780b5a5",
        "type": "bluestore"
    }
}
++ find_device
++ python3 -c '
import sys, json
for _, info in json.load(sys.stdin).items():
	if info['\''osd_id'\''] == 0:
		print(info['\''device'\''], end='\'''\'')
		print('\''found device: '\'' + info['\''device'\''], file=sys.stderr) # log the disk we found to stderr
		sys.exit(0)  # don'\''t keep processing once the disk is found
sys.exit('\''no disk found with OSD ID 0'\'')
'
no disk found with OSD ID 0
+ DEVICE=
Error from server (BadRequest): container "chown-container-data-dir" in pod "rook-ceph-osd-0-8484c77d6-mnk82" is waiting to start: PodInitializing
OSD 2
❯ k -n rook-ceph logs rook-ceph-osd-2-67d97456cd-rt7rn --all-containers
+ OSD_ID=2
+ OSD_UUID=90c99c56-1034-414a-aa69-f886e958b69e
+ OSD_STORE_FLAG=--bluestore
+ OSD_DATA_DIR=/var/lib/ceph/osd/ceph-2
+ CV_MODE=raw
+ DEVICE=/dev/sdb
+ [[ raw == \l\v\m ]]
++ mktemp
+ OSD_LIST=/tmp/tmp.Hpnav7NVfG
+ ceph-volume raw list /dev/sdb
+ cat /tmp/tmp.Hpnav7NVfG
{
    "0": {
        "ceph_fsid": "08c9912c-1b8d-4660-bb65-9ee764562910",
        "device": "/dev/sdb",
        "osd_id": 0,
        "osd_uuid": "a284defb-ef8d-4e31-a692-4112a55cab6c",
        "type": "bluestore"
    }
}
+ find_device
+ python3 -c '
import sys, json
for _, info in json.load(sys.stdin).items():
	if info['\''osd_id'\''] == 2:
		print(info['\''device'\''], end='\'''\'')
		print('\''found device: '\'' + info['\''device'\''], file=sys.stderr) # log the disk we found to stderr
		sys.exit(0)  # don'\''t keep processing once the disk is found
sys.exit('\''no disk found with OSD ID 2'\'')
'
no disk found with OSD ID 2
+ ceph-volume raw list
+ cat /tmp/tmp.Hpnav7NVfG
{
    "12": {
        "ceph_fsid": "bf5f1066-faff-459d-a259-6f17d63c1925",
        "device": "/dev/sdb2",
        "osd_id": 12,
        "osd_uuid": "e906f5b4-20cd-4170-939a-681f95ec4679",
        "type": "bluestore"
    },
    "8": {
        "ceph_fsid": "bf5f1066-faff-459d-a259-6f17d63c1925",
        "device": "/dev/sdc2",
        "osd_id": 8,
        "osd_uuid": "8c5ac8b3-4d0c-4340-85ab-86080780b5a5",
        "type": "bluestore"
    }
}
++ find_device
++ python3 -c '
import sys, json
for _, info in json.load(sys.stdin).items():
	if info['\''osd_id'\''] == 2:
		print(info['\''device'\''], end='\'''\'')
		print('\''found device: '\'' + info['\''device'\''], file=sys.stderr) # log the disk we found to stderr
		sys.exit(0)  # don'\''t keep processing once the disk is found
sys.exit('\''no disk found with OSD ID 2'\'')
'
no disk found with OSD ID 2
+ DEVICE=
+ OSD_ID=2
+ OSD_UUID=90c99c56-1034-414a-aa69-f886e958b69e
+ OSD_STORE_FLAG=--bluestore
+ OSD_DATA_DIR=/var/lib/ceph/osd/ceph-2
+ CV_MODE=raw
+ DEVICE=/dev/sdb
+ [[ raw == \l\v\m ]]
++ mktemp
+ OSD_LIST=/tmp/tmp.aE0Ds8D9G4
+ ceph-volume raw list /dev/sdb
+ cat /tmp/tmp.aE0Ds8D9G4
{
    "0": {
        "ceph_fsid": "08c9912c-1b8d-4660-bb65-9ee764562910",
        "device": "/dev/sdb",
        "osd_id": 0,
        "osd_uuid": "a284defb-ef8d-4e31-a692-4112a55cab6c",
        "type": "bluestore"
    }
}
+ find_device
+ python3 -c '
import sys, json
for _, info in json.load(sys.stdin).items():
	if info['\''osd_id'\''] == 2:
		print(info['\''device'\''], end='\'''\'')
		print('\''found device: '\'' + info['\''device'\''], file=sys.stderr) # log the disk we found to stderr
		sys.exit(0)  # don'\''t keep processing once the disk is found
sys.exit('\''no disk found with OSD ID 2'\'')
'
no disk found with OSD ID 2
+ ceph-volume raw list
+ cat /tmp/tmp.aE0Ds8D9G4
{
    "12": {
        "ceph_fsid": "bf5f1066-faff-459d-a259-6f17d63c1925",
        "device": "/dev/sdb2",
        "osd_id": 12,
        "osd_uuid": "e906f5b4-20cd-4170-939a-681f95ec4679",
        "type": "bluestore"
    },
    "8": {
        "ceph_fsid": "bf5f1066-faff-459d-a259-6f17d63c1925",
        "device": "/dev/sdc2",
        "osd_id": 8,
        "osd_uuid": "8c5ac8b3-4d0c-4340-85ab-86080780b5a5",
        "type": "bluestore"
    }
}
++ find_device
++ python3 -c '
import sys, json
for _, info in json.load(sys.stdin).items():
	if info['\''osd_id'\''] == 2:
		print(info['\''device'\''], end='\'''\'')
		print('\''found device: '\'' + info['\''device'\''], file=sys.stderr) # log the disk we found to stderr
		sys.exit(0)  # don'\''t keep processing once the disk is found
sys.exit('\''no disk found with OSD ID 2'\'')
'
no disk found with OSD ID 2
+ DEVICE=
Error from server (BadRequest): container "chown-container-data-dir" in pod "rook-ceph-osd-2-67d97456cd-rt7rn" is waiting to start: PodInitializing

So, they are finding each other’s OSDs, and the fallback attempt to find the actual device elsewhere returns irrelevant data (OSDs from a different ceph_fsid).
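For reference, a rough sketch of a more robust lookup (not what Rook’s activation script actually does): scan each block device with ceph-volume raw list and match on the stable osd_uuid instead of trusting the /dev/sdX name baked into the deployment. It assumes, as the logs above show, that the per-device output is correct even when the name is stale; the UUID and the /dev/sd? glob below are illustrative only.

OSD_UUID=a284defb-ef8d-4e31-a692-4112a55cab6c   # UUID this pod was created for
for dev in /dev/sd?; do
    out=$(ceph-volume raw list "$dev" 2>/dev/null) || continue
    match=$(printf '%s' "$out" | python3 -c '
import sys, json
uuid = sys.argv[1]
try:
    data = json.load(sys.stdin)
except ValueError:
    sys.exit(1)                      # no/invalid JSON for this device
for info in data.values():
    if info.get("osd_uuid") == uuid:
        print(info["device"])        # the name this device currently has
        sys.exit(0)
sys.exit(1)                          # this device belongs to some other OSD
' "$OSD_UUID") && { echo "OSD $OSD_UUID is currently $match"; break; }
done

Matching on osd_uuid sidesteps the rename entirely, because the UUID lives in the bluestore label on the disk itself rather than in the deployment spec.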

Any help would be greatly appreciated.

Planning to include this in v1.6.3 later today…

I see, the raw mode change does explain the fundamental issue, thanks.

One part of the issue is the recent change to update OSDs in parallel. Before we updated OSDs in parallel, any time the osd-prepare job ran it would update the OSD deployments based on the latest prepare-job output, which would fix any issues caused by disks having different names after a reboot.

However, this would not prevent OSDs from failing for a time after the node was rebooted. I wonder if ceph-volume’s raw mode behavior differs from the lvm mode behavior; I am currently investigating the differences in the lvm mode output.

[Update] ceph-volume in lvm mode outputs disks as /dev/mapper/MbEfX7-3xJF-R4Jm-Csz2-wfQ3-yGeq-VoXTwz devices, which according to this page are consistent.

The root cause of this problem is that ceph-volume raw list does not output consistent device names (it reports the user-friendly /dev/sdX names instead). This surfaced because we switched to using raw mode in v1.6. It is further exacerbated by Rook no longer being able to automatically recover from these types of failures, because of the way we changed the OSD update process to update in parallel.
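As a concrete cross-check of the naming problem (a sketch, not Rook code): the name that ceph-volume raw list prints can be mapped back to the persistent udev aliases for that disk, which is what a stable reference would have to use.

udevadm info --query=symlink --name=/dev/sdc
# typically prints disk/by-id/scsi-..., disk/by-id/wwn-..., disk/by-path/... aliases
# that stay the same across reboots even when /dev/sdc does not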

If you want to look at the ceph-volume output you need to run it on a machine with OSDs. It’s a local call that scans disks and it is the source of truth. Can you run it on a node and paste the output?

The tools container I executed ceph-volume in was running on the node the OSDs are on and was privileged. On the node itself the ceph-volume binary is not available. Since we’re using a very basic image for the nodes (Flatcar), I was not able to run ceph-volume directly on the node. I also tried using the rook image with https://github.com/kinvolk/toolbox, but that produces the same empty result for “ceph-volume raw list”: just “{}”.

It would also be helpful to get the CephCluster manifest as well as debug logs (ROOK_LOG_LEVEL=DEBUG) from the rook-ceph-operator-HHHHHH pod. You can attach those as files here.
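In case it helps others collecting the same data, one way to turn on debug logging (assuming the stock operator deployment; names may differ in a customised install):

kubectl -n rook-ceph set env deploy/rook-ceph-operator ROOK_LOG_LEVEL=DEBUG
# or set ROOK_LOG_LEVEL: DEBUG in the rook-ceph-operator-config ConfigMap, then
kubectl -n rook-ceph logs deploy/rook-ceph-operator > operator-debug.log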

Thanks, I attached logs and manifest here: rook.zip.