rook: OSD pod permissions broken, unable to open OSD superblock after node restart

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:

osd pods fail to restart after node restart

Expected behavior:

OSD pods come back online

How to reproduce it (minimal and precise):

  • Set up two pieces of storage – a partition on one drive (in my example, one that holds the OS) and a full drive
  • Set up a cluster using k0s (there are some caveats there)
  • Switch to Ceph version 15.2.6 (15.2.9 does not work: partitions are not properly considered by the ceph-volume batch command that gets run)
  • Install Rook 1.5.9
  • Observe that the pieces all start correctly: PVCs create PVs and both are bound (pods do not succeed yet because attaching works but mounting fails, which is a separate issue I’m still debugging), and the block device is present in the Ceph Dashboard
  • Restart the node
  • Observe that all pods in rook-ceph come back up, except the OSD pods, which are stuck in CrashLoopBackOff (see the commands after the log excerpt below)
  • Observe the following in the logs:
$ k logs rook-ceph-osd-1-59765fcfcd-p5mp5
debug 2021-04-02T13:59:19.854+0000 7f76bce47f40  0 set uid:gid to 167:167 (ceph:ceph)
debug 2021-04-02T13:59:19.854+0000 7f76bce47f40  0 ceph version 15.2.6 (cb8c61a60551b72614257d632a574d420064c17a) octopus (stable), process ceph-osd, pid 1
debug 2021-04-02T13:59:19.854+0000 7f76bce47f40  0 pidfile_write: ignore empty --pid-file
debug 2021-04-02T13:59:19.854+0000 7f76bce47f40 -1 bluestore(/var/lib/ceph/osd/ceph-1/block) _read_bdev_label failed to open /var/lib/ceph/osd/ceph-1/block: (13) Permission denied
debug 2021-04-02T13:59:19.854+0000 7f76bce47f40 -1  ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-1: (2) No such file or directory
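
For anyone reproducing this, roughly the commands to see the stuck pods and the crashed container’s logs (the app=rook-ceph-osd label selector is, as far as I know, the standard label Rook sets on OSD pods):

$ kubectl -n rook-ceph get pods -l app=rook-ceph-osd
$ kubectl -n rook-ceph logs --previous rook-ceph-osd-1-59765fcfcd-p5mp5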

I thought this might be related to the udev rule issue that came up in the runners, so I tried dropping in some udev rules (sketched after the directory listing below) and confirmed the changed owner, but the errors persist. When I went to check the folder as the ceph user, it turned out the folder is empty:

$ sudo su -l ceph -s /bin/bash
ceph@all-in-one-01:~$ tree /var/lib/ceph/
/var/lib/ceph/
├── bootstrap-mds
├── bootstrap-mgr
├── bootstrap-osd
├── bootstrap-rbd
├── bootstrap-rbd-mirror
├── bootstrap-rgw
├── crash
│   └── posted
├── mds
├── mgr
├── mon
├── osd
└── tmp

13 directories, 0 files
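
The udev rule I tried was along these lines (a sketch, not the exact rule – the filename and the dm/LVM match keys are assumptions based on how ceph-volume names the OSD logical volumes):

$ cat /etc/udev/rules.d/99-ceph-osd-owner.rules
ACTION=="add|change", ENV{DM_VG_NAME}=="ceph-*", ENV{DM_LV_NAME}=="osd-*", OWNER="167", GROUP="167", MODE="660"
$ sudo udevadm control --reload-rules && sudo udevadm trigger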

Environment:

  • OS (e.g. from /etc/os-release):
NAME="Ubuntu"
VERSION="20.04.2 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.2 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
  • Kernel (e.g. uname -a):
Linux all-in-one-01 5.4.0-67-generic #75-Ubuntu SMP Fri Feb 19 18:03:38 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Cloud provider or hardware configuration:

Hetzner, bare metal

  • Rook version (use rook version inside of a Rook Pod):
1.5.9
  • Storage backend version (e.g. for ceph do ceph -v):
15.2.6
  • Kubernetes version (use kubectl version):

1.20

  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift):

bare metal (provisioned by k0s)

The dashboard shows HEALTH_WARN, but I assume the warnings are benign for the following reasons:

  • PG_AVAILABILITY: Reduced data availability: 1 pg inactive
    • This makes sense because the cluster itself is inactive; there are no PVs yet
  • PG_DEGRADED: Degraded data redundancy: 1 pg undersized
    • This makes sense because 1 pg is actually “non HA” and does not have any redundancy
  • POOL_NO_REDUNDANCY: 1 pool(s) have no replicas configured
    • See previous note about the “undersized” pg
  • TOO_FEW_OSDS: OSD count 2 < osd_pool_default_size 3
    • There are only 2 things to put OSDs on (1 disk and 1 partition) in the whole “cluster”, so this makes sense.
    • I actually changed osdsPerDevice to 5 since these are NVMe disks

With osdsPerDevice set to "5", all these warnings go away except for POOL_NO_REDUNDANCY (one of my pools is intentionally without redundancy), and the OSDs start just fine the first time, so I think these are indeed benign.
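
For completeness, a quick way to double-check that setting and the cluster health (the cephcluster resource name "rook-ceph", the spec path, and the rook-ceph-tools toolbox deployment name are assumptions based on a default install):

$ kubectl -n rook-ceph get cephcluster rook-ceph -o jsonpath='{.spec.storage.config.osdsPerDevice}{"\n"}'
$ kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health detail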

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 22 (13 by maintainers)

Most upvoted comments

@travisn I just tested v1.6.3 on an arm64-based cluster running k0s and all pods in the rook-ceph namespace came back fine after reboot! Things are looking very good so far, thanks 👍

I had the same error. I figured out that I had to change the UID and GID of the ceph user in the /etc/passwd file to match the values expected by the ceph user inside the Kubernetes containers, which in my case are 167:167. The change was from

ceph:x:64045:64045:Ceph storage service:/var/lib/ceph:/usr/sbin/nologin

to

ceph:x:167:167:Ceph storage service:/var/lib/ceph:/usr/sbin/nologin
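
One way to make that change without editing /etc/passwd by hand (a sketch – do it while the OSDs are down, and fix ownership of any existing ceph-owned files afterwards):

$ sudo groupmod -g 167 ceph
$ sudo usermod -u 167 -g 167 ceph
$ sudo chown -R ceph:ceph /var/lib/ceph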

Nope – I looked and I replaced my ceph-volume lvm zap commands with sgdisk commands a while back, so I’m actually not using the packages at all.

I don’t particularly need the packages, so I’ll take them out once the build is done. Was this a part of the documentation I overlooked? If not, maybe updating it would be good, because I don’t feel like I’ve seen this discussed before. I’ve looked through what feels like quite a few tickets (I hit an absurd number of edge cases installing Rook this time around) and I don’t think I’ve seen the UID collision possibility mentioned before.

[EDIT] @galexrt I confirmed that the OSDs work when I remove the ceph packages and pre-created user. Everything works fine after reboot as well!

I’m good with this issue being closed, or I can make a small PR for a documentation change if it’s worth mentioning for those that pre-install ceph (or have it on the machines they’re running on in general).

Also, just a note in case it comes up in another issue: I did have to drop down to 15.2.6 (from 15.2.9) to use a partition because of the recent changes in Ceph that broke this.

@sdeoras This sounds like #7878, which has a fix in progress

@t3hmrman Do you have any ceph* packages installed on your nodes? E.g., dpkg --list ceph* or so.

Post the list and / or go ahead and remove the packages from the nodes.
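
Something along these lines should do it (a sketch for Debian/Ubuntu nodes; the awk/xargs purge pipeline is a suggestion, not a required step):

$ dpkg -l 'ceph*'
$ dpkg -l 'ceph*' | awk '/^ii/ {print $2}' | xargs -r sudo apt-get purge -y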

OK, so this is definitely the wrong way, but I figured out a way that works – I removed the --setuser and --setgroup arguments from the OSD pods, and they worked immediately. On disk, the files are still owned by UID/GID 167, so I tried runAsUser: 167 and that did not work.

Removing --setuser and --setgroup instantly fixed all the OSD pods, and I’m getting logs like normal. I’m not sure exactly what’s going wrong with setuid/setgid or the underlying mechanics here, but I’m taking the working cluster for now.
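
For anyone wanting to try the same workaround, roughly the steps (a sketch – the operator normally manages the OSD deployments, so it may put the arguments back unless it is scaled down first):

$ kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0
$ kubectl -n rook-ceph edit deployment rook-ceph-osd-1
# then remove the "--setuser", "ceph", "--setgroup", "ceph" entries from the osd container's args and save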

If I look at /etc/passwd, the ceph user is mapped to a very different uid:

ceph:x:64045:64045:Ceph storage service:/var/lib/ceph:/usr/sbin/nologin

That might actually be the problem here, so I gave it a shot and changed the ceph user’s uid and gid:

root@all-in-one-01 ~ # groupmod -g 167 ceph
root@all-in-one-01 ~ # usermod -u 167 -g 167 ceph
root@all-in-one-01 ~ # id ceph
uid=167(ceph) gid=167(ceph) groups=167(ceph)

After this I reversed the changes (i.e. put --setuser and --setgroup back in all the OSD deployments), and that did not work. I did notice, however, that the files on disk in /var/lib/rook/rook-ceph (the disks and metadata) now actually have the proper user/group, i.e. the ceph user.

I also tried using runAsUser: 167 and runAsGroup: 167, which didn’t work either.

Very confused about why the ceph user isn’t working, but at this point, removing --setuser and --setgroup and letting the OSDs run as root is a workaround.

PS: Looks like --setuser is hardcoded in, so I started looking at writing some sort of MutatingAdmissionWebhookController… And then I restarted and the pods came right back up properly, with their original configuration (runAsUser: 0, --setuser ceph and --setgroup ceph).

PPS: I restarted a rollout for one of the OSDs as well and it came back up properly, and ownership of the folder is set to ceph rather than 167… The least hacky way to make this consistent might be to pre-create the ceph system user. I saw some notes around the codebase about not being sure what UID the ceph user has, but it seems to be 167 pretty consistently (and there’s at least one place where it’s hard-coded).

The only thing I might have done in between is run chown -R ceph /var/lib/rook/rook-ceph and chgrp -R ceph /var/lib/rook/rook-ceph, but given that Rook is likely to give ceph the UID/GID 167 and I’m pre-creating my ceph user with those IDs, I think things might work out fine and I can skip the groupmod/usermod/chown/chgrp steps.

Pre-creating the ceph user might be the better fix; currently doing another from-scratch build.
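
A sketch of what that pre-creation might look like (167:167 matches what the containers expect here, but verify the IDs against your Ceph image before relying on them):

$ sudo groupadd --system --gid 167 ceph
$ sudo useradd --system --uid 167 --gid 167 --home-dir /var/lib/ceph --shell /usr/sbin/nologin --comment "Ceph storage service" ceph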