rook: OSD pod permissions broken, unable to open OSD superblock after node restart
Is this a bug report or feature request?
- Bug Report
Deviation from expected behavior:
osd pods fail to restart after node restart
Expected behavior:
OSD pods come back online
How to reproduce it (minimal and precise):
- Set up two pieces of storage – a partition on one drive (in my example, one that holds the OS) and a full drive
- Set up a cluster using `k0s` (there are some caveats there)
- Switch to Ceph version 15.2.6 (15.2.9 does not work; partitions are not properly considered by the `ceph-volume`/`ceph batch` command being run)
- Install Rook 1.5.9
- Observe that the pieces all start correctly, PVCs create PVs, both are bound (though pods do not succeed because while attaching works, mounting fails, but that’s a separate issue I’m trying to debug), block device is present in Ceph Dashboard
- Restart the node
- Observe that all pods in `rook-ceph` come up, except the OSD pods, which are stuck in CrashLoopBackOff
- Observe the following in the logs:
$ k logs rook-ceph-osd-1-59765fcfcd-p5mp5
debug 2021-04-02T13:59:19.854+0000 7f76bce47f40 0 set uid:gid to 167:167 (ceph:ceph)
debug 2021-04-02T13:59:19.854+0000 7f76bce47f40 0 ceph version 15.2.6 (cb8c61a60551b72614257d632a574d420064c17a) octopus (stable), process ceph-osd, pid 1
debug 2021-04-02T13:59:19.854+0000 7f76bce47f40 0 pidfile_write: ignore empty --pid-file
debug 2021-04-02T13:59:19.854+0000 7f76bce47f40 -1 bluestore(/var/lib/ceph/osd/ceph-1/block) _read_bdev_label failed to open /var/lib/ceph/osd/ceph-1/block: (13) Permission denied
debug 2021-04-02T13:59:19.854+0000 7f76bce47f40 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-1: (2) No such file or directory
I thought this might be related to the udev rule issue that came up in the runners, so I tried dropping in some udev rules (confirmed the changed owner; a sketch of what I used is below, after the directory listing), but the errors persist. When I went to check the folder as the ceph user, it turned out the folder is empty:
$ sudo su -l ceph -s /bin/bash
ceph@all-in-one-01:~$ tree /var/lib/ceph/
/var/lib/ceph/
├── bootstrap-mds
├── bootstrap-mgr
├── bootstrap-osd
├── bootstrap-rbd
├── bootstrap-rbd-mirror
├── bootstrap-rgw
├── crash
│ └── posted
├── mds
├── mgr
├── mon
├── osd
└── tmp
13 directories, 0 files
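The rules I dropped in were along these lines – a sketch only, assuming ceph-volume created LVM volumes (VGs named ceph-*, LVs named osd-block-*); for raw-mode OSDs the match would need to target the partitions/disks instead, and the rule file name is hypothetical:

# Hypothetical rule file; chowns the ceph-volume LVs to UID/GID 167 on add/change events
sudo tee /etc/udev/rules.d/99-ceph-osd.rules <<'EOF'
ACTION=="add|change", KERNEL=="dm-[0-9]*", ENV{DM_VG_NAME}=="ceph-*", ENV{DM_LV_NAME}=="osd-block-*", OWNER="167", GROUP="167", MODE="0660"
EOF
sudo udevadm control --reload-rules && sudo udevadm trigger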
Environment:
- OS (e.g. from /etc/os-release):
NAME="Ubuntu"
VERSION="20.04.2 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.2 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
- Kernel (e.g. `uname -a`):
Linux all-in-one-01 5.4.0-67-generic #75-Ubuntu SMP Fri Feb 19 18:03:38 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
- Cloud provider or hardware configuration:
Hetzner, bare metal
- Rook version (use `rook version` inside of a Rook Pod):
1.5.9
- Storage backend version (e.g. for Ceph do `ceph -v`):
15.2.6
- Kubernetes version (use `kubectl version`):
1.20
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift):
bare metal (provisioned by k0s)
- Storage backend status (e.g. for Ceph use `ceph health` in the Rook Ceph toolbox):
The dashboard is in HEALTH_WARN, but I assume the warnings are benign for the following reasons:
- `PG_AVAILABILITY: Reduced data availability: 1 pg inactive` – This makes sense because the whole cluster itself is inactive; there are no PVs
- `PG_DEGRADED: Degraded data redundancy: 1 pg undersized` – This makes sense because 1 pg is actually “non-HA” and does not have any redundancy
- `POOL_NO_REDUNDANCY: 1 pool(s) have no replicas configured` – See the previous note about the “undersized” pg
- `TOO_FEW_OSDS: OSD count 2 < osd_pool_default_size 3` – There are only 2 things to put OSDs on (1 disk and 1 partition) in the whole “cluster”, so this makes sense.
- I actually changed `osdsPerDevice` to 5 since these are NVMe disks
- PG_AVAILABILITY: Reduced data availability: 1 pg inactive
- PG_DEGRADED: Degraded data redundancy: 1 pg undersized
- POOL_NO_REDUNDANCY: 1 pool(s) have no replicas configured
- TOO_FEW_OSDS: OSD count 2 < osd_pool_default_size 3
With `osdsPerDevice` set to `"5"`, all these warnings go away except for `POOL_NO_REDUNDANCY` (one of my pools is intentionally without redundancy), and the OSDs start just fine the first time, so I think these are indeed benign.
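For reference, the storage part of my CephCluster spec looks roughly like the sketch below. Everything outside the storage section is trimmed to a minimal single-node setup, and the device names are placeholders for the OS-drive partition and the full drive:

# Minimal sketch of the CephCluster resource; only the storage section matters here.
# Device names (nvme0n1p4, nvme1n1) are placeholders for my partition + full drive.
cat <<'EOF' | kubectl apply -f -
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: ceph/ceph:v15.2.6
  dataDirHostPath: /var/lib/rook
  mon:
    count: 1
  storage:
    useAllNodes: false
    useAllDevices: false
    config:
      osdsPerDevice: "5"
    nodes:
      - name: all-in-one-01
        devices:
          - name: nvme0n1p4   # partition on the OS drive (placeholder)
          - name: nvme1n1     # full drive (placeholder)
EOF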
About this issue
- State: closed
- Created 3 years ago
- Comments: 22 (13 by maintainers)
@travisn I just tested v1.6.3 on an arm64-based cluster running k0s and all pods in the `rook-ceph` namespace came back fine after reboot! Things are looking very good so far, thanks 👍

I had the same error. I figured out that I have to change the `UID` and `GID` of the `ceph` user in the `/etc/passwd` file to match the values expected by the Kubernetes ceph user, which in my case are `167:167`. The change was from … to …
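Roughly, the change looks like this – a sketch; the original uid/gid values are system-specific, and anything still owned by the old IDs needs its ownership fixed afterwards:

# Check what the local ceph user/group are currently mapped to
getent passwd ceph
getent group ceph

# Re-map them to 167:167 to match the containers, then fix up ownership
sudo usermod --uid 167 ceph
sudo groupmod --gid 167 ceph
sudo chown -R ceph:ceph /var/lib/ceph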
Nope – I looked, and I replaced my `ceph-volume lvm zap` commands with `sgdisk` commands a while back, so I'm actually not using the packages at all. I don't particularly need the packages, so I'll take them out once the build is done – was this part of the documentation I overlooked? If not, maybe updating it would be good, because I don't feel like I've seen this discussed before. I've looked through what feels like quite a few tickets (I hit an absurd number of edge cases installing Rook this time around) and I don't think I've seen the collision possibility mentioned before.
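The wipe I'm doing instead looks roughly like this – the device name is a placeholder, and --zap-all destroys every partition table on it, so double-check the target:

# Destroy GPT/MBR structures on the target disk so Rook/ceph-volume sees it as clean.
# /dev/nvme1n1 is a placeholder; verify the device before running this.
sudo sgdisk --zap-all /dev/nvme1n1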
[EDIT] @galexrt I confirmed that the OSDs work when I remove the ceph packages and pre-created user. Everything works fine after reboot as well!
I'm good with this issue being closed, or I can make a small PR for a documentation change if it's worth mentioning for those that pre-install `ceph` (or have it on the machines they're running on in general). Also, just a note in case it comes up in another issue: I did have to drop down to `15.2.6` (from `15.2.9`) to use a partition because of the recent changes in Ceph that broke this.

@sdeoras This sounds like #7878, which has a fix in progress
@t3hmrman Do you have any `ceph*` packages installed on your nodes? E.g., `dpkg --list ceph*` or so. Post the list and/or go ahead and remove the packages from the nodes.
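Something along these lines – the package names below are just the common Ubuntu ones; the actual list on your node may differ:

# List any Ceph packages currently installed on the node
dpkg --list 'ceph*'

# Remove whatever that shows, e.g.:
sudo apt-get purge --auto-remove ceph ceph-common ceph-base ceph-osd ceph-mon ceph-mgr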
OK so this is definitely the wrong way, but I figured out a way that works – I removed the `--setuser` and `--setgroup` arguments from the OSD pods, and they worked immediately. On disk, the files are still owned by UID/GID 167, so I tried `runAsUser: 167` and that did not work. Removing `--setuser` and `--setgroup` instantly fixed all the OSD pods, and I'm getting logs like normal. I'm not sure exactly what's going wrong with setuid/setgid or the underlying mechanics here, but I'm taking the working cluster for now.
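For anyone poking at the same thing, this is roughly how I'm inspecting and editing the args. The deployment/container layout here is what a default Rook install produces, and the operator is scaled down first so it doesn't reconcile the manual change away while testing:

# Scale the operator down so it doesn't revert manual edits while testing
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0

# Look at the current OSD container args
kubectl -n rook-ceph get deployment rook-ceph-osd-1 -o jsonpath='{.spec.template.spec.containers[0].args}'

# Edit the deployment and delete the "--setuser", "ceph", "--setgroup", "ceph" entries
kubectl -n rook-ceph edit deployment rook-ceph-osd-1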
If I look at `/etc/passwd`, the `ceph` user is mapped to a very different uid. That might actually be the problem here, so I gave it a shot and changed the `ceph` user's uid and gid.
--setuserand--setgroupback in all the OSD deployments), and that did not work. I did notice however that now the files on disk in/var/lib/rook/rook-ceph(the disks and metadata) actually have the proper user/group now thecephuser.I also tried using
runAsUser: 167andrunAsGroup: 167, which didn’t work either.Very confused on why the
cephuser isn’t working, but at this point, removing--setuserand--setgroupand letting therootuser be used for the OSDs is a workaroundPS: Looks like
--setuseris hardcoded in, so I started looking at writing some sort of MutatingAdmissionWebhookController… And then I restarted and the pods came right back up properly, with their original configuration (runAsUser: 0,--setuser cephand--setgroup ceph).PSS: I restarted a rollout as well for one of the OSDs and it came back up properly, and ownership of the folder is set to
cephrather than 167… The least hacky way to make this consistent might be to pre-create thecephsystem user. I saw some notes around the codebase about not being sure what uid thecephuser had but it seems pretty consistently to be167(and there’s at least one place where it’s hard coded).The only thing I might have done in between is run
chmod -R ceph /var/lib/rook/rook-cephandchgrp -R ceph /var/lib/rook/rook-cephbut given that Rook is likely to givecephthe UID/GID 167 and I’m pre-creating mycephuser with those IDs, I think things might work out fine and I can skip doing thegroupmod/usermod/chmod/chgrps.Pre-creating the ceph user might be the better fix, currently doing another from-scratch build