rook: PVC mounts fail in v1.10.9 clusters when encryption is not enabled and the node kernel is older than 5.11
Is this a bug report or feature request?
- Bug Report
Deviation from expected behavior:
Expected behavior: I was able to mount PVCs with this new EC pool, but now it seems I can't mount any of the PVCs I have.
How to reproduce it (minimal and precise):
Apply the following example manifests (a sketch of their contents follows below):
https://github.com/rook/rook/blob/master/deploy/examples/csi/rbd/storageclass-ec.yaml
https://github.com/rook/rook/blob/master/deploy/examples/pool-ec.yaml
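For reference, the two linked manifests look roughly like the sketch below. The pool names match the operator log messages later in this report, but the chunk counts, replica size, and CSI secret parameters are illustrative rather than an exact copy of the upstream files:

```
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: ec-data-pool                 # erasure-coded pool holding the RBD data
  namespace: rook-ceph
spec:
  failureDomain: host
  erasureCoded:
    dataChunks: 2                    # illustrative chunk counts
    codingChunks: 1
---
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicated-metadata-pool     # replicated pool for RBD metadata (EC pools cannot hold it)
  namespace: rook-ceph
spec:
  failureDomain: host
  replicated:
    size: 3
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block-ec           # illustrative name
provisioner: rook-ceph.rbd.csi.ceph.com   # <operator-namespace>.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: replicated-metadata-pool     # metadata lands in the replicated pool
  dataPool: ec-data-pool             # data lands in the EC pool
  imageFormat: "2"
  imageFeatures: layering
  csi.storage.k8s.io/fstype: ext4
  # ...plus the usual csi.storage.k8s.io/*-secret-name and -namespace
  # parameters from the upstream storageclass-ec.yaml
reclaimPolicy: Delete
allowVolumeExpansion: true
```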
Logs to submit:
- Pod logs:
```
MountVolume.MountDevice failed for volume "pvc-9fda23c7-6ed9-4287-b04c-2cee7cd20a34" : rpc error: code = Internal desc = rbd: map failed with error an error (exit status 22) occurred while running rbd args: [--id csi-rbd-node -m 10.43.112.135:6789,10.43.80.29:6789,10.43.48.191:6789,10.43.136.249:6789,10.43.74.119:6789 --keyfile=***stripped*** map replicapool/csi-vol-c84112c7-943a-11ed-8dae-6edda849eb24 --device-type krbd --options noudev], rbd error output: rbd: sysfs write failed rbd: map failed: (22) Invalid argument
```
- Operator’s logs, if necessary:
```
2023-01-15 12:29:56.339869 E | ceph-block-pool-controller: failed to reconcile CephBlockPool "rook-ceph/ec-data-pool". failed to create pool "ec-data-pool".: failed to create pool "ec-data-pool".: failed to create pool "ec-data-pool": failed to create erasure code profile for pool "ec-data-pool": failed to look up default erasure code profile: failed to get erasure-code-profile for "default": exit status 1
2023-01-15 12:29:57.318449 E | ceph-block-pool-controller: failed to reconcile CephBlockPool "rook-ceph/replicated-metadata-pool". failed to create pool "replicated-metadata-pool".: failed to create pool "replicated-metadata-pool".: failed to create pool "replicated-metadata-pool": failed to create replicated crush rule "replicated-metadata-pool": failed to create crush rule replicated-metadata-pool: exit status 1
2023-01-15 12:29:58.530240 E | ceph-block-pool-controller: failed to reconcile CephBlockPool "rook-ceph/ec-data-pool". failed to create pool "ec-data-pool".: failed to create pool "ec-data-pool".: failed to create pool "ec-data-pool": failed to create erasure code profile for pool "ec-data-pool": failed to look up default erasure code profile: failed to get erasure-code-profile for "default": exit status 1
```
**Cluster Status to submit**:
```
  cluster:
    id:     cb794671-ea06-4e89-8661-ce00ba0134d5
    health: HEALTH_WARN
            clock skew detected on mon.bl
            mon bc is low on available space
            40 daemons have recently crashed
            2 mgr modules have recently crashed

  services:
    mon: 5 daemons, quorum ay,bc,bf,bg,bl (age 2h)
    mgr: b(active, since 50m), standbys: a
    mds: 2/2 daemons up, 2 hot standby
    osd: 6 osds: 6 up (since 35m), 6 in (since 17h); 13 remapped pgs

  data:
    volumes: 2/2 healthy
    pools:   11 pools, 88 pgs
    objects: 3.87M objects, 984 GiB
    usage:   2.8 TiB used, 2.5 TiB / 5.3 TiB avail
    pgs:     603009/11623113 objects misplaced (5.188%)
             75 active+clean
             11 active+remapped+backfilling
             2  active+clean+remapped

  io:
    client:   1.6 MiB/s rd, 2.7 MiB/s wr, 620 op/s rd, 172 op/s wr
    recovery: 848 KiB/s, 14 objects/s

  progress:
    Global Recovery Event (45m)
      [========================....] (remaining: 6m)
```
**Environment**:
* OS (e.g. from /etc/os-release): Ubuntu 21
* Kernel (e.g. `uname -a`): 20.04.1-Ubuntu
* Cloud provider or hardware configuration:
* Rook version (use `rook version` inside of a Rook Pod): rook: v1.10.8
* Storage backend version (e.g. for ceph do `ceph -v`): ceph version 17.2.1 (ec95624474b1871a821a912b8c3af68f8f8e7aa1) quincy (stable)
* Kubernetes version (use `kubectl version`): 1.23
* Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): RKE 1
* Storage backend status (e.g. for Ceph use `ceph health` in the [Rook Ceph toolbox](https://rook.io/docs/rook/latest-release/Troubleshooting/ceph-toolbox/#interactive-toolbox)): HEALTH_WARN clock skew detected on mon.bl; mon bc is low on available space; 40 daemons have recently crashed; 2 mgr modules have recently crashed
@Madhu-1, I think the code at line 605 below is not checking the `.Enabled` setting but the existence of the `c.Spec.Network.Connections.Encryption` setting. @travisn, is the intention to enable this by default and have users remove the `c.Spec.Network.Connections.Encryption` setting if not required? I think earlier kernel versions do not have support for this option, as mentioned by @cayla:
`libceph: bad option at 'ms_mode=prefer-crc'`
https://github.com/rook/rook/blob/91beb549be3720a1278c3b5c67934f46555e1db5/pkg/operator/ceph/cluster/cluster.go#L605-L625
Update: upgrading the kernel to 5.11+ (5.15 in my case) resolved the RBD mounting issue described in this comment.
Ahha
Thanks for confirming the theory I am working on right now.
I had the same issue as you after upgrading from 1.9.10 to 1.10.9.
I suspect it was triggered (intentionally or not) by https://github.com/rook/rook/pull/11523
I was digging around and saw this comment:
https://github.com/rook/rook/blob/ae408e354443ab8af9ade5768371e7ac82c1233c/deploy/examples/cluster.yaml#L78-L86
Even though encryption isn’t enabled by default, I think this is now a hard requirement.
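For anyone who can't chase that permalink, the linked section is the `network.connections` block of the CephCluster spec. The sketch below uses the field names from the v1.10 CRD, with the upstream comment paraphrased from what's discussed in this thread; treat it as illustrative rather than a verbatim copy of `cluster.yaml`:

```
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  network:
    connections:
      # On-the-wire (msgr2) encryption between Ceph daemons and clients.
      # Disabled by default, but per the comment discussed in this thread,
      # using the msgr2 options with kernel-mounted RBD needs a 5.11+ kernel.
      encryption:
        enabled: false
      # On-the-wire compression, also disabled by default.
      compression:
        enabled: false
```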
I am installing `linux-generic-hwe-20.04` (5.15.0-58-generic) on my nodes and rebooting to confirm it fixes the issue, but your comment gives me a lot of hope.

FWIW, this was my debugging path:
- New PVCs were failing to mount. Existing, mounted PVCs were fine.
- New pods showed the same `MountVolume.MountDevice failed` event quoted in the issue body.
- This was echoed in the `csi-rbdplugin` logs.
- `dmesg` on the node showed the `libceph: bad option at 'ms_mode=prefer-crc'` error quoted above.
- Googling for that led me to the aforementioned merged PR.
- I was trying to work out how this config gets passed down, and finally made my way back to the `cluster.yaml`, where I saw the comment about the kernel version and inspiration struck.

I am not a rook or ceph expert by any means, but I'm writing it all out here for anyone else who stumbles on this.
@Rakshith-R Agreed that you’ve identified the issue. Due to that setting not being available in older kernels, we should not set that if encryption is not enabled. However, if encryption had been enabled and now is being disabled, rook should remove that setting from the mon store. So it seems that if encryption is not enabled, rook needs to query the mon store to see if it was set and needs to be disabled.
Can confirm the mounting issue was resolved by installing the `linux-generic-hwe-20.04` package… 😃
I noted it in an edit above, but in case it was missed, upgrading the kernel to 5.11+ completely resolved my issues. I have a fat and happy ceph cluster once again.
I am not using EC, so yeah, it being multiple issues is a distinct possibility.