rook: After upgrade to ceph 18.2: Module 'crash' has failed: dictionary changed size during iteration
Is this a bug report or feature request?
- Bug Report
Deviation from expected behavior:
Cluster is in HEALTH_ERR (but still working normally otherwise).
The error is Module 'crash' has failed: dictionary changed size during iteration
The mgr pods are working fine (not crashlooping).
Expected behavior:
The crash mgr module keeps working after the upgrade to Reef and the cluster stays in HEALTH_OK.
How to reproduce it (minimal and precise):
(unknown)
File(s) to submit:
- Cluster CR (custom resource), typically called `cluster.yaml`, if necessary
Logs to submit:
- Operator's logs, if necessary
- Crashing pod(s) logs, if necessary

To get logs, use `kubectl -n <namespace> logs <pod name>`. When pasting logs, always surround them with backticks or use the insert code button from the GitHub UI. Read the GitHub documentation if you need help.
Cluster Status to submit:
- Output of krew commands, if necessary

To get the health of the cluster, use `kubectl rook-ceph health`. To get the status of the cluster, use `kubectl rook-ceph ceph status`.
cluster:
id: 3a35629a-6129-4daf-9db6-36e0eda637c7
health: HEALTH_ERR
Module 'crash' has failed: dictionary changed size during iteration
29 pgs not deep-scrubbed in time
29 pgs not scrubbed in time
1512 mgr modules have recently crashed
services:
mon: 3 daemons, quorum o,af,ag (age 47h)
mgr: a(active, since 20h), standbys: b
mds: 1/1 daemons up, 1 hot standby
osd: 12 osds: 12 up (since 46h), 12 in (since 5d)
data:
volumes: 1/1 healthy
pools: 4 pools, 97 pgs
objects: 4.07M objects, 1.9 TiB
usage: 5.8 TiB used, 88 TiB / 94 TiB avail
pgs: 65 active+clean
17 active+clean+snaptrim_wait
15 active+clean+snaptrim
io:
client: 6.3 MiB/s rd, 6.4 MiB/s wr, 11 op/s rd, 44 op/s wr
For more details, see the Rook Krew Plugin
Environment:
- OS (e.g. from /etc/os-release): Debian 11/12
- Kernel (e.g. `uname -a`): 5.10.0-25-amd64 #1 SMP Debian 5.10.191-1 (2023-08-16) x86_64 GNU/Linux
- Cloud provider or hardware configuration: bare-metal
- Rook version (use `rook version` inside of a Rook Pod): 1.12.4
- Storage backend version (e.g. for ceph do `ceph -v`): 18.2.0
- Kubernetes version (use `kubectl version`): 1.25.13
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): kubeadm
- Storage backend status (e.g. for Ceph use `ceph health` in the Rook Ceph toolbox):
About this issue
- Original URL
- State: closed
- Created 9 months ago
- Comments: 21 (11 by maintainers)
@travisn @leseb Reading the crash module code, all the external methods do use the `with_crashes` decorator, which blocks access to the `crashes` dictionary before using it, except the method `do_post` (which ends up calling `_refresh_health_checks`). This method can alter the crashes dictionary without blocking. I'm not sure why `with_crashes` is not applied in this case; it seems necessary, otherwise somebody can call any crash-listing method while the dictionary is being altered by this method, and we get the error `dictionary changed size during iteration`.

IIRC the crash-collector tries various keyrings while posting crashes, so sometimes the output can be confusing. It will print the result from every key it tried, but the logs are not clear about which one is being processed.
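Below is a minimal sketch (plain Python, not the actual ceph-mgr crash module; the class name, `do_post_unlocked`, and the `crashes_lock` field are illustrative assumptions) of the pattern described above: a `with_crashes`-style decorator that serializes access to the shared `crashes` dictionary, and a write path that skips it and can therefore race with a reader, producing `dictionary changed size during iteration`.

```python
import threading
from functools import wraps


def with_crashes(func):
    # Hypothetical stand-in for the module's with_crashes decorator:
    # acquire a lock so the wrapped method sees a stable crashes dict.
    @wraps(func)
    def wrapper(self, *args, **kwargs):
        with self.crashes_lock:
            return func(self, *args, **kwargs)
    return wrapper


class CrashModuleSketch:
    def __init__(self):
        self.crashes = {}                      # crash_id -> metadata
        self.crashes_lock = threading.Lock()
        self.health = {}

    @with_crashes
    def do_ls(self):
        # Safe: iterates the dictionary while holding the lock.
        return sorted(self.crashes)

    def do_post_unlocked(self, crash_id, meta):
        # The problematic pattern: mutating self.crashes without the lock
        # can race with a reader that is iterating it, which is what
        # raises "dictionary changed size during iteration".
        self.crashes[crash_id] = meta
        self._refresh_health_checks()

    @with_crashes
    def do_post(self, crash_id, meta):
        # The remedy argued for above: take the same lock on the write path.
        self.crashes[crash_id] = meta
        self._refresh_health_checks()

    def _refresh_health_checks(self):
        # Iterates the dictionary; only safe while the caller holds the lock.
        recent = [cid for cid, m in self.crashes.items() if not m.get("archived")]
        self.health = {"RECENT_CRASH": len(recent)}
```

Taking the same lock on both the read and write paths is the general remedy; how the real module addresses it is covered by the fix referenced in the next comment.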
The following `handle_auth_bad_method server allowed_methods [2] but i only support [2,1]` is as bad as an error message can be. However, IIRC it means the keyring file present on the filesystem does not match the one inside the mon's internal store. It would be good to see if the ceph-crash permission changed in the mon store with Reef.

For the `dictionary changed size during iteration` issue, I opened a fix: https://github.com/ceph/ceph/pull/53711.
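For the keyring-mismatch part, a rough diagnostic sketch is below. Assumptions: it runs somewhere the `ceph` CLI is configured (e.g. the Rook toolbox pod), the crash collector's CephX entity is `client.crash`, and `KEYRING_PATH` is a placeholder; the actual keyring path mounted into the crash-collector pod is deployment-specific, so adjust both to your setup.

```python
import re
import subprocess

# Placeholder path: the crash-collector keyring location is deployment-specific.
KEYRING_PATH = "/etc/ceph/keyring"


def extract_key(text: str) -> str:
    # Pull the base64 secret out of a keyring-formatted blob ("key = ...").
    match = re.search(r"^\s*key\s*=\s*(\S+)", text, re.MULTILINE)
    return match.group(1) if match else ""


# Key for client.crash as the mons know it.
mon_side = subprocess.run(
    ["ceph", "auth", "get", "client.crash"],
    capture_output=True, text=True, check=True,
).stdout

# Key as written in the keyring file on disk.
with open(KEYRING_PATH) as f:
    disk_side = f.read()

if extract_key(mon_side) == extract_key(disk_side):
    print("client.crash key matches the mon store")
else:
    print("MISMATCH: on-disk keyring differs from the key in the mon store")
```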