rook: After upgrade to ceph 18.2: Module 'crash' has failed: dictionary changed size during iteration

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior: The cluster is in HEALTH_ERR (but otherwise still working normally). The reported error is `Module 'crash' has failed: dictionary changed size during iteration`. The mgr pods are working fine (not crashlooping).

Expected behavior: The cluster reports HEALTH_OK after the upgrade.

How to reproduce it (minimal and precise):

(unknown)

File(s) to submit:

  • Cluster CR (custom resource), typically called cluster.yaml, if necessary

Logs to submit:

  • Operator’s logs, if necessary

  • Crashing pod(s) logs, if necessary

    To get logs, use `kubectl -n <namespace> logs <pod name>`. When pasting logs, always surround them with backticks or use the insert code button from the GitHub UI. Read the GitHub documentation if you need help.

Cluster Status to submit:

  • Output of krew commands, if necessary

    To get the health of the cluster, use `kubectl rook-ceph health`. To get the status of the cluster, use `kubectl rook-ceph ceph status`.

  cluster:
    id:     3a35629a-6129-4daf-9db6-36e0eda637c7
    health: HEALTH_ERR
            Module 'crash' has failed: dictionary changed size during iteration
            29 pgs not deep-scrubbed in time
            29 pgs not scrubbed in time
            1512 mgr modules have recently crashed

  services:
    mon: 3 daemons, quorum o,af,ag (age 47h)
    mgr: a(active, since 20h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 12 osds: 12 up (since 46h), 12 in (since 5d)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 4.07M objects, 1.9 TiB
    usage:   5.8 TiB used, 88 TiB / 94 TiB avail
    pgs:     65 active+clean
             17 active+clean+snaptrim_wait
             15 active+clean+snaptrim

  io:
    client:   6.3 MiB/s rd, 6.4 MiB/s wr, 11 op/s rd, 44 op/s wr

For more details, see the Rook Krew Plugin

Environment:

  • OS (e.g. from /etc/os-release): Debian 11/12
  • Kernel (e.g. uname -a): 5.10.0-25-amd64 #1 SMP Debian 5.10.191-1 (2023-08-16) x86_64 GNU/Linux
  • Cloud provider or hardware configuration: bare-metal
  • Rook version (use rook version inside of a Rook Pod): 1.12.4
  • Storage backend version (e.g. for ceph do ceph -v): 18.2.0
  • Kubernetes version (use kubectl version): 1.25.13
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): kubeadm
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):

Most upvoted comments

@travisn @leseb Reading the crash module code, all the external methods do use the `with_crashes` decorator, which locks the `crashes` dictionary before using it, except `do_post` (which ends up calling `_refresh_health_checks`). That method can alter the `crashes` dictionary without taking the lock. I'm not sure why `with_crashes` is not applied in this case; it seems necessary, since otherwise any crash-listing method can iterate over the dictionary while this method is altering it, which is exactly how we get the `dictionary changed size during iteration` error.
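
To illustrate the failure mode, here is a minimal sketch; it is not the actual ceph-mgr crash module code, and `crashes`, `crashes_lock`, and the decorator below are simplified stand-ins for its internals:

```python
import threading

# Minimal sketch of the race described above. This is NOT the real ceph-mgr
# crash module; `crashes`, `crashes_lock`, and the decorator are simplified
# stand-ins for its internals.

crashes = {"crash-1": {}, "crash-2": {}}
crashes_lock = threading.Lock()

def with_crashes(fn):
    """Serialize access to `crashes`, analogous to the module's decorator."""
    def wrapper(*args, **kwargs):
        with crashes_lock:
            return fn(*args, **kwargs)
    return wrapper

@with_crashes
def list_crashes():
    return list(crashes)            # iteration happens under the lock

@with_crashes
def post_crash(crash_id):
    crashes[crash_id] = {}          # mutation also happens under the lock

# Without the lock, changing the dict's size while iterating raises exactly
# the reported error (demonstrated single-threaded here for determinism):
try:
    for key in crashes:
        crashes["crash-new"] = {}   # size change during iteration
except RuntimeError as exc:
    print(exc)                      # dictionary changed size during iteration
```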

IIRC the crash-collector tries various keyrings while posting crashes, so sometimes the output can be confusing. It prints the result for every key it tried, but the logs do not make clear which one is being processed.
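
Roughly, the retry behavior looks like the sketch below (simplified; the auth names and exact command are my assumptions from reading `ceph-crash`, not verbatim code from it):

```python
import socket
import subprocess

# Simplified sketch of the retry loop: try several auth names in turn and
# log every attempt, which is why the output can look confusing.
AUTH_NAMES = [
    f"client.crash.{socket.gethostname()}",
    "client.crash",
    "client.admin",
]

def post_crash(meta_path: str) -> bool:
    for name in AUTH_NAMES:
        result = subprocess.run(
            ["ceph", "-n", name, "crash", "post", "-i", meta_path],
            capture_output=True,
            text=True,
        )
        print(f"attempt with {name}: rc={result.returncode}")  # every attempt is logged
        if result.returncode == 0:
            return True   # this keyring worked; earlier failures were still printed
    return False          # no keyring could post the crash
```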

The following `handle_auth_bad_method server allowed_methods [2] but i only support [2,1]` is about as bad as an error message can be. However, IIRC it means the keyring file present on the filesystem does not match the one in the mons' internal store. It would be good to check whether the ceph-crash permissions changed in the mon store with Reef.
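
One way to check for such a mismatch is sketched below; `ENTITY` and `KEYRING` are assumptions, so adjust both to your deployment:

```python
import re
import subprocess

# Sketch: compare the on-disk ceph-crash keyring with the key held in the
# mon store. ENTITY and KEYRING are assumptions; adjust to your deployment.
ENTITY = "client.crash"
KEYRING = "/etc/ceph/keyring"

# `ceph auth get-key` prints only the secret for the given entity.
mon_key = subprocess.check_output(
    ["ceph", "auth", "get-key", ENTITY], text=True
).strip()

# Extract `key = ...` from the local keyring file.
with open(KEYRING) as f:
    match = re.search(r"key\s*=\s*(\S+)", f.read())
local_key = match.group(1) if match else None

# Avoid printing the secrets themselves; just report whether they match.
print("keyring matches mon store" if local_key == mon_key
      else "MISMATCH: re-sync or rotate the keyring")
```

If you also want to compare the permissions, `ceph auth get client.crash` shows the full entry, including its mon/mgr caps.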

For the `dictionary changed size during iteration` issue, I opened a fix: https://github.com/ceph/ceph/pull/53711.