rook: rook-ceph-exporter crashing on v1.11.0

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:

rook-ceph-exporter is crashing on 2 nodes; these 2 nodes do not have OSDs.

❯ k get po -n rook-ceph -o wide
NAME                                                READY   STATUS             RESTARTS          AGE     IP              NODE    NOMINATED NODE   READINESS GATES
...
rook-ceph-exporter-k8s-1-64fdc5b8b7-wpjkk           0/1     CrashLoopBackOff   545 (3m9s ago)    46h     192.168.42.11   k8s-1   <none>           <none>
rook-ceph-exporter-k8s-2-6947bb7b8-l62bj            0/1     CrashLoopBackOff   545 (4m17s ago)   46h     192.168.42.12   k8s-2   <none>           <none>
rook-ceph-exporter-k8s-3-75c4768b97-fqqlx           1/1     Running            4 (46h ago)       46h     192.168.42.13   k8s-3   <none>           <none>
rook-ceph-exporter-k8s-4-5977d4894-d6hv2            1/1     Running            0                 46h     192.168.42.14   k8s-4   <none>           <none>
rook-ceph-exporter-k8s-5-78bff4f8db-88c2v           1/1     Running            0                 46h     192.168.42.15   k8s-5   <none>           <none>
...
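
For reference, OSD placement can be cross-checked by listing only the OSD pods with their node assignment (app=rook-ceph-osd is the label Rook normally applies to OSD pods; an assumption here, adjust if relabelled):

❯ k get po -n rook-ceph -l app=rook-ceph-osd -o wide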

Expected behavior:

The exporter pods should not crash, or perhaps should not be scheduled on these nodes at all.

How to reproduce it (minimal and precise):

File(s) to submit:

  • Cluster CR (custom resource), typically called cluster.yaml, if necessary

Logs to submit:

❯ k logs -n rook-ceph rook-ceph-exporter-k8s-1-64fdc5b8b7-wpjkk
Defaulted container "ceph-exporter" out of: ceph-exporter, chown-container-data-dir (init)
global_init: unable to open config file from search list /var/lib/rook/rook-ceph/rook-ceph.config
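
The init container's logs can be pulled explicitly as well (container name taken from the "Defaulted container" message above), in case the chown step reports anything useful:

❯ k logs -n rook-ceph rook-ceph-exporter-k8s-1-64fdc5b8b7-wpjkk -c chown-container-data-dir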


Cluster Status to submit:

  • Output of krew commands, if necessary
❯ kubectl rook-ceph health
Info:  Checking if at least three mon pods are running on different nodes
rook-ceph-mon-a-548f8978b6-rjtks                    2/2     Running            0                 46h
rook-ceph-mon-b-64d6c9c8d6-xmmc6                    2/2     Running            0                 46h
rook-ceph-mon-c-59bdc749d9-kdbfz                    2/2     Running            0                 46h

Info:  Checking mon quorum and ceph health details
HEALTH_OK

Info:  Checking if at least three osd pods are running on different nodes
rook-ceph-osd-0-549ccf9c5-swkdk                     2/2     Running            0                 46h
rook-ceph-osd-1-7c478df4f-ld7m5                     2/2     Running            0                 46h
rook-ceph-osd-2-659fccd446-xrpjh                    2/2     Running            0                 46h

Info:  Pods that are in 'Running' status
NAME                                                READY   STATUS             RESTARTS          AGE
csi-cephfsplugin-7xxl4                              2/2     Running            0                 46h
csi-cephfsplugin-fnjpv                              2/2     Running            0                 46h
csi-cephfsplugin-fswft                              2/2     Running            0                 46h
csi-cephfsplugin-mptqx                              2/2     Running            0                 46h
csi-cephfsplugin-provisioner-797459f9bb-vxtmr       5/5     Running            0                 46h
csi-cephfsplugin-provisioner-797459f9bb-zvmjp       5/5     Running            0                 46h
csi-cephfsplugin-r5zvc                              2/2     Running            0                 46h
csi-cephfsplugin-wmhht                              2/2     Running            0                 46h
csi-rbdplugin-95857                                 2/2     Running            0                 46h
csi-rbdplugin-hhsr2                                 2/2     Running            0                 46h
csi-rbdplugin-p6q72                                 2/2     Running            0                 46h
csi-rbdplugin-provisioner-d8cb566dc-bpw5r           5/5     Running            0                 46h
csi-rbdplugin-provisioner-d8cb566dc-gbfks           5/5     Running            0                 46h
csi-rbdplugin-rq6wh                                 2/2     Running            0                 46h
csi-rbdplugin-sczzb                                 2/2     Running            0                 46h
csi-rbdplugin-v4w6m                                 2/2     Running            0                 46h
rook-ceph-crashcollector-k8s-1-576666d97c-2m49l     1/1     Running            0                 46h
rook-ceph-crashcollector-k8s-2-7dbc8ddc4b-g6m2h     1/1     Running            0                 46h
rook-ceph-crashcollector-k8s-3-765bb4759-bxh48      1/1     Running            0                 46h
rook-ceph-crashcollector-k8s-4-5d7f47b968-wpr9s     1/1     Running            0                 46h
rook-ceph-crashcollector-k8s-5-5d5b94bff6-hv4hq     1/1     Running            0                 46h
rook-ceph-exporter-k8s-1-64fdc5b8b7-wpjkk           0/1     CrashLoopBackOff   547 (86s ago)     46h
rook-ceph-exporter-k8s-2-6947bb7b8-l62bj            0/1     CrashLoopBackOff   547 (2m36s ago)   46h
rook-ceph-exporter-k8s-3-75c4768b97-fqqlx           1/1     Running            4 (46h ago)       46h
rook-ceph-exporter-k8s-4-5977d4894-d6hv2            1/1     Running            0                 46h
rook-ceph-exporter-k8s-5-78bff4f8db-88c2v           1/1     Running            0                 46h
rook-ceph-mds-ceph-filesystem-a-58ffcfc468-t5rfw    2/2     Running            0                 46h
rook-ceph-mds-ceph-filesystem-b-76bf878889-87k6p    2/2     Running            0                 46h
rook-ceph-mgr-a-55878dbfdf-mngtv                    2/2     Running            0                 46h
rook-ceph-mgr-b-6679d96d7b-ktzz8                    1/2     Running            0                 46h
rook-ceph-mon-a-548f8978b6-rjtks                    2/2     Running            0                 46h
rook-ceph-mon-b-64d6c9c8d6-xmmc6                    2/2     Running            0                 46h
rook-ceph-mon-c-59bdc749d9-kdbfz                    2/2     Running            0                 46h
rook-ceph-operator-86cb6fdc9d-c6xxf                 1/1     Running            1 (46h ago)       46h
rook-ceph-osd-0-549ccf9c5-swkdk                     2/2     Running            0                 46h
rook-ceph-osd-1-7c478df4f-ld7m5                     2/2     Running            0                 46h
rook-ceph-osd-2-659fccd446-xrpjh                    2/2     Running            0                 46h
rook-ceph-rgw-ceph-objectstore-a-5f4bccb7d7-5kmrr   2/2     Running            0                 46h
rook-ceph-tools-6fddc74f44-644tg                    1/1     Running            0                 4d23h

Warning:  Pods that are 'Not' in 'Running' status
NAME                                READY   STATUS      RESTARTS   AGE

Info:  checking placement group status
Info:  169 pgs: 169 active+clean; 333 GiB data, 665 GiB used, 2.1 TiB / 2.7 TiB avail; 852 B/s rd, 134 KiB/s wr, 16 op/s

Info:  checking if at least one mgr pod is running
rook-ceph-mgr-a-55878dbfdf-mngtv                    Running     k8s-2
rook-ceph-mgr-b-6679d96d7b-ktzz8                    Running     k8s-1
❯ kubectl rook-ceph ceph status
  cluster:
    id:     08100b31-1310-4691-b265-dc43665c92c9
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 46h)
    mgr: a(active, since 46h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 46h), 3 in (since 7w)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 226.74k objects, 333 GiB
    usage:   665 GiB used, 2.1 TiB / 2.7 TiB avail
    pgs:     169 active+clean

  io:
    client:   155 KiB/s rd, 91 KiB/s wr, 98 op/s rd, 5 op/s wr

Environment:

  • OS (e.g. from /etc/os-release): Ubuntu 22.04.2 LTS
  • Kernel (e.g. uname -a): Linux k8s-5 5.15.0-60-generic #66-Ubuntu SMP Fri Jan 20 14:29:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Cloud provider or hardware configuration: 6 node k3s cluster with 1 OSD on 3 nodes each
  • Rook version (use rook version inside of a Rook Pod): 1.11.0
  • Storage backend version (e.g. for ceph do ceph -v): ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
  • Kubernetes version (use kubectl version): 1.26.1
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): k3s
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): HEALTH_OK

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 17 (13 by maintainers)

Most upvoted comments

You can see the pods have been restarting over 500 times since I updated to 1.11.

Looking into this a bit more, I see that the exporter expects the path /var/lib/rook/rook-ceph/rook-ceph.conf to exist on the host. That host path is not created until an OSD is created on the node; other daemons such as the mon/mgr do not create it.
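
A quick way to confirm this on the nodes (the node names and SSH access are only examples of how one might check) is to look for the directory directly; on a node without OSDs it should be missing, while on an OSD node it should contain the generated config:

❯ ssh k8s-1 'ls -l /var/lib/rook/rook-ceph/'
❯ ssh k8s-3 'ls -l /var/lib/rook/rook-ceph/'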

Wherever possible, the daemons should be able to run with other arguments instead of relying on the ceph.conf path. For example, these two env vars are used by most of the daemons when running in Rook:

      ROOK_CEPH_MON_HOST:           <set to the key 'mon_host' in secret 'rook-ceph-config'>  Optional: false
      CEPH_ARGS:                    -m $(ROOK_CEPH_MON_HOST)
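
Whether the exporter deployments already receive these env vars can be checked with something like the following (the deployment name is inferred from the pod names above):

❯ k -n rook-ceph describe deploy rook-ceph-exporter-k8s-1 | grep -A 4 'Environment:'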

If the exporter requires a ceph.conf, then it will need an init container to generate the file under its own config dir, instead of relying on the one generated by the OSD.
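
As a rough illustration only (not the actual fix; the path is simply the one from the error message above), such an init step would only need to write a minimal config pointing at the mons:

❯ # write a minimal config containing only the monitor addresses (illustrative sketch)
❯ cat > /var/lib/rook/rook-ceph/rook-ceph.config <<EOF
[global]
mon_host = $ROOK_CEPH_MON_HOST
EOF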

For the large CRDs issue, does kubectl replace help resolve it? Hopefully that fixes it, per #11772.
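
For reference, the usual form of that workaround (crds.yaml here is the standard Rook manifest name; adjust to however the CRDs were installed) is to replace rather than apply, so the oversized last-applied-configuration annotation is not needed:

❯ kubectl replace -f crds.yaml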

You can see the pods have been restarting over 500 times since I updated to 1.11.

Sorry, missed that. I need to look into it; thanks for bringing it up. And can you try deleting the failing pods?
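
For completeness, deleting the crashing exporter pods (names taken from the listing above) so that the deployments recreate them would look like:

❯ k -n rook-ceph delete po rook-ceph-exporter-k8s-1-64fdc5b8b7-wpjkk rook-ceph-exporter-k8s-2-6947bb7b8-l62bj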