rook: rook-ceph-exporter crashing on v1.11.0
Is this a bug report or feature request?
- Bug Report
Deviation from expected behavior:
rook-ceph-exporter is crashing on 2 nodes; these 2 nodes do not have OSDs (see the OSD placement check after the pod listing below).
❯ k get po -n rook-ceph -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
...
rook-ceph-exporter-k8s-1-64fdc5b8b7-wpjkk 0/1 CrashLoopBackOff 545 (3m9s ago) 46h 192.168.42.11 k8s-1 <none> <none>
rook-ceph-exporter-k8s-2-6947bb7b8-l62bj 0/1 CrashLoopBackOff 545 (4m17s ago) 46h 192.168.42.12 k8s-2 <none> <none>
rook-ceph-exporter-k8s-3-75c4768b97-fqqlx 1/1 Running 4 (46h ago) 46h 192.168.42.13 k8s-3 <none> <none>
rook-ceph-exporter-k8s-4-5977d4894-d6hv2 1/1 Running 0 46h 192.168.42.14 k8s-4 <none> <none>
rook-ceph-exporter-k8s-5-78bff4f8db-88c2v 1/1 Running 0 46h 192.168.42.15 k8s-5 <none> <none>
...
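For completeness, OSD placement can be confirmed with the label selector Rook applies to OSD pods (a suggested check; only the three OSD-carrying nodes should be listed):

❯ kubectl -n rook-ceph get pods -l app=rook-ceph-osd -o wide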
Expected behavior:
The exporter pods should not crash, or perhaps should not be scheduled on these nodes at all.
How to reproduce it (minimal and precise):
File(s) to submit:
- Cluster CR (custom resource), typically called cluster.yaml, if necessary
Logs to submit:
❯ k logs -n rook-ceph rook-ceph-exporter-k8s-1-64fdc5b8b7-wpjkk
Defaulted container "ceph-exporter" out of: ceph-exporter, chown-container-data-dir (init)
global_init: unable to open config file from search list /var/lib/rook/rook-ceph/rook-ceph.config
Cluster Status to submit:
- Output of krew commands, if necessary
❯ kubectl rook-ceph health
Info: Checking if at least three mon pods are running on different nodes
rook-ceph-mon-a-548f8978b6-rjtks 2/2 Running 0 46h
rook-ceph-mon-b-64d6c9c8d6-xmmc6 2/2 Running 0 46h
rook-ceph-mon-c-59bdc749d9-kdbfz 2/2 Running 0 46h
Info: Checking mon quorum and ceph health details
HEALTH_OK
Info: Checking if at least three osd pods are running on different nodes
rook-ceph-osd-0-549ccf9c5-swkdk 2/2 Running 0 46h
rook-ceph-osd-1-7c478df4f-ld7m5 2/2 Running 0 46h
rook-ceph-osd-2-659fccd446-xrpjh 2/2 Running 0 46h
Info: Pods that are in 'Running' status
NAME READY STATUS RESTARTS AGE
csi-cephfsplugin-7xxl4 2/2 Running 0 46h
csi-cephfsplugin-fnjpv 2/2 Running 0 46h
csi-cephfsplugin-fswft 2/2 Running 0 46h
csi-cephfsplugin-mptqx 2/2 Running 0 46h
csi-cephfsplugin-provisioner-797459f9bb-vxtmr 5/5 Running 0 46h
csi-cephfsplugin-provisioner-797459f9bb-zvmjp 5/5 Running 0 46h
csi-cephfsplugin-r5zvc 2/2 Running 0 46h
csi-cephfsplugin-wmhht 2/2 Running 0 46h
csi-rbdplugin-95857 2/2 Running 0 46h
csi-rbdplugin-hhsr2 2/2 Running 0 46h
csi-rbdplugin-p6q72 2/2 Running 0 46h
csi-rbdplugin-provisioner-d8cb566dc-bpw5r 5/5 Running 0 46h
csi-rbdplugin-provisioner-d8cb566dc-gbfks 5/5 Running 0 46h
csi-rbdplugin-rq6wh 2/2 Running 0 46h
csi-rbdplugin-sczzb 2/2 Running 0 46h
csi-rbdplugin-v4w6m 2/2 Running 0 46h
rook-ceph-crashcollector-k8s-1-576666d97c-2m49l 1/1 Running 0 46h
rook-ceph-crashcollector-k8s-2-7dbc8ddc4b-g6m2h 1/1 Running 0 46h
rook-ceph-crashcollector-k8s-3-765bb4759-bxh48 1/1 Running 0 46h
rook-ceph-crashcollector-k8s-4-5d7f47b968-wpr9s 1/1 Running 0 46h
rook-ceph-crashcollector-k8s-5-5d5b94bff6-hv4hq 1/1 Running 0 46h
rook-ceph-exporter-k8s-1-64fdc5b8b7-wpjkk 0/1 CrashLoopBackOff 547 (86s ago) 46h
rook-ceph-exporter-k8s-2-6947bb7b8-l62bj 0/1 CrashLoopBackOff 547 (2m36s ago) 46h
rook-ceph-exporter-k8s-3-75c4768b97-fqqlx 1/1 Running 4 (46h ago) 46h
rook-ceph-exporter-k8s-4-5977d4894-d6hv2 1/1 Running 0 46h
rook-ceph-exporter-k8s-5-78bff4f8db-88c2v 1/1 Running 0 46h
rook-ceph-mds-ceph-filesystem-a-58ffcfc468-t5rfw 2/2 Running 0 46h
rook-ceph-mds-ceph-filesystem-b-76bf878889-87k6p 2/2 Running 0 46h
rook-ceph-mgr-a-55878dbfdf-mngtv 2/2 Running 0 46h
rook-ceph-mgr-b-6679d96d7b-ktzz8 1/2 Running 0 46h
rook-ceph-mon-a-548f8978b6-rjtks 2/2 Running 0 46h
rook-ceph-mon-b-64d6c9c8d6-xmmc6 2/2 Running 0 46h
rook-ceph-mon-c-59bdc749d9-kdbfz 2/2 Running 0 46h
rook-ceph-operator-86cb6fdc9d-c6xxf 1/1 Running 1 (46h ago) 46h
rook-ceph-osd-0-549ccf9c5-swkdk 2/2 Running 0 46h
rook-ceph-osd-1-7c478df4f-ld7m5 2/2 Running 0 46h
rook-ceph-osd-2-659fccd446-xrpjh 2/2 Running 0 46h
rook-ceph-rgw-ceph-objectstore-a-5f4bccb7d7-5kmrr 2/2 Running 0 46h
rook-ceph-tools-6fddc74f44-644tg 1/1 Running 0 4d23h
Warning: Pods that are 'Not' in 'Running' status
NAME READY STATUS RESTARTS AGE
Info: checking placement group status
Info: 169 pgs: 169 active+clean; 333 GiB data, 665 GiB used, 2.1 TiB / 2.7 TiB avail; 852 B/s rd, 134 KiB/s wr, 16 op/s
Info: checking if at least one mgr pod is running
rook-ceph-mgr-a-55878dbfdf-mngtv Running k8s-2
rook-ceph-mgr-b-6679d96d7b-ktzz8 Running k8s-1
❯ kubectl rook-ceph ceph status
cluster:
id: 08100b31-1310-4691-b265-dc43665c92c9
health: HEALTH_OK
services:
mon: 3 daemons, quorum a,b,c (age 46h)
mgr: a(active, since 46h), standbys: b
mds: 1/1 daemons up, 1 hot standby
osd: 3 osds: 3 up (since 46h), 3 in (since 7w)
rgw: 1 daemon active (1 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 12 pools, 169 pgs
objects: 226.74k objects, 333 GiB
usage: 665 GiB used, 2.1 TiB / 2.7 TiB avail
pgs: 169 active+clean
io:
client: 155 KiB/s rd, 91 KiB/s wr, 98 op/s rd, 5 op/s wr
Environment:
- OS (e.g. from /etc/os-release): Ubuntu 22.04.2 LTS
- Kernel (e.g. uname -a): Linux k8s-5 5.15.0-60-generic #66-Ubuntu SMP Fri Jan 20 14:29:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
- Cloud provider or hardware configuration: 6-node k3s cluster with 1 OSD each on 3 of the nodes
- Rook version (use rook version inside of a Rook Pod): 1.11.0
- Storage backend version (e.g. for ceph do ceph -v): ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
- Kubernetes version (use kubectl version): 1.26.1
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): k3s
- Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): HEALTH_OK
About this issue
- State: closed
- Created a year ago
- Reactions: 1
- Comments: 17 (13 by maintainers)
You can see the pods have been restarting over 500 times since I updated to 1.11.
Looking into this a bit more, I do see that the exporter expects the path /var/lib/rook/rook-ceph/rook-ceph.conf to exist on the host. That host path does not exist until an OSD is created on the node; other daemons such as mon/mgr do not create it.
As much as possible, the daemons should be able to run with other arguments instead of relying on the ceph.conf path. For example, these two env vars are used by most of the daemons when running in rook:
If the exporter requires a ceph.conf, then it will need an init container to generate one under its own config dir, instead of relying on the file generated by the OSD.
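A minimal sketch of both points, assuming SSH access to the nodes; the MON_HOST variable and the target directory below are hypothetical, not the actual Rook implementation:

# Confirm the generated config only exists on nodes that run an OSD (SSH access assumed)
❯ ssh k8s-1 ls /var/lib/rook/rook-ceph/   # non-OSD node: path is missing, exporter crashes
❯ ssh k8s-3 ls /var/lib/rook/rook-ceph/   # OSD node: config is present, exporter runs

# Rough shape of what such an init container could run to build its own minimal config
# instead of depending on the OSD-generated hostPath (MON_HOST is a hypothetical env var)
mkdir -p /var/lib/ceph/exporter
cat > /var/lib/ceph/exporter/ceph.conf <<EOF
[global]
mon_host = ${MON_HOST}
EOF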
For the large CRDs issue, does kubectl replace help resolve it? Hopefully this will fix it according to #11772.
Sorry, missed that. Need to look into it, thanks for bringing it up. And can you try deleting the failing pods?
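For reference, those two suggestions correspond to commands along these lines (a sketch: the crds.yaml path assumes the example manifests shipped with the Rook release, and the pod names are taken from the listing above):

❯ kubectl replace -f deploy/examples/crds.yaml
❯ kubectl -n rook-ceph delete pod rook-ceph-exporter-k8s-1-64fdc5b8b7-wpjkk rook-ceph-exporter-k8s-2-6947bb7b8-l62bj

Deleting the pods only forces a fresh start; their per-node deployments will recreate them.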