rook: Ceph monitors in crash loop
Is this a bug report or feature request?
- Bug Report
Deviation from expected behavior: With a new deployment, the Ceph mons keep crashing and the cluster is never created completely, or never reaches a healthy state.
Expected behavior: The Rook operator and cluster deployment completes successfully.
How to reproduce it (minimal and precise): Difficult to say: the same deployment method succeeds on a different Kubernetes cluster that runs a different Kubernetes release, while the hardware and host OS version are the same.
Deployment method:
- Both the operator and the cluster are deployed using the Helm charts, with mostly default values for the Ceph cluster chart and a few non-default values in the operator chart.
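For reference, a rough sketch of what that two-chart deployment looks like, assuming the standard rook-release Helm repository; the values file names are placeholders, not the exact files used here:

```sh
# Sketch only: standard Rook operator + cluster Helm installs (values files are placeholders).
helm repo add rook-release https://charts.rook.io/release
helm repo update

# Operator chart (a few non-default values, per the description above)
helm install --create-namespace --namespace rook-ceph \
  rook-ceph rook-release/rook-ceph -f operator-values.yaml

# Cluster chart (mostly default values)
helm install --namespace rook-ceph \
  rook-ceph-cluster rook-release/rook-ceph-cluster -f cluster-values.yaml
```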
File(s) to submit:
- Not entirely sure which logs would be relevant, so they can be provided upon request. However, for the crashing mon pods, the crash occurs right after this:
debug 2022-04-19T22:46:16.798+0000 7f2010d9c700  0 mon.a@0(leader) e5 handle_command mon_command({"prefix": "osd pool create", "format": "json", "pool": "device_health_metrics", "pg_num": 1, "pg_num_min": 1} v 0) v1
debug 2022-04-19T22:46:16.798+0000 7f2010d9c700  0 log_channel(audit) log [INF] : from='mgr.54142 ' entity='mgr.a' cmd=[{"prefix": "osd pool create", "format": "json", "pool": "device_health_metrics", "pg_num": 1, "pg_num_min": 1}]: dispatch
debug 2022-04-19T22:46:38.257+0000 7f20175a9700 -1 received signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
debug 2022-04-19T22:46:38.257+0000 7f20175a9700 -1 mon.a@0(leader) e5 *** Got Signal Terminated ***
debug 2022-04-19T22:46:38.257+0000 7f20175a9700  1 mon.a@0(leader) e5 shutdown
So right after the leader mon tries to run the osd pool create command, it crashes; the liveness probe then kicks in, which results in the pod shutting down. This happens to every mon pod that becomes the leader after an election.
If full mon logs are required, those can be provided, along with the operator logs. I would like to understand exactly what happens after that command, especially from a network connectivity perspective, i.e. whether the mon needs to connect to a different pod/service which could be failing. Based on my investigation, I could not find any network connectivity issue between any of the pods or services.
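For anyone triaging, a sketch of how these logs are usually collected, assuming the default rook-ceph namespace and standard Rook object names:

```sh
# Sketch: collect mon and operator logs (default "rook-ceph" namespace and
# standard Rook deployment names assumed).
kubectl -n rook-ceph get pods -o wide

# Logs of a crashing mon (deployment rook-ceph-mon-<id>), including the
# previous container instance from before the restart:
kubectl -n rook-ceph logs deploy/rook-ceph-mon-a
kubectl -n rook-ceph logs deploy/rook-ceph-mon-a --previous

# Operator logs:
kubectl -n rook-ceph logs deploy/rook-ceph-operator
```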
Environment:
- OS : Fedora CoreOS 35.20220327.3.0
- Kernel : 5.16.16-200.fc35.x86_64
- Cloud provider or hardware configuration: Dell FC640/MX740c
- Rook version: 1.8.8/1.9.0
- Storage backend version : 16.2.7
- Kubernetes version : 1.23.5
- Kubernetes cluster type: Baremetal (self-managed/vanilla)
- Storage backend status : Keeps timing out while the mons are crashing; when it does respond, it reports the number of mons down:
cluster:
  id:     fa1fee99-9448-4767-983c-f495633e3a7a
  health: HEALTH_WARN
          2/5 mons down, quorum b,c,d

services:
  mon: 5 daemons, quorum (age 25h), out of quorum: a, b, c, d, e
  mgr: b(active, since 2m), standbys: a
  osd: 6 osds: 6 up (since 3m), 6 in (since 3m)

data:
  pools:   0 pools, 0 pgs
  objects: 0 objects, 0 B
  usage:   29 MiB used, 4.9 TiB / 4.9 TiB avail
  pgs:
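In case it is useful for others comparing symptoms, a sketch of how the cluster state can be pulled from the toolbox, assuming the rook-ceph-tools deployment from the Rook examples is running:

```sh
# Sketch: query cluster state from the Rook toolbox (rook-ceph-tools assumed deployed).
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health detail
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph mon stat
```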
About this issue
- State: open
- Created 2 years ago
- Reactions: 5
- Comments: 57 (21 by maintainers)
Commits related to this issue
- common: use close_range on Linux Fix rook/rook#10110, which occurs when _SC_OPEN_MAX/RLIMIT_NOFILE is set to very large values (2^30), leaving fork_function pegging a core busylooping. If we're not ... — committed to cofractal/ceph by edef1c a year ago
- common: use close_range on Linux Fix rook/rook#10110, which occurs when _SC_OPEN_MAX/RLIMIT_NOFILE is set to very large values (2^30), leaving fork_function pegging a core busylooping. The glibc wra... — committed to cofractal/ceph by edef1c a year ago
- (profile::core::docker) limit dockerd to 100k fd on el9 This is a workaround / fix for ceph mons crashing on el9. See: https://github.com/rook/rook/issues/10110#issuecomment-1464898937 — committed to lsst-it/lsst-control by jhoblitt a year ago
Howdy,
did some investigation into this and the issue seems to be caused by this commit to systemd 240: https://github.com/systemd/systemd/commit/a8b627aaed409a15260c25988970c795bf963812
Before systemd v240, systemd would just leave `fs.nr_open` as-is because it had no mechanism to set a safe upper limit for it. The kernel's hard-coded default value for the maximum number of open files is 1048576.

Starting from systemd v240, if you set `LimitNOFILE=infinity` in `dockerd.service` or `containerd.service`, this value in most cases will be set to ~1073741816 (`INT_MAX` for x86_64 divided by two).

Starting from this commit mentioned by @gpl: https://github.com/containerd/containerd/pull/4475/commits/c691c36614622b205a859ae7e656badb4a553076 containerd is using "infinity", i.e. ~1073741816.

This means that there are 3 orders of magnitude more file descriptors (that are potentially open) to iterate over and try to close (or to set the `CLOEXEC` bit on, so they are closed automatically upon `fork()`/`exec()`). This is why some people have seen some of the Rook clusters come back to life after a few days.

An easy fix is to just set `LimitNOFILE` in the systemd service to, for example, 1048576 or any other number optimized for your use-case.

@gpl and I spent some time tracking down the offending code, and our patch appears to fix the described behaviour. I'll PR it once I've handled the required Ceph bureaucracy; I'm currently blocked on having a Ceph tracker account, which requires manual approval by an administrator.
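A quick way to check whether a cluster is affected is to look at the open-files limit the runtime actually hands to the mon process; a sketch (the pod label and service name are the usual Rook/containerd defaults, adjust as needed):

```sh
# Sketch: check the effective "Max open files" limit inside a mon container.
# A value around 1073741816 indicates the LimitNOFILE=infinity behaviour
# described above; ~1048576 is the older kernel default.
MON_POD=$(kubectl -n rook-ceph get pod -l app=rook-ceph-mon -o name | head -n1)
kubectl -n rook-ceph exec "$MON_POD" -- sh -c 'grep "open files" /proc/1/limits'

# The same check on the node, for the container runtime service itself:
systemctl show containerd -p LimitNOFILE
```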
This works! (Oracle Linux 9.2, k8s 1.28.1, Rook Ceph 1.12.3)

cat /etc/systemd/system/containerd.service.d/LimitNOFILE.conf
[Service]
LimitNOFILE=1048576
Got the same issue on AlmaLinux 9.2 with systemd-252-13.el9_2.src.rpm and a 6.1.28 kernel.
After trying to create a pool, the monitor pegs 100% CPU on ms_dispatch:
How to fix?
Create the file:
Run on every server in the cluster:
After all of that:
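The attached snippets did not survive in this thread, so here is a hedged reconstruction of what those steps typically look like, based on the LimitNOFILE workaround described in the earlier comments (file path, value, and labels are examples, not the commenter's exact files):

```sh
# Sketch of the workaround, per the earlier comments (paths/values are examples).

# 1. Create the drop-in that caps containerd's open-files limit:
mkdir -p /etc/systemd/system/containerd.service.d
cat > /etc/systemd/system/containerd.service.d/LimitNOFILE.conf <<'EOF'
[Service]
LimitNOFILE=1048576
EOF

# 2. Run on every server in the cluster:
systemctl daemon-reload
systemctl restart containerd

# 3. After all of that, restart the affected mon pods so they are recreated
#    with the new limit (label assumed to be the standard Rook mon label):
kubectl -n rook-ceph delete pod -l app=rook-ceph-mon
```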
After completing the following procedure, the Ceph cluster was able to operate normally with multiple monitors. (The commands assume the default Helm deployment.)
`Connected` or `Ready` state.

In my experience, when the monitor performs long-running tasks related to mds, including `osd pool create`, the task cannot be completed due to various interrupting factors such as monitor leader election and healthCheck timeout. For this reason, forcing the number of monitors to 1 at the time of initial deployment seems to be the easiest solution at the moment.

It's a little different from the current flow, but by using the cluster settings below, I was able to solve the problem of the monitor constantly restarting.
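The original settings snippet is not reproduced here; as a rough sketch of the idea only, assuming the standard rook-ceph-cluster Helm chart and its cephClusterSpec value path rather than the commenter's exact settings:

```sh
# Sketch: start with a single mon so "osd pool create" can finish without a
# leader election interrupting it, then scale the mons back up afterwards.
# Chart and value names assume the standard rook-ceph-cluster Helm chart.
helm upgrade --install rook-ceph-cluster rook-release/rook-ceph-cluster \
  --namespace rook-ceph \
  --set cephClusterSpec.mon.count=1

# Once the cluster is HEALTH_OK and the initial pools exist, raise the count:
helm upgrade rook-ceph-cluster rook-release/rook-ceph-cluster \
  --namespace rook-ceph \
  --set cephClusterSpec.mon.count=3
```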
We are also encountering an issue with similar symptoms. We are using the Rook/Ceph quickstart with cluster-on-pvc.yaml on Azure (non-AKS!) with the official Azure CSI driver and fc36. With containerd downgraded to 1.5.x (e.g. 1.5.9), Rook fully deploys. When using containerd 1.6.0 and above, one mon is always out of quorum. Sadly, it isn't isolated to containerd, since it works fine on e.g. GKE on the rapid release channel, which uses containerd 1.6.6. Changing the kernel doesn't seem to help; we tried 5.10, 5.15, and 5.18.
We can’t see any obvious issue in the logs, but we’re happy to provide logs for any combination of containerd/kernel/other packages.
I am also facing the same issue.
Lab Cluster on Hyper-V VMs running Rocky Linux 9, provisioned via kubespray 1.20.0.
`containerd` as runtime. Tried the CNIs `calico` and `weave` during troubleshooting and both experience the same behavior.

After being pointed to this issue I rebuilt the cluster with `cri-o` as container runtime. I can confirm that this mitigates the problem; the cluster is coming up on the first attempt.

Environment:
- OS: Rocky Linux 9.1 (Blue Onyx)
- Kernel (`uname -a`): Linux 5.14.0-162.6.1.el9_1.0.1.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Nov 28 18:44:09 UTC 2022 x86_64 GNU/Linux
- Cloud provider or hardware configuration: Hyper-V VMs
- Rook version (`rook version` inside of a Rook Pod): v1.10.7
- Storage backend version (`ceph -v`): 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
- Kubernetes version (`kubectl version`): 1.24.6
- Kubernetes cluster type: n/a
- Storage backend status (`ceph health` in the Rook Ceph toolbox): TIMEOUT alternating with HEALTH_WARN

Everything works fine when running as a container using cri-o.
Create 3 mons normally and check the mon node processes after initialization: you will find that ms_dispatch and fn_monstore have a 100% CPU problem, and the pg pool is being created at this time. Quickly killing these two processes brings the Ceph cluster back to normal.
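To see which mon threads are spinning, something like the following on the node hosting the mon can help; a sketch, assuming a single ceph-mon per host and standard pidof/top/ps tools:

```sh
# Sketch: identify the busy-looping mon threads on the host
# (assumes one ceph-mon process on this node).
MON_PID=$(pidof ceph-mon)

# Per-thread CPU usage; look for ms_dispatch / fn_monstore near 100%:
top -H -p "$MON_PID" -b -n 1 | head -n 30

# Alternative using the ps thread view:
ps -L -o tid,pcpu,comm -p "$MON_PID" | sort -k2 -nr | head
```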
We also had that issue in our environment using Flatcar OS. The problem occurred when configuring a bare-metal cluster using Kubespray.

When deploying the cluster, setting the containerd version (`containerd_version`) to v1.5.9 did not solve the problem. In Kubespray v2.20.0, the default `containerd` version is v1.6.8.

Environment: