rook: Ceph monitors in crash loop

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior: With a new deployment, the Ceph mons keep crashing and the cluster never finishes creating or never reaches a healthy state.

Expected behavior: The Rook operator and cluster deployment complete successfully.

How to reproduce it (minimal and precise): Difficult to say, as the same deployment method succeeds on a different Kubernetes cluster running a different Kubernetes release, while the hardware and host OS version are the same.

Deployment method:

  • Both the operator and the cluster are deployed with Helm charts; the Ceph cluster chart uses mostly default values, while the operator chart has a few non-default values.

File(s) to submit:

  • Not entirely sure which logs would be relevant, so they can be provided upon request. However, for the crashing mon pods, the crash occurs right after this:
debug 2022-04-19T22:46:16.798+0000 7f2010d9c700  0 mon.a@0(leader) e5 handle_command mon_command({"prefix": "osd pool create", "format": "json", "pool": "device_health_metrics", "pg_num": 1, "pg_num_min": 1} v 0) v1
debug 2022-04-19T22:46:16.798+0000 7f2010d9c700  0 log_channel(audit) log [INF] : from='mgr.54142 ' entity='mgr.a' cmd=[{"prefix": "osd pool create", "format": "json", "pool": "device_health_metrics", "pg_num": 1, "pg_num_min": 1}]: dispatch
debug 2022-04-19T22:46:38.257+0000 7f20175a9700 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
debug 2022-04-19T22:46:38.257+0000 7f20175a9700 -1 mon.a@0(leader) e5 *** Got Signal Terminated ***
debug 2022-04-19T22:46:38.257+0000 7f20175a9700  1 mon.a@0(leader) e5 shutdown

So right after the leader mon tries to run the osd pool create command, it crashes: the liveness probe kicks in, which results in the pod shutting down. This happens to every mon pod that becomes leader after an election.

Full mon logs can be provided if required, and operator logs as well. I would like to understand exactly what happens after that command, especially from a network connectivity perspective, i.e. whether the mon needs to connect to a different pod/service that could be failing. Based on my investigation, I could not find any network connectivity issue between any of the pods or services.
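
A quick way to confirm from the Kubernetes side that the liveness probe is what terminates the leader mon (a rough check; the rook-ceph namespace and the app=rook-ceph-mon label assume a default Rook deployment):

# namespace and labels below assume a default Rook install; adjust if yours differ
$ kubectl -n rook-ceph get events --field-selector reason=Unhealthy --sort-by=.lastTimestamp
$ kubectl -n rook-ceph describe pod -l app=rook-ceph-mon | grep -A3 Liveness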

Environment:

  • OS : Fedora CoreOS 35.20220327.3.0
  • Kernel : 5.16.16-200.fc35.x86_64
  • Cloud provider or hardware configuration: Dell FC640/MX740c
  • Rook version: 1.8.8/1.9.0
  • Storage backend version : 16.2.7
  • Kubernetes version : 1.23.5
  • Kubernetes cluster type: Baremetal (self-managed/vanilla)
  • Storage backend status : Keeps timing out while the mons are crashing; when it responds, it shows the number of mons down:
  cluster:
    id:     fa1fee99-9448-4767-983c-f495633e3a7a
    health: HEALTH_WARN
            2/5 mons down, quorum b,c,d

  services:
    mon: 5 daemons, quorum  (age 25h), out of quorum: a, b, c, d, e
    mgr: b(active, since 2m), standbys: a
    osd: 6 osds: 6 up (since 3m), 6 in (since 3m)

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   29 MiB used, 4.9 TiB / 4.9 TiB avail
    pgs:

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 5
  • Comments: 57 (21 by maintainers)

Most upvoted comments

Howdy,

I did some investigation into this, and the issue seems to be caused by this commit that went into systemd 240: https://github.com/systemd/systemd/commit/a8b627aaed409a15260c25988970c795bf963812

Before systemd v240, systemd would just leave fs.nr_open as-is because it had no mechanism to set a safe upper limit for it. The kernel's hard-coded default for the maximum number of open files is 1048576.

Starting from systemd v240, if you set LimitNOFILE=infinity in dockerd.service or containerd.service, this value in most cases will be set to ~1073741816 (INT_MAX for x86_64 divided by two).

Starting from the commit mentioned by @gpl (https://github.com/containerd/containerd/pull/4475/commits/c691c36614622b205a859ae7e656badb4a553076), containerd uses “infinity”, i.e. ~1073741816.

This means there are three orders of magnitude more (potentially open) file descriptors to iterate over and close (or to set the CLOEXEC bit on, so they are closed automatically upon fork()/exec()). This is why some people have seen their Rook clusters come back to life after a few days.
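
If you want to check whether a node is affected, you can inspect the limit containerd actually runs with; a quick sketch, assuming containerd is managed as a systemd service:

# LimitNOFILE configured on the containerd unit ("infinity" or ~1073741816 indicates the problem)
$ systemctl show containerd --property=LimitNOFILE
# limit actually applied to the running containerd process
$ grep "Max open files" /proc/$(pidof containerd)/limits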

An easy fix is to set LimitNOFILE in the systemd service to, for example, 1048576, or to any other number suited to your use case.

@gpl and I spent some time tracking down the offending code, and our patch appears to fix the described behaviour. I’ll PR it once I’ve handled the required Ceph bureaucracy; I’m currently blocked on having a Ceph tracker account, which requires manual approval by an administrator.

This works! (Oracle Linux 9.2, k8s 1.28.1, Rook Ceph 1.12.3)

$ cat /etc/systemd/system/containerd.service.d/LimitNOFILE.conf
[Service]
LimitNOFILE=1048576

Got the same issue on AlmaLinux 9.2 with systemd-252-13.el9_2.src.rpm and kernel 6.1.28.

After trying to create a pool, the monitor pegs a CPU at 100% in ms_dispatch:

$ ceph -s
  cluster:
    id:     738369d4-f7cf-11ed-b2ad-e41d2d291571
    health: HEALTH_WARN
            no active mgr
            1/3 mons down, quorum mb-5,mb-7

  services:
    mon: 3 daemons, quorum mb-5,mb-7 (age 92s), out of quorum: mb-6
    mgr: no daemons active (since 60s)
    osd: 12 osds: 9 up (since 24m), 9 in (since 11h)

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:

How to fix?

Create the file:

$ cat /etc/systemd/system/containerd.service.d/LimitNOFILE.conf
[Service]
LimitNOFILE=1048576

Run on every server in the cluster:

systemctl daemon-reload
systemctl restart containerd

systemctl restart ... mon
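
To verify the new limit actually reached the mon daemons (a rough check; assumes the ceph-mon process is visible from the host and only one runs per node):

$ grep "Max open files" /proc/$(pidof ceph-mon)/limits   # should now show 1048576, not ~1073741816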

After that:

$ ceph -s
  cluster:
    id:     738369d4-f7cf-11ed-b2ad-e41d2d291571
    health: HEALTH_WARN
            1 pool(s) do not have an application enabled

  services:
    mon: 3 daemons, quorum mb-5,mb-6,mb-7 (age 4m)
    mgr: mb-6.pdgoko(active, since 5m), standbys: mb-5.cjxlid
    osd: 12 osds: 12 up (since 4m), 12 in (since 4m)

  data:
    pools:   1 pools, 1 pgs
    objects: 2 objects, 577 KiB
    usage:   143 MiB used, 1.2 TiB / 1.2 TiB avail
    pgs:     1 active+clean

  io:
    client:   767 B/s rd, 43 KiB/s wr, 0 op/s rd, 1 op/s wr

After the following procedure, the Ceph cluster was able to operate normally with multiple monitors. (The commands assume a default Helm deployment.)

  1. Change the number of monitors to 1, then deploy the Ceph cluster through Helm, etc.
    helm upgrade --install "rook-ceph-cluster" "rook-ceph/rook-ceph-cluster" \
        --namespace "rook-ceph" \
        --set "cephClusterSpec.mon.count=1"
    
  2. Wait until all Ceph CRs (block pool, filesystem, object store, cluster) are in the Connected or Ready state.
    kubectl --namespace "rook-ceph" get "cephblockpool" "ceph-blockpool" --output jsonpath --template '{.status.phase}'  # should be "Ready"
    kubectl --namespace "rook-ceph" get "cephfilesystem" "ceph-filesystem" --output jsonpath --template '{.status.phase}'  # should be "Ready"
    kubectl --namespace "rook-ceph" get "cephobjectstore" "ceph-objectstore" --output jsonpath --template '{.status.phase}'  # should be "Connected"
    kubectl --namespace "rook-ceph" get "cephcluster" "rook-ceph" --output jsonpath --template '{.status.phase}'  # should be "Ready"
    
    # more convenient way
    while :; do
        COMPLETED=1
        for CR in "blockpool" "filesystem" "objectstore" "cluster"; do
            PHASE=$(
                kubectl --namespace "rook-ceph" get "ceph$CR" \
                    --output jsonpath --template '{.items[0].status.phase}' \
                    2>/dev/null
            )
            case "$PHASE" in
            "Connected" | "Ready")
                continue
                ;;
            *)
                COMPLETED=0
                break
                ;;
            esac
        done
    
        if [ "$COMPLETED" -eq 1 ]; then
            break
        fi
    
        # wait a bit before checking again
        sleep 5
    done
    
  3. Once all initial deployment tasks are completed, change the number of monitors to 3 or more and update the Ceph cluster again through Helm or the like.
    helm upgrade --install "rook-ceph-cluster" "rook-ceph/rook-ceph-cluster" \
        --namespace "rook-ceph" \
        --set "cephClusterSpec.mon.count=3"
    

In my experience, when the monitor performs long-running tasks such as osd pool create (including MDS-related ones), the task cannot be completed because of interrupting factors such as monitor leader elections and healthCheck timeouts. For this reason, forcing the number of monitors to 1 at initial deployment seems to be the easiest workaround at the moment.

It’s a little different from the flow above, but by using the cluster settings below I was able to stop the monitors from constantly restarting.


# All values below are taken from the CephCluster CRD
# More information can be found at [Ceph Cluster CRD](/Documentation/CRDs/ceph-cluster-crd.md)
cephClusterSpec:
  mon:
    # Set the number of mons to be started. Generally recommended to be 3.
    # For highest availability, an odd number of mons should be specified.
    count: 1
    # The mons should be on unique nodes. For production, at least 3 nodes are recommended for this reason.
    # Mons should only be allowed on the same node for test environments where data loss is acceptable.
    allowMultiplePerNode: false

  # The option to automatically remove OSDs that are out and are safe to destroy.
  removeOSDsIfOutAndSafeToRemove: false

  # Configure the healthcheck and liveness probes for ceph pods.
  # Valid values for daemons are 'mon', 'osd', 'status'
  healthCheck:
    daemonHealth:
      mon:
        disabled: false
        interval: 45s
        timeout: 2h
      osd:
        disabled: false
        interval: 60s
        timeout: 2h
      status:
        disabled: false
        interval: 60s
        timeout: 2h
    # Change pod liveness probe, it works for all mon, mgr, and osd pods.
    livenessProbe:
      rgw:
        disabled: false
        probe:
          failureThreshold: 120
          initialDelaySeconds: 7200
          periodSeconds: 60
          successThreshold: 1
          timeoutSeconds: 60
      mds:
        disabled: false
        probe:
          failureThreshold: 120
          initialDelaySeconds: 7200
          periodSeconds: 60
          successThreshold: 1
          timeoutSeconds: 60
      mon:
        disabled: false
        probe:
          failureThreshold: 120
          initialDelaySeconds: 7200
          periodSeconds: 60
          successThreshold: 1
          timeoutSeconds: 60
      mgr:
        disabled: false
        probe:
          failureThreshold: 120
          initialDelaySeconds: 7200
          periodSeconds: 60
          successThreshold: 1
          timeoutSeconds: 60
      osd:
        disabled: false
        probe:
          failureThreshold: 120
          initialDelaySeconds: 7200
          periodSeconds: 60
          successThreshold: 1
          timeoutSeconds: 60
    startupProbe:
      rgw:
        disabled: true
      mds:
        disabled: true
      mon:
        disabled: true
      mgr:
        disabled: true
      osd:
        disabled: true

We are also encountering an issue with similar symptoms. We are using the Rook/Ceph quickstart with cluster-on-pvc.yaml on Azure (non-AKS!) with the official Azure CSI driver and fc36. After downgrading containerd to 1.5.x (e.g. 1.5.9), Rook fully deploys; with containerd 1.6.0 and above, one mon is always out of quorum. Sadly, it isn’t isolated to containerd, since it works fine on e.g. GKE’s rapid release channel, which uses containerd 1.6.6. Changing the kernel doesn’t seem to help; we tried 5.10, 5.15, and 5.18.

We can’t see any obvious issue in the logs, but we’re happy to provide logs for any combination of containerd/kernel/other packages.

I am also facing the same issue.

Lab cluster on Hyper-V VMs running Rocky Linux 9, provisioned via Kubespray 1.20.0, with containerd as the runtime. I tried the Calico and Weave CNIs during troubleshooting and both exhibit the same behavior.

After being pointed to this issue, I rebuilt the cluster with CRI-O as the container runtime. I can confirm that this mitigates the problem; the cluster comes up on the first attempt.

Environment:

  • OS (e.g. from /etc/os-release): Rocky Linux 9.1 (Blue Onyx)
  • Kernel (e.g. uname -a): Linux 5.14.0-162.6.1.el9_1.0.1.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Nov 28 18:44:09 UTC 2022 x86_64 GNU/Linux
  • Cloud provider or hardware configuration: Hyper-V VMs
  • Rook version (use rook version inside of a Rook Pod): v1.10.7
  • Storage backend version (e.g. for ceph do ceph -v): 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
  • Kubernetes version (use kubectl version): 1.24.6
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): n/a
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): TIMEOUT alternating with HEALTH_WARN

Everything works fine when running the daemons as containers under CRI-O.

Create 3 mons normally and check the mon processes on the node after initialization: you will find the ms_dispatch and fn_monstore threads at 100% CPU while the PG pool is being created. Quickly kill these two and the Ceph cluster becomes normal.
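
A quick way to spot those spinning threads on a node (a sketch; assumes the ceph-mon process is visible from the host):

# per-thread CPU view of ceph-mon; look for ms_dispatch / fn_monstore pegged near 100%
$ top -H -p "$(pidof ceph-mon | tr ' ' ',')"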

We also hit this issue in our environment using Flatcar OS, while configuring a bare-metal cluster with Kubespray.

Pinning the Kubespray containerd_version to v1.5.9 when deploying the cluster did not solve the problem.

  • In Kubespray v2.20.0, the default containerd version is v1.6.8.

Environment:

  • OS : Linux 5.15.70-flatcar
  • Kernel : 5.15.70-flatcar
  • Cloud provider or hardware configuration: x86_64 Intel(R) Xeon(R) Bronze 3204 CPU @ 1.90GHz GenuineIntel
  • Rook version: 1.10.3
  • Storage backend version :
  • Kubernetes version : v1.25.3
  • Kubernetes cluster type: Baremetal (self-managed/vanilla) - Kubespray v2.20.0