rook: Pool creation makes mon daemons unresponsive

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior: When a pool is created, either via the Ceph CLI or via a Kubernetes object, the cluster seems to delegate the task to a mon, which immediately stops responding. In most cases the pool creation does get done eventually, but it takes at least hours, usually days, and Ceph CLI commands are unresponsive during that time.

Expected behavior: The pool is created in a reasonable amount of time and the mon does not become unresponsive.

How to reproduce it (minimal and precise): Install the cluster and try to create a pool; even the creation of the built-in .mgr pool is prone to this. A minimal pool definition is sketched below.
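For reference, a minimal pool object of the kind that triggers this could look as follows; the pool name and replication settings here are illustrative placeholders, not values taken from the affected cluster:

```yaml
# Hypothetical minimal CephBlockPool; applying an object like this (or running
# a pool create through the Ceph CLI in the toolbox) is enough to trigger the
# unresponsive-mon behavior described above.
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: test-pool        # illustrative name only
  namespace: rook-ceph
spec:
  failureDomain: host
  replicated:
    size: 3
```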

File(s) to submit:

  • Cluster CR (custom resource), typically called cluster.yaml, if necessary
Ceph Cluster CRD:

```
Name:         rook-ceph
Namespace:    rook-ceph
Labels:       <none>
Annotations:  <none>
API Version:  ceph.rook.io/v1
Kind:         CephCluster
Metadata:
  Creation Timestamp:  2023-01-29T21:52:47Z
  Finalizers:
    cephcluster.ceph.rook.io
  Generation:  4
  Managed Fields:
    API Version:  ceph.rook.io/v1
    Fields Type:  FieldsV1
    fieldsV1:     f:spec: .: f:cephVersion: .: f:image: f:cleanupPolicy: .: f:sanitizeDisks: .: f:dataSource: f:iteration: f:method: f:crashCollector: f:dashboard: .: f:enabled: f:ssl: f:dataDirHostPath: f:disruptionManagement: .: f:machineDisruptionBudgetNamespace: f:managePodBudgets: f:osdMaintenanceTimeout: f:healthCheck: .: f:daemonHealth: .: f:mon: f:osd: f:status: f:livenessProbe: .: f:mgr: f:mon: f:osd: f:startupProbe: .: f:mgr: f:mon: f:osd: f:logCollector: .: f:enabled: f:maxLogSize: f:periodicity: f:mgr: .: f:count: f:modules: f:mon: f:monitoring: f:network: .: f:connections: .: f:compression: f:encryption: f:priorityClassNames: .: f:mgr: f:mon: f:osd: f:storage: .: f:useAllDevices: f:useAllNodes: f:waitTimeoutForHealthyOSDInMinutes:
    Manager:      kubectl-create
    Operation:    Update
    Time:         2023-01-29T21:52:47Z
    API Version:  ceph.rook.io/v1
    Fields Type:  FieldsV1
    fieldsV1:     f:metadata: f:finalizers: .: v:"cephcluster.ceph.rook.io": f:spec: f:external: f:healthCheck: f:daemonHealth: f:osd: f:interval: f:status: f:interval: f:security: .: f:kms:
    Manager:      rook
    Operation:    Update
    Time:         2023-01-29T21:52:47Z
    API Version:  ceph.rook.io/v1
    Fields Type:  FieldsV1
    fieldsV1:     f:spec: f:healthCheck: f:daemonHealth: f:mon: f:interval: f:timeout: f:livenessProbe: f:mon: f:probe: .: f:timeoutSeconds: f:startupProbe: f:mon: f:probe: .: f:timeoutSeconds: f:mon: f:count:
    Manager:      kubectl-edit
    Operation:    Update
    Time:         2023-01-31T20:40:42Z
    API Version:  ceph.rook.io/v1
    Fields Type:  FieldsV1
    fieldsV1:     f:status: .: f:ceph: .: f:capacity: .: f:bytesAvailable: f:bytesTotal: f:bytesUsed: f:lastUpdated: f:details: .: f:error: .: f:message: f:severity: f:health: f:lastChanged: f:lastChecked: f:previousHealth: f:conditions: f:message: f:observedGeneration: f:phase: f:state: f:storage: .: f:deviceClasses: f:version: .: f:image: f:version:
    Manager:      rook
    Operation:    Update
    Subresource:  status
    Time:         2023-02-01T08:45:38Z
  Resource Version:  5708003
  UID:               e684f005-9d2f-47bd-b5f2-ec12e474f627
Spec:
  Ceph Version:
    Image:  quay.io/ceph/ceph:v17.2.5
  Cleanup Policy:
    Sanitize Disks:
      Data Source:  zero
      Iteration:    1
      Method:       quick
  Crash Collector:
  Dashboard:
    Enabled:  true
    Ssl:      true
  Data Dir Host Path:  /var/lib/rook
  Disruption Management:
    Machine Disruption Budget Namespace:  openshift-machine-api
    Manage Pod Budgets:                   true
    Osd Maintenance Timeout:              30
  External:
  Health Check:
    Daemon Health:
      Mon:
        Interval:  1m
        Timeout:   2h
      Osd:
        Interval:  1m0s
      Status:
        Interval:  1m0s
    Liveness Probe:
      Mgr:
      Mon:
        Probe:
          Timeout Seconds:  2400
      Osd:
    Startup Probe:
      Mgr:
      Mon:
        Probe:
          Timeout Seconds:  2400
      Osd:
  Log Collector:
    Enabled:       true
    Max Log Size:  500M
    Periodicity:   daily
  Mgr:
    Count:  2
    Modules:
      Enabled:  true
      Name:     pg_autoscaler
  Mon:
    Count:  4
  Monitoring:
  Network:
    Connections:
      Compression:
      Encryption:
  Priority Class Names:
    Mgr:  system-cluster-critical
    Mon:  system-node-critical
    Osd:  system-node-critical
  Security:
    Kms:
  Storage:
    Use All Devices:                          true
    Use All Nodes:                            true
    Wait Timeout For Healthy OSD In Minutes:  10
Status:
  Ceph:
    Capacity:
      Bytes Available:  122405584896
      Bytes Total:      122413027328
      Bytes Used:       7442432
      Last Updated:     2023-01-31T12:20:59Z
    Details:
      Error:
        Message:   failed to get status. . timed out: exit status 1
        Severity:  Urgent
    Health:           HEALTH_ERR
    Last Changed:     2023-02-01T02:43:37Z
    Last Checked:     2023-02-01T08:45:22Z
    Previous Health:  HEALTH_WARN
  Conditions:
    Last Heartbeat Time:   2023-01-31T22:32:53Z
    Last Transition Time:  2023-01-31T22:32:53Z
    Message:               Configuring Ceph OSDs
    Reason:                ClusterProgressing
    Status:                True
    Type:                  Progressing
    Last Heartbeat Time:   2023-02-01T08:45:38Z
    Last Transition Time:  2023-02-01T08:42:36Z
    Message:               Failed to configure ceph cluster
    Reason:                ClusterCreated
    Status:                False
    Type:                  Ready
  Message:              Failed to configure ceph cluster
  Observed Generation:  2
  Phase:                Ready
  State:                Error
  Storage:
    Device Classes:
      Name:  ssd
      Name:  hdd
  Version:
    Image:    quay.io/ceph/ceph:v17.2.5
    Version:  17.2.5-0
Events:  <none>
```

Logs to submit: I don't see any logs that indicate errors here, apart from those referring to the unresponsiveness of the mon pod.

Cluster Status to submit: I can't get a status right now because a pool creation is running, but before I started the pool creation it was "HEALTH_OK".

Environment:

  • OS (e.g. from /etc/os-release): Gentoo Linux
  • Kernel (e.g. uname -a): 5.15.75
  • Cloud provider or hardware configuration: Hardware, 6 RockPi 4B SBCs with NVMe drives on the PCIe lane.
  • Rook version (use rook version inside of a Rook Pod): 1.10.10
  • Storage backend version (e.g. for ceph do ceph -v): 17.2.5
  • Kubernetes version (use kubectl version): 1.26.0
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): Kubeadm
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 22 (7 by maintainers)

Most upvoted comments

I know that creating host-based pools on the HDDs can't work in this topology. But it should work on the SSDs, I think, and I would like to have the .mgr pool on those SSDs, to keep that SPOF only where it needs to be; a sketch of the kind of pool I mean follows below. On the other hand, if your theory is true, why does Ceph try to create a pool where it obviously isn't going to work?
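Something like the following is what I have in mind for an SSD-only pool; the name and replica count are placeholders, and it assumes the OSDs already report the ssd device class, as the cluster status shows:

```yaml
# Hypothetical SSD-only pool sketch: host failure domain, restricted to the
# ssd device class so the data never lands on the HDD-backed OSDs.
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: ssd-pool         # placeholder name
  namespace: rook-ceph
spec:
  failureDomain: host
  deviceClass: ssd
  replicated:
    size: 3
```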