rook: Ceph "Cluster utilization" chart no longer working on dashboard after v18.2.0 upgrade

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior: When navigating to the homepage/dashboard in the Ceph Dashboard UI, the “Cluster utilization” chart is no longer functional. The charts consistently behave as if they have no data.

Expected behavior: Dashboard chart loads and displays current values based on platform metrics.

How to reproduce it (minimal and precise):

While on the latest version of the Rook operator (deployed via Helm), upgrade Ceph from v17 to v18.2.0. No settings were changed other than the image version.

File(s) to submit: Dashboard screenshot (attached) showing the empty “Cluster utilization” chart

  • Cluster CR (custom resource), typically called cluster.yaml, if necessary
spec:
  cephVersion:
    allowUnsupported: false
    image: quay.io/ceph/ceph:v18.2.0
  cleanupPolicy:
    allowUninstallWithVolumes: false
    confirmation: ''
    sanitizeDisks:
      dataSource: zero
      iteration: 1
      method: quick
  continueUpgradeAfterChecksEvenIfNotHealthy: false
  crashCollector:
    disable: false
  dashboard:
    enabled: true
    port: 8443
    ssl: false
  dataDirHostPath: /var/lib/rook
  disruptionManagement:
    managePodBudgets: true
    osdMaintenanceTimeout: 30
    pgHealthCheckTimeout: 0
  external: {}
  healthCheck:
    daemonHealth:
      mon:
        disabled: false
        interval: 45s
      osd:
        disabled: false
        interval: 60s
      status:
        disabled: false
        interval: 60s
    livenessProbe:
      mgr:
        disabled: false
      mon:
        disabled: false
      osd:
        disabled: false
    startupProbe:
      mgr:
        disabled: false
      mon:
        disabled: false
      osd:
        disabled: false
  logCollector:
    enabled: true
    maxLogSize: 500M
    periodicity: daily
  mgr:
    allowMultiplePerNode: false
    count: 2
    modules:
      - enabled: true
        name: pg_autoscaler
  mon:
    allowMultiplePerNode: false
    count: 3
  monitoring:
    enabled: false
  network:
    connections:
      compression:
        enabled: false
      encryption:
        enabled: false
  priorityClassNames:
    mgr: system-cluster-critical
    mon: system-node-critical
    osd: system-node-critical
  removeOSDsIfOutAndSafeToRemove: false
  security:
    kms: {}
  skipUpgradeChecks: false
  storage:
    onlyApplyOSDPlacement: false
    useAllDevices: true
    useAllNodes: true
  waitTimeoutForHealthyOSDInMinutes: 10

Logs to submit: No known evidence of error in operator or manager logs.

Cluster Status to submit: n/a

Environment:

  • OS (e.g. from /etc/os-release): Ubuntu 22.04
  • Kernel (e.g. uname -a): Linux rke2-ceph-0 5.15.0-1041-kvm #46-Ubuntu SMP Fri Aug 25 07:39:11 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Cloud provider or hardware configuration: Proxmox VMs
  • Rook version (use rook version inside of a Rook Pod): v1.12.3
  • Storage backend version (e.g. for ceph do ceph -v): v18.2.0
  • Kubernetes version (use kubectl version): v1.26.7
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): rke2
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):

About this issue

  • Original URL
  • State: closed
  • Created 10 months ago
  • Comments: 15 (11 by maintainers)

Commits related to this issue

Most upvoted comments

@nizamial09 Yes, Rook can set that value for the dashboard. It would be ideal if the operator could automatically detect this from a service with well-known prometheus labels. If that’s not possible, I’m thinking we would need a setting in the CephCluster CR. Even if we could detect the endpoint automatically in some scenarios, this setting could allow overriding the endpoint in case auto detection is not working correctly. For example, the setting could be dashboard.prometheusEndpoint. This setting should also be mentioned in Documentation/ceph-monitoring.md so users enabling prometheus can find it.
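A hedged sketch of how such an override might look in the CephCluster CR. Note that `prometheusEndpoint` here is only the setting proposed in the comment above, not a shipped field at the time of this issue, and the endpoint URL is a placeholder:

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  dashboard:
    enabled: true
    # Hypothetical field per the proposal above: tells the Ceph Dashboard
    # where to reach the Prometheus API when auto-detection is unavailable
    # or incorrect. The service URL is a placeholder for your own endpoint.
    prometheusEndpoint: http://prometheus-operated.monitoring.svc:9090
```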

@rkachach Could you look into this?

Do we need to do anything special for Rook/Ceph to push the metrics, or is it processed automatically?

@electrical: In Rook, there are some ServiceMonitors that you’ll need to create, which watch the cluster. https://rook.io/docs/rook/latest/Storage-Configuration/Monitoring/ceph-monitoring/#prometheus-instances

You can check the doc to get more info.
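As a sketch of the step the linked doc describes, assuming the default `rook-ceph` namespace, `spec.monitoring.enabled: true` on the cluster CR, and a Prometheus Operator already installed, the ServiceMonitor from the Rook examples can be applied directly:

```shell
# Create the ServiceMonitor that tells Prometheus to scrape the
# Ceph mgr metrics endpoint. The manifest path is from the Rook
# repository's deploy/examples/monitoring directory.
kubectl apply -f https://raw.githubusercontent.com/rook/rook/master/deploy/examples/monitoring/service-monitor.yaml
```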

Thank you @nizamial09! It seems we were already collecting the metrics but had to configure the dashboard to connect to the Prometheus API.
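For reference, pointing the dashboard at an existing Prometheus API can be done with the standard `ceph dashboard set-prometheus-api-host` command; the namespace, tools deployment name, and service URL below are assumptions for a typical Rook setup, not taken from this cluster:

```shell
# Run inside the toolbox pod (namespace assumed to be rook-ceph).
# Replace the URL with the address of your own Prometheus instance.
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- \
  ceph dashboard set-prometheus-api-host http://prometheus-operated.monitoring.svc:9090
```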

Okay, then this is probably a bug in the dashboard. We’ll need to test this with the Rook orchestrator to see what went wrong there. In the meantime, you can still use the old dashboard as your default. You can follow this doc, which explains how to switch back to the old dashboard (see the Note section), or simply issue ceph dashboard feature disable dashboard from the toolbox.
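A minimal sketch of that workaround from outside the toolbox shell, assuming the default `rook-ceph` namespace and `rook-ceph-tools` deployment name:

```shell
# Disable the new landing page feature so the dashboard falls back
# to the old (pre-Reef) homepage.
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- \
  ceph dashboard feature disable dashboard
```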

The landing page is still new and we are still improving it. Hopefully we’ll fix all these bugs and release a stable version of it soon.