rook: regular rook-ceph-mgr crashes after update to ceph v17.2.0: too old resource version: 383042908 (383043544)

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior: Since the update to Ceph v17.2.0, the rook orchestrator module in ceph-mgr crashes regularly (every 1-2 hours).

Expected behavior: The rook mgr module keeps running without crashing.

How to reproduce it (minimal and precise):

File(s) to submit:

  • cluster-on-pvc.yml:
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph # namespace:cluster
spec:
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
    allowMultiplePerNode: false
    volumeClaimTemplate:
      spec:
        storageClassName: local-storage-fs
        resources:
          requests:
            storage: 10Gi
  cephVersion:
    image: ceph/ceph:v17.2.0-20220611
    allowUnsupported: false
  skipUpgradeChecks: false
  continueUpgradeAfterChecksEvenIfNotHealthy: false
  mgr:
    count: 1
    modules:
      - name: pg_autoscaler
        enabled: true
  dashboard:
    enabled: true
    ssl: true
  monitoring:
    enabled: true
    rulesNamespace: monitoring
  crashCollector:
    disable: false
  storage:
    storageClassDeviceSets:
    - name: set1
      count: 4
      portable: false
      tuneDeviceClass: false
      tuneFastDeviceClass: false
      encrypted: false
      placement:
        topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - rook-ceph-osd
      preparePlacement:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - rook-ceph-osd
                - key: app
                  operator: In
                  values:
                  - rook-ceph-osd-prepare
              topologyKey: kubernetes.io/hostname
        topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - rook-ceph-osd-prepare
      resources:
        limits:
          cpu: "500m"
          memory: "4Gi"
        requests:
          cpu: "500m"
          memory: "4Gi"
      volumeClaimTemplates:
      - metadata:
          name: data
        spec:
          resources:
            requests:
              storage: 730Gi
          storageClassName: local-storage-block
          volumeMode: Block
          accessModes:
            - ReadWriteOnce
  resources: {}
  priorityClassNames:
    mon: system-node-critical
    osd: system-node-critical
    mgr: system-cluster-critical
  disruptionManagement:
    managePodBudgets: true
    osdMaintenanceTimeout: 30
    pgHealthCheckTimeout: 0
    manageMachineDisruptionBudgets: false
    machineDisruptionBudgetNamespace: openshift-machine-api
  • ceph crash log (see the watch-handling sketch after the log):
{
    "backtrace": [
        "  File \"/usr/share/ceph/mgr/rook/module.py\", line 215, in serve\n    self._apply_drivegroups(list(self._drive_group_map.values()))",
        "  File \"/usr/share/ceph/mgr/rook/module.py\", line 591, in _apply_drivegroups\n    all_hosts = raise_if_exception(self.get_hosts())",
        "  File \"/usr/share/ceph/mgr/orchestrator/_interface.py\", line 228, in raise_if_exception\n    raise e",
        "kubernetes.client.rest.ApiException: ({'type': 'ERROR', 'object': {'api_version': 'v1',\n 'kind': 'Status',\n 'metadata': {'annotations': None,\n              'cluster_name': None,\n              'creation_timestamp': None,\n              'deletion_grace_period_seconds': None,\n              'deletion_timestamp': None,\n              'finalizers': None,\n              'generate_name': None,\n              'generation': None,\n              'initializers': None,\n              'labels': None,\n              'managed_fields': None,\n              'name': None,\n              'namespace': None,\n              'owner_references': None,\n              'resource_version': None,\n              'self_link': None,\n              'uid': None},\n 'spec': None,\n 'status': {'addresses': None,\n            'allocatable': None,\n            'capacity': None,\n            'conditions': None,\n            'config': None,\n            'daemon_endpoints': None,\n            'images': None,\n            'node_info': None,\n            'phase': None,\n            'volumes_attached': None,\n            'volumes_in_use': None}}, 'raw_object': {'kind': 'Status', 'apiVersion': 'v1', 'metadata': {}, 'status': 'Failure', 'message': 'too old resource version: 383042908 (383043544)', 'reason': 'Expired', 'code': 410}})\nReason: None\n"
    ],
    "ceph_version": "17.2.0",
    "crash_id": "2022-06-15T06:37:23.677990Z_a2a89ea9-1313-4271-91d3-d90d067548c2",
    "entity_name": "mgr.a",
    "mgr_module": "rook",
    "mgr_module_caller": "PyModuleRunner::serve",
    "mgr_python_exception": "ApiException",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-mgr",
    "stack_sig": "cf609e0280dc50d3bc27b2814523d911b9d5af3c081b60d3182d78e14f834030",
    "timestamp": "2022-06-15T06:37:23.677990Z",
    "utsname_hostname": "rook-ceph-mgr-a-996b8c79b-l528b",
    "utsname_machine": "x86_64",
    "utsname_release": "5.4.0-89-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#100-Ubuntu SMP Fri Sep 24 14:50:10 UTC 2021"
}
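
The backtrace shows the rook orchestrator module's serve() thread dying because a Kubernetes watch returned HTTP 410 Expired ("too old resource version") and the resulting ApiException was re-raised by raise_if_exception instead of being handled. A 410 only means the resourceVersion the client tried to resume from has already been compacted by the API server; the standard client-side remedy is to drop the stale version, re-list, and restart the watch. Below is a minimal sketch of that retry pattern with the Kubernetes Python client, not the rook module's actual code; the watch_nodes_forever helper, the node resource, and the 60-second timeout are illustrative choices only.

# Minimal sketch (not the rook mgr module's code): restart a Kubernetes watch
# after a 410 "too old resource version" instead of letting the thread crash.
from kubernetes import client, config, watch
from kubernetes.client.rest import ApiException

def watch_nodes_forever() -> None:
    config.load_incluster_config()        # assumes this runs inside the cluster
    v1 = client.CoreV1Api()
    w = watch.Watch()
    resource_version = ""                 # empty -> start from a fresh list

    while True:
        kwargs = {"timeout_seconds": 60}  # illustrative timeout
        if resource_version:
            kwargs["resource_version"] = resource_version
        try:
            for event in w.stream(v1.list_node, **kwargs):
                obj = event["object"]
                # Remember our position so the watch can resume after timeouts.
                resource_version = obj.metadata.resource_version
                print(event["type"], obj.metadata.name)
        except ApiException as exc:
            # Depending on the client version the 410 shows up in exc.status or
            # only in the error body; either way, drop the stale resourceVersion
            # and re-list instead of propagating the exception.
            if exc.status == 410 or "too old resource version" in str(exc):
                resource_version = ""
                continue
            raise

Because no such retry happens here, the exception escapes all the way up to PyModuleRunner::serve and ceph-mgr records it as a module crash; disabling the module (see the comments below) or restarting the mgr only works around the missing retry.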

Environment:

  • OS (e.g. from /etc/os-release): Ubuntu 20.04.4 LTS
  • Kernel (e.g. uname -a): Linux 5.4.0-89-generic #100-Ubuntu
  • Cloud provider or hardware configuration: 3 control plane nodes, 4 worker nodes, rook storage on PVC (lvp), setup via kubeadm (CRI-O, Cilium)
  • Rook version (use rook version inside of a Rook Pod): v1.9.5
  • Storage backend version (e.g. for ceph do ceph -v): ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable) (image: ceph/ceph:v17.2.0-20220611)
  • Kubernetes version (use kubectl version): v1.23.5
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): kubeadm
  • Storage backend status (e.g. for Ceph use ceph health in the [Rook Ceph toolbox](https://rook.io/docs/rook/latest/Troubleshooting/ceph-toolbox/#interactive-toolbox)): Healthy, except when the mgr crash happens

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 15 (7 by maintainers)

Most upvoted comments

I would suggest disabling the rook module then. If you feel like you’re missing some bit of functionality after that, let us know.

ceph orch set backend ""   # disables the orchestrator
ceph mgr module disable rook   # disable the rook module

Looks good. The solution (for now) seems to be not using the module. Thanks for your help.

We’re on v1.9.5 and nothing unusual happened during the upgrade. I’ve removed the mgr deployment, restarted the operator, and let the operator create a new mgr deployment. We’ll monitor the system and post here if anything changes. Thanks for your help.