rook: regular rook-ceph-mgr crashes after update to ceph v17.2.0: too old resource version: 383042908 (383043544)

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior: Since the update to Ceph v17.2.0, the rook orchestrator module in ceph-mgr crashes regularly (every 1-2 hours).

Expected behavior: The rook mgr module keeps running without crashing.

How to reproduce it (minimal and precise):

File(s) to submit:

  • cluster-on-pvc.yml:
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph # namespace:cluster
spec:
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
    allowMultiplePerNode: false
    volumeClaimTemplate:
      spec:
        storageClassName: local-storage-fs
        resources:
          requests:
            storage: 10Gi
  cephVersion:
    image: ceph/ceph:v17.2.0-20220611
    allowUnsupported: false
  skipUpgradeChecks: false
  continueUpgradeAfterChecksEvenIfNotHealthy: false
  mgr:
    count: 1
    modules:
      - name: pg_autoscaler
        enabled: true
  dashboard:
    enabled: true
    ssl: true
  monitoring:
    enabled: true
    rulesNamespace: monitoring
  crashCollector:
    disable: false
  storage:
    storageClassDeviceSets:
    - name: set1
      count: 4
      portable: false
      tuneDeviceClass: false
      tuneFastDeviceClass: false
      encrypted: false
      placement:
        topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - rook-ceph-osd
      preparePlacement:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - rook-ceph-osd
                - key: app
                  operator: In
                  values:
                  - rook-ceph-osd-prepare
              topologyKey: kubernetes.io/hostname
        topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - rook-ceph-osd-prepare
      resources:
        limits:
          cpu: "500m"
          memory: "4Gi"
        requests:
          cpu: "500m"
          memory: "4Gi"
      volumeClaimTemplates:
      - metadata:
          name: data
        spec:
          resources:
            requests:
              storage: 730Gi
          storageClassName: local-storage-block
          volumeMode: Block
          accessModes:
            - ReadWriteOnce
  resources: {}
  priorityClassNames:
    mon: system-node-critical
    osd: system-node-critical
    mgr: system-cluster-critical
  disruptionManagement:
    managePodBudgets: true
    osdMaintenanceTimeout: 30
    pgHealthCheckTimeout: 0
    manageMachineDisruptionBudgets: false
    machineDisruptionBudgetNamespace: openshift-machine-api
  • ceph crash log (see the watch-handling sketch after the log):
{
    "backtrace": [
        "  File \"/usr/share/ceph/mgr/rook/module.py\", line 215, in serve\n    self._apply_drivegroups(list(self._drive_group_map.values()))",
        "  File \"/usr/share/ceph/mgr/rook/module.py\", line 591, in _apply_drivegroups\n    all_hosts = raise_if_exception(self.get_hosts())",
        "  File \"/usr/share/ceph/mgr/orchestrator/_interface.py\", line 228, in raise_if_exception\n    raise e",
        "kubernetes.client.rest.ApiException: ({'type': 'ERROR', 'object': {'api_version': 'v1',\n 'kind': 'Status',\n 'metadata': {'annotations': None,\n              'cluster_name': None,\n              'creation_timestamp': None,\n              'deletion_grace_period_seconds': None,\n              'deletion_timestamp': None,\n              'finalizers': None,\n              'generate_name': None,\n              'generation': None,\n              'initializers': None,\n              'labels': None,\n              'managed_fields': None,\n              'name': None,\n              'namespace': None,\n              'owner_references': None,\n              'resource_version': None,\n              'self_link': None,\n              'uid': None},\n 'spec': None,\n 'status': {'addresses': None,\n            'allocatable': None,\n            'capacity': None,\n            'conditions': None,\n            'config': None,\n            'daemon_endpoints': None,\n            'images': None,\n            'node_info': None,\n            'phase': None,\n            'volumes_attached': None,\n            'volumes_in_use': None}}, 'raw_object': {'kind': 'Status', 'apiVersion': 'v1', 'metadata': {}, 'status': 'Failure', 'message': 'too old resource version: 383042908 (383043544)', 'reason': 'Expired', 'code': 410}})\nReason: None\n"
    ],
    "ceph_version": "17.2.0",
    "crash_id": "2022-06-15T06:37:23.677990Z_a2a89ea9-1313-4271-91d3-d90d067548c2",
    "entity_name": "mgr.a",
    "mgr_module": "rook",
    "mgr_module_caller": "PyModuleRunner::serve",
    "mgr_python_exception": "ApiException",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-mgr",
    "stack_sig": "cf609e0280dc50d3bc27b2814523d911b9d5af3c081b60d3182d78e14f834030",
    "timestamp": "2022-06-15T06:37:23.677990Z",
    "utsname_hostname": "rook-ceph-mgr-a-996b8c79b-l528b",
    "utsname_machine": "x86_64",
    "utsname_release": "5.4.0-89-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#100-Ubuntu SMP Fri Sep 24 14:50:10 UTC 2021"
}
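
The backtrace shows the rook orchestrator module's serve() thread dying because a Kubernetes watch returned HTTP 410 Expired ("too old resource version") and the resulting ApiException was re-raised by raise_if_exception instead of being handled. A 410 only means the resourceVersion the client tried to resume from has already been compacted by the API server; the standard client-side remedy is to drop the stale version, re-list, and restart the watch. Below is a minimal sketch of that retry pattern with the Kubernetes Python client, not the rook module's actual code; the watch_nodes_forever helper, the node resource, and the 60-second timeout are illustrative choices only.

# Minimal sketch (not the rook mgr module's code): restart a Kubernetes watch
# after a 410 "too old resource version" instead of letting the thread crash.
from kubernetes import client, config, watch
from kubernetes.client.rest import ApiException

def watch_nodes_forever() -> None:
    config.load_incluster_config()        # assumes this runs inside the cluster
    v1 = client.CoreV1Api()
    w = watch.Watch()
    resource_version = ""                 # empty -> start from a fresh list

    while True:
        kwargs = {"timeout_seconds": 60}  # illustrative timeout
        if resource_version:
            kwargs["resource_version"] = resource_version
        try:
            for event in w.stream(v1.list_node, **kwargs):
                obj = event["object"]
                # Remember our position so the watch can resume after timeouts.
                resource_version = obj.metadata.resource_version
                print(event["type"], obj.metadata.name)
        except ApiException as exc:
            # Depending on the client version the 410 shows up in exc.status or
            # only in the error body; either way, drop the stale resourceVersion
            # and re-list instead of propagating the exception.
            if exc.status == 410 or "too old resource version" in str(exc):
                resource_version = ""
                continue
            raise

Because no such retry happens here, the exception escapes all the way up to PyModuleRunner::serve and ceph-mgr records it as a module crash; disabling the module (see the comments below) or restarting the mgr only works around the missing retry.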

Environment:

  • OS (e.g. from /etc/os-release): Ubuntu 20.04.4 LTS
  • Kernel (e.g. uname -a): Linux 5.4.0-89-generic #100-Ubuntu
  • Cloud provider or hardware configuration: 3 control plane nodes, 4 worker nodes, rook storage on PVC (lvp), setup via kubeadm (CRI-O, Cilium)
  • Rook version (use rook version inside of a Rook Pod): v1.9.5
  • Storage backend version (e.g. for ceph do ceph -v): ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable) (image: ceph/ceph:v17.2.0-20220611)
  • Kubernetes version (use kubectl version): v1.23.5
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): kubeadm
  • Storage backend status (e.g. for Ceph use ceph health in the [Rook Ceph toolbox](https://rook.io/docs/rook/latest/Troubleshooting/ceph-toolbox/#interactive-toolbox)): Healthy, except when the mgr crash happens

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 15 (7 by maintainers)

Most upvoted comments

I would suggest disabling the rook module then. If you feel like you’re missing some bit of functionality after that, let us know.

ceph orch set backend ""   # disables the orchestrator
ceph mgr module disable rook   # disable the rook module

Looks good. The solution (for now) seems to be not using the module. Thanks for your help.

We’re on v1.9.5 and nothing unusual happened during the upgrade. I’ve removed the mgr deployment, restarted the operator, and let the operator create a new mgr deployment. We’ll monitor the system and post here if anything changes. Thanks for your help.