rook: regular rook-ceph-mgr crashes after update to ceph v17.2.0: too old resource version: 383042908 (383043544)
Is this a bug report or feature request?
- Bug Report
Deviation from expected behavior: Since the update to Ceph v17.2.0 we see regular crashes (every 1-2 hours) of the rook orchestrator module in ceph-mgr.
Expected behavior: The rook mgr module keeps running without crashing.
How to reproduce it (minimal and precise):
File(s) to submit:
- cluster-on-pvc.yml:
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph # namespace:cluster
spec:
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
    allowMultiplePerNode: false
    volumeClaimTemplate:
      spec:
        storageClassName: local-storage-fs
        resources:
          requests:
            storage: 10Gi
  cephVersion:
    image: ceph/ceph:v17.2.0-20220611
    allowUnsupported: false
  skipUpgradeChecks: false
  continueUpgradeAfterChecksEvenIfNotHealthy: false
  mgr:
    count: 1
    modules:
      - name: pg_autoscaler
        enabled: true
  dashboard:
    enabled: true
    ssl: true
  monitoring:
    enabled: true
    rulesNamespace: monitoring
  crashCollector:
    disable: false
  storage:
    storageClassDeviceSets:
      - name: set1
        count: 4
        portable: false
        tuneDeviceClass: false
        tuneFastDeviceClass: false
        encrypted: false
        placement:
          topologySpreadConstraints:
            - maxSkew: 1
              topologyKey: kubernetes.io/hostname
              whenUnsatisfiable: ScheduleAnyway
              labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - rook-ceph-osd
        preparePlacement:
          podAntiAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  labelSelector:
                    matchExpressions:
                      - key: app
                        operator: In
                        values:
                          - rook-ceph-osd
                      - key: app
                        operator: In
                        values:
                          - rook-ceph-osd-prepare
                  topologyKey: kubernetes.io/hostname
          topologySpreadConstraints:
            - maxSkew: 1
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: ScheduleAnyway
              labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - rook-ceph-osd-prepare
        resources:
          limits:
            cpu: "500m"
            memory: "4Gi"
          requests:
            cpu: "500m"
            memory: "4Gi"
        volumeClaimTemplates:
          - metadata:
              name: data
            spec:
              resources:
                requests:
                  storage: 730Gi
              storageClassName: local-storage-block
              volumeMode: Block
              accessModes:
                - ReadWriteOnce
  resources: []
  priorityClassNames:
    mon: system-node-critical
    osd: system-node-critical
    mgr: system-cluster-critical
  disruptionManagement:
    managePodBudgets: true
    osdMaintenanceTimeout: 30
    pgHealthCheckTimeout: 0
    manageMachineDisruptionBudgets: false
    machineDisruptionBudgetNamespace: openshift-machine-api
- ceph crash log:
{
  "backtrace": [
    " File \"/usr/share/ceph/mgr/rook/module.py\", line 215, in serve\n self._apply_drivegroups(list(self._drive_group_map.values()))",
    " File \"/usr/share/ceph/mgr/rook/module.py\", line 591, in _apply_drivegroups\n all_hosts = raise_if_exception(self.get_hosts())",
    " File \"/usr/share/ceph/mgr/orchestrator/_interface.py\", line 228, in raise_if_exception\n raise e",
    "kubernetes.client.rest.ApiException: ({'type': 'ERROR', 'object': {'api_version': 'v1',\n 'kind': 'Status',\n 'metadata': {'annotations': None,\n 'cluster_name': None,\n 'creation_timestamp': None,\n 'deletion_grace_period_seconds': None,\n 'deletion_timestamp': None,\n 'finalizers': None,\n 'generate_name': None,\n 'generation': None,\n 'initializers': None,\n 'labels': None,\n 'managed_fields': None,\n 'name': None,\n 'namespace': None,\n 'owner_references': None,\n 'resource_version': None,\n 'self_link': None,\n 'uid': None},\n 'spec': None,\n 'status': {'addresses': None,\n 'allocatable': None,\n 'capacity': None,\n 'conditions': None,\n 'config': None,\n 'daemon_endpoints': None,\n 'images': None,\n 'node_info': None,\n 'phase': None,\n 'volumes_attached': None,\n 'volumes_in_use': None}}, 'raw_object': {'kind': 'Status', 'apiVersion': 'v1', 'metadata': {}, 'status': 'Failure', 'message': 'too old resource version: 383042908 (383043544)', 'reason': 'Expired', 'code': 410}})\nReason: None\n"
  ],
  "ceph_version": "17.2.0",
  "crash_id": "2022-06-15T06:37:23.677990Z_a2a89ea9-1313-4271-91d3-d90d067548c2",
  "entity_name": "mgr.a",
  "mgr_module": "rook",
  "mgr_module_caller": "PyModuleRunner::serve",
  "mgr_python_exception": "ApiException",
  "os_id": "centos",
  "os_name": "CentOS Stream",
  "os_version": "8",
  "os_version_id": "8",
  "process_name": "ceph-mgr",
  "stack_sig": "cf609e0280dc50d3bc27b2814523d911b9d5af3c081b60d3182d78e14f834030",
  "timestamp": "2022-06-15T06:37:23.677990Z",
  "utsname_hostname": "rook-ceph-mgr-a-996b8c79b-l528b",
  "utsname_machine": "x86_64",
  "utsname_release": "5.4.0-89-generic",
  "utsname_sysname": "Linux",
  "utsname_version": "#100-Ubuntu SMP Fri Sep 24 14:50:10 UTC 2021"
}
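
For context on the backtrace above: the rook orchestrator module talks to the Kubernetes API through the kubernetes Python client, and a watch that resumes from a resourceVersion the API server has already expired is answered with a Status object (reason Expired, code 410, "too old resource version"). The module lets the resulting ApiException escape its serve thread, so ceph-mgr records the module as crashed. The sketch below is illustrative only, not the rook module's actual code (the function name and the choice of watching nodes are assumptions); it shows how such a watch surfaces the 410 and the usual recovery of re-listing and restarting the watch.

# Illustrative sketch only (not the rook mgr module's actual code): a watch
# through the official kubernetes Python client fails with HTTP 410
# ("Expired" / "too old resource version") when it resumes from a stale
# resourceVersion. A long-running caller is expected to catch that and
# restart the watch from a fresh list instead of letting the exception
# kill its serve thread.
from kubernetes import client, config, watch
from kubernetes.client.rest import ApiException


def watch_nodes_forever():
    config.load_incluster_config()  # we are running inside a pod
    v1 = client.CoreV1Api()
    resource_version = None

    while True:
        w = watch.Watch()
        kwargs = {"resource_version": resource_version} if resource_version else {}
        try:
            for event in w.stream(v1.list_node, **kwargs):
                # Remember where we are so the next watch can resume here.
                resource_version = event["object"].metadata.resource_version
                # ... hand the node/host information to the caller ...
        except ApiException as e:
            # Older client versions stuff the raw ERROR event into the
            # exception, so check the message as well as the status code.
            if e.status == 410 or "too old resource version" in str(e):
                resource_version = None  # re-list from scratch and retry
                continue
            raise  # anything else is a real error
        finally:
            w.stop()

Catching the 410, dropping the stored resourceVersion, and re-listing is the standard pattern for any long-lived consumer of the Kubernetes watch API; the backtrace above shows exactly this case going unhandled inside the module.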
Environment:
- OS (e.g. from /etc/os-release): Ubuntu 20.04.4 LTS
- Kernel (e.g. `uname -a`): Linux 5.4.0-89-generic #100-Ubuntu
- Cloud provider or hardware configuration: 3 control plane nodes, 4 worker nodes, rook storage on PVC (lvp), setup via kubeadm (CRI-O, Cilium)
- Rook version (use `rook version` inside of a Rook Pod): v1.9.5
- Storage backend version (e.g. for ceph do `ceph -v`): ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable) (image: ceph/ceph:v17.2.0-20220611)
- Kubernetes version (use `kubectl version`): v1.23.5
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): kubeadm
- Storage backend status (e.g. for Ceph use `ceph health` in the [Rook Ceph toolbox](https://rook.io/docs/rook/latest/Troubleshooting/ceph-toolbox/#interactive-toolbox)): Healthy, except when the mgr crash happens
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 15 (7 by maintainers)
I would suggest disabling the rook module then. If you feel like you’re missing some bit of functionality after that, let us know.
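(For reference: disabling the orchestrator module from the Rook toolbox is normally a matter of clearing the orchestrator backend and then disabling the module, i.e. `ceph orch set backend ""` followed by `ceph mgr module disable rook`; double-check the exact commands against the Rook/Ceph docs for your release.)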
Looks good. The solution (for now) seems to be not using the module. Thanks for your help.
We’re on v1.9.5 and nothing happened while upgrading. I’ve removed the mgr deployment, restarted the operator and let the operator create a new mgr deployment. We’ll monitor the system and post if something changes. Thanks for your help.
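(The kubectl equivalent of those steps, using the deployment names as they appear in this cluster, would be roughly `kubectl -n rook-ceph delete deployment rook-ceph-mgr-a` followed by `kubectl -n rook-ceph rollout restart deployment rook-ceph-operator`; on its next reconcile the operator recreates the mgr deployment.)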