rook: Crash collector deployment fails when 2 ceph clusters are managed by the operator
Is this a bug report or feature request?
- Bug Report
Deviation from expected behavior: We are running 2 Ceph clusters in different namespaces on the same Kubernetes cluster; the Rook operator lives in one of those namespaces. Crash collectors were initially enabled. When we deployed the second Ceph cluster, several crash collectors were scheduled on the same nodes, failed to start, and retried endlessly. Disabling and re-enabling the crash collectors on the first cluster also destabilized them. We ended up disabling the crash collectors on both clusters.
Notes:
- After removing the second cluster, the crash collectors restart correctly on the remaining cluster.
- We did not test running the operator in its own separate namespace to check whether that has any impact.
- The last line of the provided operator log shows an error concerning the crash collector (cause or consequence?).
Expected behavior: A single crash collector should start on each daemon node, for each cluster.
How to reproduce it (minimal and precise):
- deploy the rook-ceph operator
- deploy 2 Ceph clusters with crash collectors disabled: one in the operator namespace, one in another namespace
- enable the crash collector on one cluster (see the CephCluster snippet below)
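For reference, a minimal sketch of the per-cluster toggle in the CephCluster spec (the cluster name and namespace below are placeholders, not taken from our manifests):

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: my-cluster          # placeholder cluster name
  namespace: my-namespace   # namespace seen in the operator log below
spec:
  crashCollector:
    disable: false          # set to true to disable the crash collector for this cluster
```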
File(s) to submit:
- Operator log extract
2022-01-20 10:24:21.877351 I | ceph-crashcollector-controller: crash collector deployment "rook-ceph-crashcollector-10.100.100.26" successfully removed
2022-01-20 10:24:21.877827 I | ceph-spec: object "rook-ceph-crashcollector-10.100.100.26" matched on delete, reconciling
2022-01-20 10:24:21.967687 I | ceph-spec: object "rook-ceph-crashcollector-10.100.100.26" matched on delete, reconciling
2022-01-20 10:24:21.967752 I | ceph-crashcollector-controller: crash collector deployment "rook-ceph-crashcollector-10.100.100.26" successfully removed
2022-01-20 10:24:22.643383 I | op-k8sutil: batch job rook-ceph-osd-prepare-10.100.101.208 deleted
2022-01-20 10:24:22.651395 I | op-osd: started OSD provisioning job for node "10.100.101.208"
2022-01-20 10:24:22.655092 I | op-osd: OSD orchestration status for node 10.100.100.125 is "completed"
2022-01-20 10:24:22.662398 I | op-osd: OSD orchestration status for node 10.100.100.26 is "orchestrating"
2022-01-20 10:24:22.662417 I | op-osd: OSD orchestration status for node 10.100.101.154 is "starting"
2022-01-20 10:24:22.662424 I | op-osd: OSD orchestration status for node 10.100.101.208 is "starting"
2022-01-20 10:24:22.755614 I | ceph-crashcollector-controller: crash collector deployment "rook-ceph-crashcollector-10.100.101.208" successfully removed
2022-01-20 10:24:22.757188 I | ceph-spec: object "rook-ceph-crashcollector-10.100.101.208" matched on delete, reconciling
2022-01-20 10:24:24.561857 I | op-mgr: successful modules: dashboard
2022-01-20 10:24:26.077306 I | op-osd: updating OSD 0 on node "10.100.100.26"
2022-01-20 10:24:26.107790 I | op-osd: OSD orchestration status for node 10.100.101.154 is "orchestrating"
2022-01-20 10:24:26.165883 I | op-osd: OSD orchestration status for node 10.100.100.26 is "completed"
2022-01-20 10:24:26.762425 I | ceph-crashcollector-controller: crash collector deployment "rook-ceph-crashcollector-10.100.101.208" successfully removed
2022-01-20 10:24:26.762756 I | ceph-spec: object "rook-ceph-crashcollector-10.100.101.208" matched on delete, reconciling
2022-01-20 10:24:26.864220 I | ceph-spec: object "rook-ceph-crashcollector-10.100.101.208" matched on delete, reconciling
2022-01-20 10:24:26.864385 I | ceph-crashcollector-controller: crash collector deployment "rook-ceph-crashcollector-10.100.101.208" successfully removed
2022-01-20 10:24:26.943077 I | ceph-crashcollector-controller: crash collector deployment "rook-ceph-crashcollector-10.100.101.208" successfully removed
2022-01-20 10:24:26.943187 I | ceph-spec: object "rook-ceph-crashcollector-10.100.101.208" matched on delete, reconciling
2022-01-20 10:24:26.965778 I | ceph-crashcollector-controller: crash collector deployment "rook-ceph-crashcollector-10.100.101.208" successfully removed
2022-01-20 10:24:26.966555 I | ceph-spec: object "rook-ceph-crashcollector-10.100.101.208" matched on delete, reconciling
2022-01-20 10:24:27.042851 I | ceph-crashcollector-controller: crash collector deployment "rook-ceph-crashcollector-10.100.101.208" successfully removed
2022-01-20 10:24:27.043019 I | ceph-spec: object "rook-ceph-crashcollector-10.100.101.208" matched on delete, reconciling
2022-01-20 10:24:27.063095 I | ceph-crashcollector-controller: crash collector deployment "rook-ceph-crashcollector-10.100.101.208" successfully removed
2022-01-20 10:24:27.063787 I | ceph-spec: object "rook-ceph-crashcollector-10.100.101.208" matched on delete, reconciling
2022-01-20 10:24:27.165642 I | ceph-crashcollector-controller: crash collector deployment "rook-ceph-crashcollector-10.100.101.208" successfully removed
2022-01-20 10:24:27.166536 I | ceph-spec: object "rook-ceph-crashcollector-10.100.101.208" matched on delete, reconciling
2022-01-20 10:24:27.265103 I | ceph-spec: object "rook-ceph-crashcollector-10.100.101.208" matched on delete, reconciling
2022-01-20 10:24:27.265120 I | ceph-crashcollector-controller: crash collector deployment "rook-ceph-crashcollector-10.100.101.208" successfully removed
2022-01-20 10:24:27.368447 I | ceph-crashcollector-controller: crash collector deployment "rook-ceph-crashcollector-10.100.101.208" successfully removed
2022-01-20 10:24:27.368980 I | ceph-spec: object "rook-ceph-crashcollector-10.100.101.208" matched on delete, reconciling
2022-01-20 10:24:27.443025 I | ceph-crashcollector-controller: crash collector deployment "rook-ceph-crashcollector-10.100.101.208" successfully removed
2022-01-20 10:24:27.443291 I | ceph-spec: object "rook-ceph-crashcollector-10.100.101.208" matched on delete, reconciling
2022-01-20 10:24:28.777138 I | op-osd: updating OSD 1 on node "10.100.101.154"
2022-01-20 10:24:28.805537 I | op-osd: OSD orchestration status for node 10.100.101.208 is "orchestrating"
2022-01-20 10:24:31.177486 I | op-osd: updating OSD 2 on node "10.100.101.208"
2022-01-20 10:24:31.208534 I | op-osd: OSD orchestration status for node 10.100.101.154 is "completed"
2022-01-20 10:24:31.960314 I | ceph-crashcollector-controller: crash collector deployment "rook-ceph-crashcollector-10.100.100.26" successfully removed
2022-01-20 10:24:31.960454 I | ceph-spec: object "rook-ceph-crashcollector-10.100.100.26" matched on delete, reconciling
2022-01-20 10:24:33.756117 I | op-osd: updating OSD 3 on node "10.100.100.125"
2022-01-20 10:24:33.790296 I | op-osd: OSD orchestration status for node 10.100.101.208 is "completed"
2022-01-20 10:24:34.473000 I | op-osd: finished running OSDs in namespace "my-namespace"
2022-01-20 10:24:34.473023 I | ceph-cluster-controller: done reconciling ceph cluster in namespace "my-namespace"
2022-01-20 10:24:34.488495 I | ceph-cluster-controller: reconciling ceph cluster in namespace "my-namespace"
2022-01-20 10:24:34.494803 I | op-mon: parsing mon endpoints: a=10.246.246.105:6789
2022-01-20 10:24:34.514506 I | ceph-spec: detecting the ceph image version for image quay.io/ceph/ceph:v16.2.7...
2022-01-20 10:24:36.740760 I | ceph-crashcollector-controller: crash collector deployment "rook-ceph-crashcollector-10.100.101.208" successfully removed
2022-01-20 10:24:36.741620 I | ceph-spec: object "rook-ceph-crashcollector-10.100.101.208" matched on delete, reconciling
2022-01-20 10:24:36.767834 I | ceph-crashcollector-controller: crash collector deployment "rook-ceph-crashcollector-10.100.101.208" successfully removed
2022-01-20 10:24:36.767861 I | ceph-spec: object "rook-ceph-crashcollector-10.100.101.208" matched on delete, reconciling
2022-01-20 10:24:36.799838 I | ceph-crashcollector-controller: crash collector deployment "rook-ceph-crashcollector-10.100.101.208" successfully removed
2022-01-20 10:24:36.802726 I | ceph-spec: object "rook-ceph-crashcollector-10.100.101.208" matched on delete, reconciling
2022-01-20 10:24:36.805695 I | ceph-crashcollector-controller: crash collector deployment "rook-ceph-crashcollector-10.100.101.208" successfully removed
2022-01-20 10:24:36.834360 I | ceph-crashcollector-controller: crash collector deployment "rook-ceph-crashcollector-10.100.101.208" successfully removed
2022-01-20 10:24:36.834373 I | ceph-spec: object "rook-ceph-crashcollector-10.100.101.208" matched on delete, reconciling
2022-01-20 10:24:36.855397 I | ceph-crashcollector-controller: crash collector deployment "rook-ceph-crashcollector-10.100.101.208" successfully removed
2022-01-20 10:24:36.857614 I | ceph-spec: object "rook-ceph-crashcollector-10.100.101.208" matched on delete, reconciling
2022-01-20 10:24:36.864131 E | ceph-crashcollector-controller: node reconcile failed on op "unchanged": Operation cannot be fulfilled on deployments.apps "rook-ceph-crashcollector-10.100.101.208": StorageError: invalid object, Code: 4, Key: /registry/deployments/my-namespace/rook-ceph-crashcollector-10.100.101.208, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 5e589a7e-934c-4a6c-88e0-6399db6f560e, UID in object meta:
Environment:
- OS: EulerOS 2.0 (SP5)
- Kernel: 3.10.0-862.14.1.5.h470.eulerosv2r7.x86_64
- Rook version: 1.8.1
- Storage backend version: 16.2.7
- Kubernetes version: v1.19.10-r1.0.0
- Kubernetes cluster type: Cloud Container Engine (CCE)
You can get the YAML of the leftover deployments: while terminating, the operator still leaves 0-n crash collector deployments behind, and this has to be dealt with for each host.
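For illustration, a hedged sketch of the per-host fields in one of those leftover crash collector Deployments (the name and namespace are taken from the log above; the nodeSelector key is an assumption about how the collector is pinned to its node):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rook-ceph-crashcollector-10.100.101.208   # one Deployment per node, named after the host
  namespace: my-namespace
spec:
  template:
    spec:
      nodeSelector:
        kubernetes.io/hostname: 10.100.101.208    # assumed: keeps the collector on that host
```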