rook: Upgrade from 1.3 to 1.4 to 1.5 fails with loop: Error initializing cluster client: ObjectNotFound('error calling conf_read_file',)
Is this a bug report or feature request?
- Bug Report
Deviation from expected behavior: While following the official docs to upgrade from 1.4 to 1.5, after applying the new common/CRD manifests and updating the operator image (rook-ceph-operator=rook/ceph:v1.5.11), the operator went into an error loop:
2021-05-09 19:22:01.926068 D | ceph-cluster-controller: node watcher: cluster "rook-ceph" is not ready. skipping orchestration
2021-05-09 19:22:05.715044 D | ceph-spec: "ceph-file-controller": CephCluster resource "rook-ceph" found in namespace "rook-ceph"
2021-05-09 19:22:05.715106 D | ceph-spec: "ceph-file-controller": ceph status is "HEALTH_OK", operator is ready to run ceph command, reconciling
2021-05-09 19:22:05.731729 D | op-mon: found existing monitor secrets for cluster rook-ceph
2021-05-09 19:22:05.736536 I | op-mon: parsing mon endpoints: t=10.0.0.13:6789,q=10.0.0.12:6789,p=10.0.0.14:6789,v=10.0.0.11:6789
2021-05-09 19:22:05.736659 D | op-mon: loaded: maxMonID=21, mons=map[p:0xc00167a040 q:0xc00167a000 t:0xc00249ff40 v:0xc00167a0e0], assignment=&{Schedule:map[p:0xc003737400 q:0xc003737440 t:0xc003737480 v:0xc0037374c0]}
2021-05-09 19:22:05.736856 D | exec: Running command: ceph versions --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json --out-file /tmp/412686250
2021-05-09 19:22:05.853310 D | exec: Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
.
2021-05-09 19:22:05.853433 I | ceph-file-controller: skipping reconcile since operator is still initializing
2021-05-09 19:22:13.511056 D | ceph-cluster-controller: node watcher: cluster "rook-ceph" is not ready. skipping orchestration
2021-05-09 19:22:15.854729 D | ceph-spec: "ceph-file-controller": CephCluster resource "rook-ceph" found in namespace "rook-ceph"
2021-05-09 19:22:15.854795 D | ceph-spec: "ceph-file-controller": ceph status is "HEALTH_OK", operator is ready to run ceph command, reconciling
2021-05-09 19:22:15.863813 D | op-mon: found existing monitor secrets for cluster rook-ceph
2021-05-09 19:22:15.868396 I | op-mon: parsing mon endpoints: t=10.0.0.13:6789,q=10.0.0.12:6789,p=10.0.0.14:6789,v=10.0.0.11:6789
2021-05-09 19:22:15.868492 D | op-mon: loaded: maxMonID=21, mons=map[p:0xc002d5c340 q:0xc002d5c300 t:0xc002d5c2c0 v:0xc002d5c380], assignment=&{Schedule:map[p:0xc0025cc200 q:0xc0025cc240 t:0xc0025cc280 v:0xc0025cc2c0]}
2021-05-09 19:22:15.868663 D | exec: Running command: ceph versions --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json --out-file /tmp/323332353
2021-05-09 19:22:15.958244 D | ceph-cluster-controller: hot-plug cm watcher: only reconcile on hot plug cm changes, this "cdi-controller-leader-election-helper" cm is handled by another watcher
2021-05-09 19:22:15.980750 D | exec: Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
.
2021-05-09 19:22:15.980981 I | ceph-file-controller: skipping reconcile since operator is still initializing
2021-05-09 19:22:21.317155 D | ceph-cluster-controller: node watcher: cluster "rook-ceph" is not ready. skipping orchestration
Aside from the operator looping, the cluster stayed healthy and kept serving as normal.
When I try to run ceph versions on the operator or the toolbox, I also get Error initializing cluster client: ObjectNotFound('error calling conf_read_file',). On the toolbox, /var/lib/rook doesn't exist; on the operator it exists but is empty, as is /etc/ceph.
Reverting to rook-ceph-operator=rook/ceph:v1.4.9 made the operator work again, but running ceph health results in the same ObjectNotFound on both the operator and the toolbox. Reverting the toolbox to rook/ceph:v1.3.9 lets me talk to the cluster again.
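For reference, the checks above look roughly like this (a sketch, assuming the default rook-ceph namespace and the standard rook-ceph-tools toolbox deployment):
kubectl -n rook-ceph exec deploy/rook-ceph-operator -- ls -la /var/lib/rook /etc/ceph
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph versions
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health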
Expected behavior:
After applying the rook-ceph-operator=rook/ceph:v1.5.11 image, I expected the cluster to continue upgrading to v1.5.11.
How to reproduce it (minimal and precise):
- I had the cluster running on v1.3.5 for a long while; realizing how far behind it was, and wanting to move on from Nautilus, I decided to follow the 1.3 > 1.4 > 1.5 upgrade path.
- 1.3 > 1.4.9 went without any issues, and I let the cluster sit for some time.
- Started following the official docs: applied the new common and CRD YAMLs and upgraded the operator image to rook/ceph:v1.5.11 (roughly the commands sketched after this list).
- After being re-scheduled, the operator went into the loop described above.
- Reverting the operator to 1.4.9 made it work again, but I get the same ObjectNotFound on the operator or toolbox pod if I try to run ceph health. Reverting the toolbox to 1.3.9 lets me talk to the cluster again.
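The upgrade/revert steps above map to roughly these commands (a sketch; the exact manifest file names are whatever the v1.5 upgrade doc ships as example manifests):
kubectl apply -f common.yaml -f crds.yaml        # new common/CRD manifests from the v1.5 examples
kubectl -n rook-ceph set image deploy/rook-ceph-operator rook-ceph-operator=rook/ceph:v1.5.11
# revert:
kubectl -n rook-ceph set image deploy/rook-ceph-operator rook-ceph-operator=rook/ceph:v1.4.9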
Environment:
- OS (e.g. from /etc/os-release): K8S on bare metal (Ubuntu 18.04)
- Kernel (e.g. uname -a): Linux artemis 4.15.0-142-generic #146-Ubuntu SMP Tue Apr 13 01:11:19 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
- Cloud provider or hardware configuration: K8S via kubeadm on Dell R720xd’s
- Rook version (use rook version inside of a Rook Pod): rook: v1.4.9, go: go1.13.8
- Storage backend version (e.g. for ceph do ceph -v):
  - On the mons/osds: ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable)
  - On the operator: ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable)
- Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:25:06Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): K8S on bare metal
- Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):
  - With the v1.3.9 toolbox image: HEALTH_OK
  - With v1.4.9 or v1.5.11: Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 17 (7 by maintainers)
I consider it a good sign that we are exposing this failure message on the cephcluster CR. IMO, that is the best place we can expose errors like this for maximum k8s user visibility. Perhaps we should alter the PHASE to be Error or Warning instead of Progressing, however. Some changes are already present: the dependency changes to CephCluster CRs I made in 1.7 will also expose that error as a reconcile error too. That should help for upgrade cases like this one.
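For illustration, the phase and message surface on the CR status and can be checked with something like this (assuming the default namespace and cluster name):
kubectl -n rook-ceph get cephcluster rook-ceph
# the PHASE and MESSAGE printer columns come from .status.phase / .status.message
kubectl -n rook-ceph get cephcluster rook-ceph -o jsonpath='{.status.phase}: {.status.message}'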
In addition (@leseb will know more), changes we plan for v1.8 will verify the mon count in the validating webhook, which should eventually be enabled for all clusters; with that change in place, a mon count of 4 would be rejected. It’s unclear to me how the webhook will behave when a check added in v1.8 makes previously allowed configs invalid. Perhaps that is something Seb should be sure to test?
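For reference, a quick way to check and correct the mon count on an existing cluster (a sketch, assuming the default rook-ceph namespace and cluster name) would be something like:
kubectl -n rook-ceph get cephcluster rook-ceph -o jsonpath='{.spec.mon.count}'
# switch to an odd count such as 3; the operator should reconcile the mons down to the new count
kubectl -n rook-ceph patch cephcluster rook-ceph --type merge -p '{"spec":{"mon":{"count":3}}}'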