rook: Upgrade from 1.3 to 1.4 to 1.5 fails with loop: Error initializing cluster client: ObjectNotFound('error calling conf_read_file',)
Is this a bug report or feature request?
- Bug Report
Deviation from expected behavior: While following the official docs to upgrade from 1.4 to 1.5, after applying the new common/CRD manifests and updating the operator image (rook-ceph-operator=rook/ceph:v1.5.11), the operator went into an error loop:
2021-05-09 19:22:01.926068 D | ceph-cluster-controller: node watcher: cluster "rook-ceph" is not ready. skipping orchestration
2021-05-09 19:22:05.715044 D | ceph-spec: "ceph-file-controller": CephCluster resource "rook-ceph" found in namespace "rook-ceph"
2021-05-09 19:22:05.715106 D | ceph-spec: "ceph-file-controller": ceph status is "HEALTH_OK", operator is ready to run ceph command, reconciling
2021-05-09 19:22:05.731729 D | op-mon: found existing monitor secrets for cluster rook-ceph
2021-05-09 19:22:05.736536 I | op-mon: parsing mon endpoints: t=10.0.0.13:6789,q=10.0.0.12:6789,p=10.0.0.14:6789,v=10.0.0.11:6789
2021-05-09 19:22:05.736659 D | op-mon: loaded: maxMonID=21, mons=map[p:0xc00167a040 q:0xc00167a000 t:0xc00249ff40 v:0xc00167a0e0], assignment=&{Schedule:map[p:0xc003737400 q:0xc003737440 t:0xc003737480 v:0xc0037374c0]}
2021-05-09 19:22:05.736856 D | exec: Running command: ceph versions --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json --out-file /tmp/412686250
2021-05-09 19:22:05.853310 D | exec: Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
.
2021-05-09 19:22:05.853433 I | ceph-file-controller: skipping reconcile since operator is still initializing
2021-05-09 19:22:13.511056 D | ceph-cluster-controller: node watcher: cluster "rook-ceph" is not ready. skipping orchestration
2021-05-09 19:22:15.854729 D | ceph-spec: "ceph-file-controller": CephCluster resource "rook-ceph" found in namespace "rook-ceph"
2021-05-09 19:22:15.854795 D | ceph-spec: "ceph-file-controller": ceph status is "HEALTH_OK", operator is ready to run ceph command, reconciling
2021-05-09 19:22:15.863813 D | op-mon: found existing monitor secrets for cluster rook-ceph
2021-05-09 19:22:15.868396 I | op-mon: parsing mon endpoints: t=10.0.0.13:6789,q=10.0.0.12:6789,p=10.0.0.14:6789,v=10.0.0.11:6789
2021-05-09 19:22:15.868492 D | op-mon: loaded: maxMonID=21, mons=map[p:0xc002d5c340 q:0xc002d5c300 t:0xc002d5c2c0 v:0xc002d5c380], assignment=&{Schedule:map[p:0xc0025cc200 q:0xc0025cc240 t:0xc0025cc280 v:0xc0025cc2c0]}
2021-05-09 19:22:15.868663 D | exec: Running command: ceph versions --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json --out-file /tmp/323332353
2021-05-09 19:22:15.958244 D | ceph-cluster-controller: hot-plug cm watcher: only reconcile on hot plug cm changes, this "cdi-controller-leader-election-helper" cm is handled by another watcher
2021-05-09 19:22:15.980750 D | exec: Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
.
2021-05-09 19:22:15.980981 I | ceph-file-controller: skipping reconcile since operator is still initializing
2021-05-09 19:22:21.317155 D | ceph-cluster-controller: node watcher: cluster "rook-ceph" is not ready. skipping orchestration
Aside from the operator looping, the cluster stayed healthy and kept serving as normal.
When I try to run ceph versions on the operator or the toolbox, I also get Error initializing cluster client: ObjectNotFound('error calling conf_read_file',). On the toolbox, /var/lib/rook doesn't exist; on the operator it exists but is empty, as is /etc/ceph.
Reverting to rook-ceph-operator=rook/ceph:v1.4.9 made the operator work again, but running ceph health results in the same ObjectNotFound on both the operator and the toolbox. Reverting the toolbox to rook/ceph:v1.3.9 lets me talk to the cluster again.
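For reference, the checks above look roughly like this (a sketch, assuming the default rook-ceph namespace and the standard rook-ceph-tools toolbox deployment):
kubectl -n rook-ceph exec deploy/rook-ceph-operator -- ls -la /var/lib/rook /etc/ceph
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph versions
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health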
Expected behavior:
After applying the rook-ceph-operator=rook/ceph:v1.5.11 image, I expected the cluster to continue upgrading to v1.5.11.
How to reproduce it (minimal and precise):
- I had the cluster running on v1.3.5 for a long while; realizing how far behind it was, and wanting to move on from Nautilus, I decided to follow the 1.3 > 1.4 > 1.5 upgrade path.
- 1.3 > 1.4.9 went without any issues, and I let the cluster sit for some time.
- Started following the official docs: applied the new common and CRD YAMLs and upgraded the operator image to rook/ceph:v1.5.11 (roughly the commands sketched after this list).
- After being re-scheduled, the operator went into the loop described above.
- Reverting the operator to 1.4.9 made it work again, but I get the same ObjectNotFound on the operator or toolbox pod if I try to run ceph health. Reverting the toolbox to 1.3.9 lets me talk to the cluster again.
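The upgrade/revert steps above map to roughly these commands (a sketch; the exact manifest file names are whatever the v1.5 upgrade doc ships as example manifests):
kubectl apply -f common.yaml -f crds.yaml        # new common/CRD manifests from the v1.5 examples
kubectl -n rook-ceph set image deploy/rook-ceph-operator rook-ceph-operator=rook/ceph:v1.5.11
# revert:
kubectl -n rook-ceph set image deploy/rook-ceph-operator rook-ceph-operator=rook/ceph:v1.4.9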
Environment:
- OS (e.g. from /etc/os-release): K8S on bare metal (Ubuntu 18.04)
- Kernel (e.g. uname -a): Linux artemis 4.15.0-142-generic #146-Ubuntu SMP Tue Apr 13 01:11:19 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
- Cloud provider or hardware configuration: K8S via kubeadm on Dell R720xd’s
- Rook version (use rook version inside of a Rook Pod): rook: v1.4.9, go: go1.13.8
- Storage backend version (e.g. for ceph do ceph -v):
  - On the mons/osds: ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable)
  - On the operator: ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable)
- Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:25:06Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): K8S on bare metal
- Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):
  - With the v1.3.9 toolbox image: HEALTH_OK
  - With v1.4.9 or v1.5.11: Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 17 (7 by maintainers)
I consider it a good sign that we are exposing this failure message on the cephcluster CR. IMO, that is the best place we can expose errors like this for maximum k8s user visibility. Perhaps we should alter the PHASE to be Error or Warning instead of Progressing, however. Some changes are already present: the dependency changes to CephCluster CRs I made in 1.7 will also expose that error as a reconcile error too. That should help for upgrade cases like this one.
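For illustration, the phase and message surface on the CR status and can be checked with something like this (assuming the default namespace and cluster name):
kubectl -n rook-ceph get cephcluster rook-ceph
# the PHASE and MESSAGE printer columns come from .status.phase / .status.message
kubectl -n rook-ceph get cephcluster rook-ceph -o jsonpath='{.status.phase}: {.status.message}'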
In addition (@leseb will know more), changes we plan for v1.8 will verify the mon count in the validating webhook, which should eventually be enabled for all clusters; with that change in place, a mon count of 4 would be rejected. It’s unclear to me how the webhook will behave when a check added in v1.8 makes previously allowed configs invalid. Perhaps that is something Seb should be sure to test?
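For reference, a quick way to check and correct the mon count on an existing cluster (a sketch, assuming the default rook-ceph namespace and cluster name) would be something like:
kubectl -n rook-ceph get cephcluster rook-ceph -o jsonpath='{.spec.mon.count}'
# switch to an odd count such as 3; the operator should reconcile the mons down to the new count
kubectl -n rook-ceph patch cephcluster rook-ceph --type merge -p '{"spec":{"mon":{"count":3}}}'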