rook: Failure on ceph pg dump pgs_brief -> PGDumpBrief unmarshal issue
Is this a bug report or feature request?
- Bug Report
Deviation from expected behavior: A failed node (which has been removed and cannot be re-added) appears to send the rook-operator into a loop in which it never removes the OSD / node, even after the node has been removed from the cluster list.
From the operator logs:
2019-05-15 02:02:41.646661 I | exec: Running command: ceph pg dump pgs_brief --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json --out-file /tmp/230761221
2019-05-15 02:02:42.169023 I | exec: dumped pgs_brief
2019-05-15 02:02:42.169942 I | util: retrying after 15s, last error: failed to unmarshal pg dump response: json: cannot unmarshal object into Go value of type []client.PGDumpBrief
Expected behavior: I expect that either the pgs_brief output can be successfully unmarshaled, or that the system would eventually abandon the effort and forcibly remove the OSD (since I have `replicas: 3` and can tolerate the loss of the OSD's data if for some reason it is truly corrupted?)
How to reproduce it (minimal and precise):
- have a cluster with `useAllNodes: false`
- remove a node from the map
- see `ceph status` at HEALTHY (and the OSD as "out")
- delete the node's host machine and all data
- rook goes into a `cannot unmarshal object into Go value of type []client.PGDumpBrief` loop - perhaps this message is unrelated? Need to read more about PGDumpBrief to know exactly what's broken here
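One way to tell whether the message is related is to re-run the dump and check the top-level shape of the JSON Ceph actually emits. A small hedged sketch (the file path is hypothetical; the operator's real `--out-file` in the log above is a temp file):

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// firstNonSpace returns the first non-whitespace byte; for valid JSON this
// reveals the top-level type: '[' for an array, '{' for an object.
func firstNonSpace(b []byte) byte {
	for _, c := range b {
		if c != ' ' && c != '\t' && c != '\n' && c != '\r' {
			return c
		}
	}
	return 0
}

func main() {
	// Hypothetical path: point this at the output of
	// `ceph pg dump pgs_brief --format json --out-file <path>`.
	data, err := os.ReadFile("/tmp/pgs_brief.json")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if !json.Valid(data) {
		fmt.Fprintln(os.Stderr, "not valid JSON")
		os.Exit(1)
	}
	switch firstNonSpace(data) {
	case '[':
		fmt.Println("top level is an array: decodes into []client.PGDumpBrief")
	case '{':
		fmt.Println("top level is an object: would reproduce the unmarshal error")
	}
}
```

If the top level is an object, the loop is a decoding mismatch rather than anything specific to the failed node.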
Environment:
- OS (e.g. from /etc/os-release): coreos 2079.3.0
- Kernel (e.g. `uname -a`): Linux kubesail-k8s-master 4.19.34-coreos #1 SMP Mon Apr 22 20:32:34 -00 2019 x86_64 Intel® Xeon® Gold 6140 CPU @ 2.30GHz GenuineIntel GNU/Linux
- Cloud provider or hardware configuration: DigitalOcean "16gb"
- Rook version (use `rook version` inside of a Rook Pod): rook: v1.0.1
- Kubernetes version (use `kubectl version`): Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:11:31Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"windows/amd64"} Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:02:58Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): kubeadm
- Storage backend status (e.g. for Ceph use `ceph health` in the Rook Ceph toolbox): HEALTH_OK
About this issue
- State: closed
- Created 5 years ago
- Comments: 17 (9 by maintainers)
Looks like that is the recommended method:
https://github.com/rook/rook/blob/master/Documentation/ceph-upgrade.md#patch-release-upgrades