rook: Failure on ceph pg dump pgs_brief -> PGDumpBrief unmarshal issue

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior: It appears a failed node (which has been removed and cannot be re-added) can send the rook operator into a loop in which it never removes the OSD / node, even after the node has been removed from the cluster's node list.

From the operator logs:

2019-05-15 02:02:41.646661 I | exec: Running command: ceph pg dump pgs_brief --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json --out-file /tmp/230761221
2019-05-15 02:02:42.169023 I | exec: dumped pgs_brief
2019-05-15 02:02:42.169942 I | util: retrying after 15s, last error: failed to unmarshal pg dump response: json: cannot unmarshal object into Go value of type []client.PGDumpBrief
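The unmarshal error itself is suggestive: rook decodes the response into a []client.PGDumpBrief slice, but the command returned a JSON object rather than an array. Ceph Nautilus is reported to have changed the shape of ceph pg dump output in exactly this way, wrapping the PG list in an object (under a pg_stats key, if I understand the change correctly), so a parser that accepts both shapes would avoid the loop. A minimal sketch of that idea in Go; the struct fields and the pg_stats key are assumptions for illustration, not rook's actual code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pgBrief is a guess at the per-PG fields pgs_brief reports
// (illustrative only; client.PGDumpBrief may differ).
type pgBrief struct {
	ID    string `json:"pgid"`
	State string `json:"state"`
}

// nautilusDump models the object-wrapped format: the PG list sits
// under a key instead of being the top-level array.
type nautilusDump struct {
	PgReady bool      `json:"pg_ready"`
	PgStats []pgBrief `json:"pg_stats"`
}

// parsePGDumpBrief accepts both the legacy bare-array output and the
// newer object-wrapped output of `ceph pg dump pgs_brief --format json`.
func parsePGDumpBrief(raw []byte) ([]pgBrief, error) {
	// Pre-Nautilus: the whole response is a JSON array.
	var pgs []pgBrief
	if err := json.Unmarshal(raw, &pgs); err == nil {
		return pgs, nil
	}
	// Nautilus and later: an object wraps the array, which is exactly
	// what produces "cannot unmarshal object into Go value of type
	// []client.PGDumpBrief" when only the array form is expected.
	var dump nautilusDump
	if err := json.Unmarshal(raw, &dump); err != nil {
		return nil, fmt.Errorf("failed to unmarshal pg dump response: %v", err)
	}
	return dump.PgStats, nil
}

func main() {
	legacy := []byte(`[{"pgid":"1.0","state":"active+clean"}]`)
	wrapped := []byte(`{"pg_ready":true,"pg_stats":[{"pgid":"1.0","state":"active+clean"}]}`)
	for _, raw := range [][]byte{legacy, wrapped} {
		pgs, err := parsePGDumpBrief(raw)
		fmt.Println(pgs, err)
	}
}
```

With shape-tolerant parsing like this, the object form that currently defeats the decode would be handled, and the operator could proceed to its OSD-removal logic instead of retrying forever.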

Expected behavior: I expect either that pgs_brief can be successfully unmarshaled, or that the system would eventually abandon the effort and forcibly remove the OSD (since I have replicas: 3, I can tolerate losing the OSD's data if it is truly corrupted).
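On the second half of that expectation: the "retrying after 15s" line above suggests the retry helper never gives up on this error. A bounded-retry sketch of the behavior I would expect instead; the helper name and signature are hypothetical, not rook's actual util package:

```go
package main

import (
	"fmt"
	"time"
)

// retryWithLimit retries fn up to maxAttempts times, then surfaces the
// last error instead of looping forever. (Hypothetical helper for
// illustration; rook's real retry utility may differ.)
func retryWithLimit(maxAttempts int, delay time.Duration, fn func() error) error {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if lastErr = fn(); lastErr == nil {
			return nil
		}
		if attempt < maxAttempts {
			fmt.Printf("retrying after %s, last error: %v\n", delay, lastErr)
			time.Sleep(delay)
		}
	}
	return fmt.Errorf("giving up after %d attempts: %v", maxAttempts, lastErr)
}

func main() {
	err := retryWithLimit(3, time.Second, func() error {
		return fmt.Errorf("failed to unmarshal pg dump response")
	})
	// Once the retry budget is exhausted, the caller could fall back,
	// e.g. to forcibly purging the out OSD.
	fmt.Println(err)
}
```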

How to reproduce it (minimal and precise):

  1. have a cluster with useAllNodes: false
  2. remove the node from the nodes map (see the config sketch after these steps)
  3. see ceph status report HEALTH_OK (and the OSD marked "out")
  4. delete the node's host machine and all of its data
  5. rook goes into a cannot unmarshal object into Go value of type []client.PGDumpBrief loop
  6. perhaps this message is unrelated? Need to read more about PGDumpBrief to know exactly what's broken here
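For concreteness on steps 1–2, the "map" here is the nodes list in the CephCluster spec; the node names below are placeholders:

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  storage:
    useAllNodes: false
    nodes:
    - name: node-a
    - name: node-b   # step 2: delete this entry to remove the node
```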

Environment:

  • OS (e.g. from /etc/os-release): coreos 2079.3.0
  • Kernel (e.g. uname -a): Linux kubesail-k8s-master 4.19.34-coreos #1 SMP Mon Apr 22 20:32:34 -00 2019 x86_64 Intel® Xeon® Gold 6140 CPU @ 2.30GHz GenuineIntel GNU/Linux
  • Cloud provider or hardware configuration: DigitalOcean “16gb”
  • Rook version (use rook version inside of a Rook Pod): rook: v1.0.1
  • Kubernetes version (use kubectl version): Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:11:31Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"windows/amd64"} Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:02:58Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): kubeadm
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): HEALTH_OK
