rook: Failure on ceph pg dump pgs_brief -> PGDumpBrief unmarshal issue

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior: It appears a failed node (which has been removed and cannot be re-added) can send the rook operator into a loop in which it never removes the OSD / node, even after the node has been removed from the cluster's node list.

From the operator logs:

2019-05-15 02:02:41.646661 I | exec: Running command: ceph pg dump pgs_brief --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json --out-file /tmp/230761221
2019-05-15 02:02:42.169023 I | exec: dumped pgs_brief
2019-05-15 02:02:42.169942 I | util: retrying after 15s, last error: failed to unmarshal pg dump response: json: cannot unmarshal object into Go value of type []client.PGDumpBrief
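The unmarshal error itself is suggestive: rook decodes the response into a []client.PGDumpBrief slice, but the command returned a JSON object rather than an array. Ceph Nautilus is reported to have changed the shape of ceph pg dump output in exactly this way, wrapping the PG list in an object (under a pg_stats key, if I understand the change correctly), so a parser that accepts both shapes would avoid the loop. A minimal sketch of that idea in Go; the struct fields and the pg_stats key are assumptions for illustration, not rook's actual code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pgBrief is a guess at the per-PG fields pgs_brief reports
// (illustrative only; client.PGDumpBrief may differ).
type pgBrief struct {
	ID    string `json:"pgid"`
	State string `json:"state"`
}

// nautilusDump models the object-wrapped format: the PG list sits
// under a key instead of being the top-level array.
type nautilusDump struct {
	PgReady bool      `json:"pg_ready"`
	PgStats []pgBrief `json:"pg_stats"`
}

// parsePGDumpBrief accepts both the legacy bare-array output and the
// newer object-wrapped output of `ceph pg dump pgs_brief --format json`.
func parsePGDumpBrief(raw []byte) ([]pgBrief, error) {
	// Pre-Nautilus: the whole response is a JSON array.
	var pgs []pgBrief
	if err := json.Unmarshal(raw, &pgs); err == nil {
		return pgs, nil
	}
	// Nautilus and later: an object wraps the array, which is exactly
	// what produces "cannot unmarshal object into Go value of type
	// []client.PGDumpBrief" when only the array form is expected.
	var dump nautilusDump
	if err := json.Unmarshal(raw, &dump); err != nil {
		return nil, fmt.Errorf("failed to unmarshal pg dump response: %v", err)
	}
	return dump.PgStats, nil
}

func main() {
	legacy := []byte(`[{"pgid":"1.0","state":"active+clean"}]`)
	wrapped := []byte(`{"pg_ready":true,"pg_stats":[{"pgid":"1.0","state":"active+clean"}]}`)
	for _, raw := range [][]byte{legacy, wrapped} {
		pgs, err := parsePGDumpBrief(raw)
		fmt.Println(pgs, err)
	}
}
```

With shape-tolerant parsing like this, the object form that currently defeats the decode would be handled, and the operator could proceed to its OSD-removal logic instead of retrying forever.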

Expected behavior: I expect either that pgs_brief can be successfully unmarshaled, or that the system would eventually abandon the effort and forcibly remove the OSD (since I have replicas: 3, I can tolerate losing the OSD's data if it is truly corrupted).
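On the second half of that expectation: the "retrying after 15s" line above suggests the retry helper never gives up on this error. A bounded-retry sketch of the behavior I would expect instead; the helper name and signature are hypothetical, not rook's actual util package:

```go
package main

import (
	"fmt"
	"time"
)

// retryWithLimit retries fn up to maxAttempts times, then surfaces the
// last error instead of looping forever. (Hypothetical helper for
// illustration; rook's real retry utility may differ.)
func retryWithLimit(maxAttempts int, delay time.Duration, fn func() error) error {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if lastErr = fn(); lastErr == nil {
			return nil
		}
		if attempt < maxAttempts {
			fmt.Printf("retrying after %s, last error: %v\n", delay, lastErr)
			time.Sleep(delay)
		}
	}
	return fmt.Errorf("giving up after %d attempts: %v", maxAttempts, lastErr)
}

func main() {
	err := retryWithLimit(3, time.Second, func() error {
		return fmt.Errorf("failed to unmarshal pg dump response")
	})
	// Once the retry budget is exhausted, the caller could fall back,
	// e.g. to forcibly purging the out OSD.
	fmt.Println(err)
}
```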

How to reproduce it (minimal and precise):

  1. have a cluster with useAllNodes: false
  2. remove the node from the nodes map (see the config sketch after these steps)
  3. see ceph status report HEALTH_OK (and the OSD marked "out")
  4. delete the node's host machine and all of its data
  5. rook goes into a cannot unmarshal object into Go value of type []client.PGDumpBrief loop
  6. perhaps this message is unrelated? Need to read more about PGDumpBrief to know exactly what's broken here
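For concreteness on steps 1–2, the "map" here is the nodes list in the CephCluster spec; the node names below are placeholders:

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  storage:
    useAllNodes: false
    nodes:
    - name: node-a
    - name: node-b   # step 2: delete this entry to remove the node
```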

Environment:

  • OS (e.g. from /etc/os-release): coreos 2079.3.0
  • Kernel (e.g. uname -a): Linux kubesail-k8s-master 4.19.34-coreos #1 SMP Mon Apr 22 20:32:34 -00 2019 x86_64 Intel® Xeon® Gold 6140 CPU @ 2.30GHz GenuineIntel GNU/Linux
  • Cloud provider or hardware configuration: DigitalOcean “16gb”
  • Rook version (use rook version inside of a Rook Pod): rook: v1.0.1
  • Kubernetes version (use kubectl version): Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:11:31Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"windows/amd64"} Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:02:58Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): kubeadm
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): HEALTH_OK
