rook: Update 0.9.3 -> 1.0.0: all PGs unknown, mgr waiting for OSDs

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior: After updating the rook version from 0.9.3 to 1.0.0 following the upgrade guide, the cluster is in a HEALTH_WARN state and all PGs are marked as unknown. The logs of the mgr pod show Not sending PG status to monitor yet, waiting for OSDs.
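
A quick way to confirm the symptom (assuming the default rook-ceph namespace and an mgr deployment named rook-ceph-mgr-a):

  # Check the mgr log for the PG status message
  kubectl -n rook-ceph logs deployment/rook-ceph-mgr-a | grep "waiting for OSDs"

  # From the Rook toolbox, confirm that all PGs are reported as unknown
  ceph pg stat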

Expected behavior: After updating the rook version from 0.9.3 to 1.0.0, the cluster should be in a HEALTH_OK state and all PGs should be active+clean.

How to reproduce it (minimal and precise):

  • Install a rook cluster using 0.9.3 and, e.g., Ceph 13.2.5.
  • Follow the upgrade guide to update to 1.0.0 (the core image changes are sketched below).
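
For context, the image bumps at the heart of the 0.9 → 1.0 upgrade look roughly like this; the namespaces, resource names, and container name are the defaults from the example manifests and are assumptions here, and the guide's CRD/RBAC updates still have to be applied separately:

  # Point the operator deployment at the 1.0.0 release
  kubectl -n rook-ceph-system set image deploy/rook-ceph-operator rook-ceph-operator=rook/ceph:v1.0.0

  # Point the CephCluster CR at the desired Ceph image (here the Mimic release installed above)
  kubectl -n rook-ceph patch CephCluster rook-ceph --type merge \
    -p '{"spec": {"cephVersion": {"image": "ceph/ceph:v13.2.5"}}}'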

Environment:

  • OS (e.g. from /etc/os-release): Debian 9.9.0, also on CoreOS
  • Kernel (e.g. uname -a): Linux daniel-test 4.9.0-9-amd64 #1 SMP Debian 4.9.168-1 (2019-04-12) x86_64 GNU/Linux
  • Cloud provider or hardware configuration: Kubernetes bare-metal test cluster with 1 OSD using an 8 GB drive (/dev/sdb).
  • Rook version (use rook version inside of a Rook Pod): rook: v1.0.0
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.4", GitCommit:"c27b913fddd1a6c480c229191a087698aa92f0b1", GitTreeState:"clean", BuildDate:"2019-02-28T13:37:52Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.4", GitCommit:"c27b913fddd1a6c480c229191a087698aa92f0b1", GitTreeState:"clean", BuildDate:"2019-02-28T13:30:26Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): bare-metal
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):
2019-05-08 09:06:39.274 7fec92ffd700 -1 --2- 10.32.3.1:0/2947321591 >> v2:10.32.2.5:6790/0 conn(0x7fec9414ca90 0x7fec9414ee90 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0)._handle_peer_banner peer v2:10.32.2.5:6790/0 is using msgr V1 protocol
2019-05-08 09:06:39.274 7fec93fff700 -1 --2- 10.32.3.1:0/2947321591 >> v2:10.32.2.22:6790/0 conn(0x7fec9414f580 0x7fec94151a50 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0)._handle_peer_banner peer v2:10.32.2.22:6790/0 is using msgr V1 protocol
2019-05-08 09:06:39.314 7fec93fff700 -1 --2-  >> v2:10.32.2.22:6790/0 conn(0x7fec94156500 0x7fec9415f290 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0)._handle_peer_banner peer v2:10.32.2.22:6790/0 is using msgr V1 protocol
2019-05-08 09:06:39.318 7fec937fe700 -1 --2- 10.32.3.1:0/2390400519 >> v2:10.32.2.5:6790/0 conn(0x7fec941575d0 0x7fec941555e0 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0)._handle_peer_banner peer v2:10.32.2.5:6790/0 is using msgr V1 protocol
HEALTH_WARN Reduced data availability: 800 pgs inactive; mons a,b are low on available space

ceph status output:

2019-05-08 09:08:17.634 7ffabb082700 -1 --2-  >> v2:10.32.2.22:6790/0 conn(0x7ffabc066ab0 0x7ffabc066f50 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0)._handle_peer_banner peer v2:10.32.2.22:6790/0 is using msgr V1 protocol
2019-05-08 09:08:17.682 7ffabb883700 -1 --2-  >> v2:10.32.2.22:6790/0 conn(0x7ffabc154310 0x7ffabc15d0a0 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0)._handle_peer_banner peer v2:10.32.2.22:6790/0 is using msgr V1 protocol
2019-05-08 09:08:17.686 7ffabb082700 -1 --2- 10.32.3.1:0/2301690069 >> v2:10.32.2.5:6790/0 conn(0x7ffabc1553e0 0x7ffabc1533f0 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0)._handle_peer_banner peer v2:10.32.2.5:6790/0 is using msgr V1 protocol
  cluster:
    id:     5de51fcd-ed80-47ac-8add-a979a6be3b86
    health: HEALTH_WARN
            Reduced data availability: 800 pgs inactive
            mons a,b are low on available space

  services:
    mon: 2 daemons, quorum b,a
    mgr: a(active)
    mds: ceph-fs-1/1/1 up  {0=ceph-fs-a=up:active}, 1 up:standby-replay
    osd: 1 osds: 1 up, 1 in

  data:
    pools:   8 pools, 800 pgs
    objects: 0  objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             800 unknown

Log Files:

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 19 (5 by maintainers)

Most upvoted comments

Hi. I was also plagued by this issue (Rook 1.0.4, Ceph v13.2.6-20190604).

After some investigating I found out the manager reports the IP address of the default gateway (the IP of tun0 on the node) to the mons, as @scroogie also mentioned. I forced the manager to bind to the pod’s IP and everything seems to be working fine now. This is what I did to fix/work around the issue:

  1. Edit the deployment: kubectl edit deployment rook-ceph-mgr-a
  2. Add an environment variable to the container section:
        - name: ROOK_POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
  3. Add an extra argument to the container args: - --public-addr=$(ROOK_POD_IP)
  4. Wait for the manager pods to be recreated

Not sure how long this will survive once the operator starts recreating or ‘fixing’ stuff again.
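
For clarity, this is roughly where the two pieces end up in the rook-ceph-mgr-a deployment; the container name and the surrounding fields are assumptions based on a typical Rook 1.0 mgr deployment, so check against your own manifest:

    spec:
      containers:
        - name: mgr                         # container name assumed; verify in your deployment
          args:
            # existing args stay as they are; only the public-addr flag is added
            - --public-addr=$(ROOK_POD_IP)
          env:
            - name: ROOK_POD_IP
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: status.podIP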

We worked around this by using host networking for now. I’d have to look up the final combination of Ceph and Rook versions we ended up using; we tried so many. Unfortunately I don’t have detailed logs, because we were in a hurry to get it working, but I had the impression that the containers were picking up the wrong network address from somewhere. If I remember correctly, there is a file created somewhere which is supposed to contain the correct mon and mds IPs, but it never corresponded to the real IPs. Either one of the services had a different IP, or a daemon was trying to listen on the wrong IP, which led to an even weirder error message.
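
For reference, a minimal sketch of the host-networking workaround on a Rook 1.0 CephCluster CR; the field names follow the 1.0 example manifests (the network section changed in later Rook releases), and the Ceph image tag is illustrative:

  apiVersion: ceph.rook.io/v1
  kind: CephCluster
  metadata:
    name: rook-ceph
    namespace: rook-ceph
  spec:
    cephVersion:
      image: ceph/ceph:v13.2.5     # illustrative tag
    dataDirHostPath: /var/lib/rook
    network:
      hostNetwork: true            # run the Ceph daemons in the host network namespace
    # mon/storage settings unchanged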