rook: Update 0.9.3 -> 1.0.0: all PGs unknown, mgr waiting for OSDs
Is this a bug report or feature request?
- Bug Report
Deviation from expected behavior:
After updating the rook version from 0.9.3 to 1.0.0 following the upgrade guide, the cluster is in a HEALTH_WARN state and all PGs are marked as unknown.
The logs of the mgr pod show "Not sending PG status to monitor yet, waiting for OSDs".
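For reference, this is roughly how the mgr log line above can be pulled (a minimal sketch, assuming the default rook-ceph namespace and the standard app=rook-ceph-mgr label):

    # tail the mgr log and filter for the PG status message
    kubectl -n rook-ceph logs -l app=rook-ceph-mgr --tail=200 | grep "waiting for OSDs"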
Expected behavior:
After updating the rook version from 0.9.3 to 1.0.0 the cluster should be in a HEALTH_OK state and all PGs should be active+clean.
How to reproduce it (minimal and precise):
- Install a rook cluster using 0.9.3 and, e.g., ceph 13.2.5.
- Follow the upgrade guide to update to 1.0.0 (see the sketch below).
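A minimal sketch of the central upgrade step, assuming the default rook-ceph-system operator namespace and the deployment/container names from the 0.9.x example manifests (the actual guide includes further steps such as RBAC and CRD updates; this only shows the operator image bump):

    # point the operator at the v1.0.0 image; the operator then rolls out the new Ceph daemon pods
    kubectl -n rook-ceph-system set image deploy/rook-ceph-operator rook-ceph-operator=rook/ceph:v1.0.0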
Environment:
- OS (e.g. from /etc/os-release): Debian 9.9.0, also on CoreOS
- Kernel (e.g. uname -a): Linux daniel-test 4.9.0-9-amd64 #1 SMP Debian 4.9.168-1 (2019-04-12) x86_64 GNU/Linux
- Cloud provider or hardware configuration: Kubernetes bare-metal test cluster with 1 OSD using a drive (/dev/sdb) of 8 GB
- Rook version (use rook version inside of a Rook Pod): rook: v1.0.0
- Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.4", GitCommit:"c27b913fddd1a6c480c229191a087698aa92f0b1", GitTreeState:"clean", BuildDate:"2019-02-28T13:37:52Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.4", GitCommit:"c27b913fddd1a6c480c229191a087698aa92f0b1", GitTreeState:"clean", BuildDate:"2019-02-28T13:30:26Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): bare-metal
- Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):
2019-05-08 09:06:39.274 7fec92ffd700 -1 --2- 10.32.3.1:0/2947321591 >> v2:10.32.2.5:6790/0 conn(0x7fec9414ca90 0x7fec9414ee90 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0)._handle_peer_banner peer v2:10.32.2.5:6790/0 is using msgr V1 protocol
2019-05-08 09:06:39.274 7fec93fff700 -1 --2- 10.32.3.1:0/2947321591 >> v2:10.32.2.22:6790/0 conn(0x7fec9414f580 0x7fec94151a50 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0)._handle_peer_banner peer v2:10.32.2.22:6790/0 is using msgr V1 protocol
2019-05-08 09:06:39.314 7fec93fff700 -1 --2- >> v2:10.32.2.22:6790/0 conn(0x7fec94156500 0x7fec9415f290 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0)._handle_peer_banner peer v2:10.32.2.22:6790/0 is using msgr V1 protocol
2019-05-08 09:06:39.318 7fec937fe700 -1 --2- 10.32.3.1:0/2390400519 >> v2:10.32.2.5:6790/0 conn(0x7fec941575d0 0x7fec941555e0 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0)._handle_peer_banner peer v2:10.32.2.5:6790/0 is using msgr V1 protocol
HEALTH_WARN Reduced data availability: 800 pgs inactive; mons a,b are low on available space
ceph status output:
2019-05-08 09:08:17.634 7ffabb082700 -1 --2- >> v2:10.32.2.22:6790/0 conn(0x7ffabc066ab0 0x7ffabc066f50 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0)._handle_peer_banner peer v2:10.32.2.22:6790/0 is using msgr V1 protocol
2019-05-08 09:08:17.682 7ffabb883700 -1 --2- >> v2:10.32.2.22:6790/0 conn(0x7ffabc154310 0x7ffabc15d0a0 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0)._handle_peer_banner peer v2:10.32.2.22:6790/0 is using msgr V1 protocol
2019-05-08 09:08:17.686 7ffabb082700 -1 --2- 10.32.3.1:0/2301690069 >> v2:10.32.2.5:6790/0 conn(0x7ffabc1553e0 0x7ffabc1533f0 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0)._handle_peer_banner peer v2:10.32.2.5:6790/0 is using msgr V1 protocol
  cluster:
    id:     5de51fcd-ed80-47ac-8add-a979a6be3b86
    health: HEALTH_WARN
            Reduced data availability: 800 pgs inactive
            mons a,b are low on available space

  services:
    mon: 2 daemons, quorum b,a
    mgr: a(active)
    mds: ceph-fs-1/1/1 up {0=ceph-fs-a=up:active}, 1 up:standby-replay
    osd: 1 osds: 1 up, 1 in

  data:
    pools:   8 pools, 800 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             800 unknown
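Given the suspicion (discussed in the comments below) that the mgr registered the wrong address, one way to compare the address the active mgr reported to the mons with its actual pod IP is sketched here, assuming the default rook-ceph namespace and the rook-ceph-tools toolbox pod with label app=rook-ceph-tools:

    # address the active mgr registered with the mons
    TOOLS_POD=$(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}')
    kubectl -n rook-ceph exec -it "$TOOLS_POD" -- ceph mgr dump | grep addr
    # actual pod IP of the mgr, for comparison
    kubectl -n rook-ceph get pod -l app=rook-ceph-mgr -o wide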
Log Files:
About this issue
- State: closed
- Created 5 years ago
- Comments: 19 (5 by maintainers)
Commits related to this issue
- mgr: set public-addr flag for MGR This fixes that the MGR is binding to the gateway IP instead of the actual Pod IP, as seen in #3136. When hostNetwork: true is set the public-addrr is not set so cus... — committed to galexrt/rook by galexrt 5 years ago
- Merge pull request #3647 from galexrt/backport_fix_3136 Backport fix for #3136 — committed to rook/rook by galexrt 5 years ago
- mgr: set public-addr flag for MGR This fixes that the MGR is binding to the gateway IP instead of the actual Pod IP, as seen in #3136. When hostNetwork: true is set the public-addrr is not set so cus... — committed to leseb/rook by galexrt 5 years ago
Hi. I was also plagued by this issue (Rook 1.0.4, Ceph v13.2.6-20190604).
After some investigating I found out the manager reports the IP address of the default gateway (the IP of tun0 on the node) to the mons, like @scroogie also mentioned. I forced the manager to bind to the pod's IP and everything seems to be working fine now. This is what I did to fix/work around the issue:
Run kubectl edit deployment rook-ceph-mgr-a and add --public-addr=$(ROOK_POD_IP) to the mgr args. Not sure how long this will survive once the operator starts recreating or 'fixing' stuff.
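For illustration, a hypothetical excerpt of what the edited rook-ceph-mgr-a Deployment could look like, assuming the mgr container is named mgr and that ROOK_POD_IP is populated from the downward API (exact container names and surrounding args depend on the rook version):

    # excerpt of spec.template.spec.containers in the rook-ceph-mgr-a Deployment
    containers:
    - name: mgr
      args:
      - --public-addr=$(ROOK_POD_IP)   # make the mgr bind/advertise the pod IP, not the gateway IP
      env:
      - name: ROOK_POD_IP              # downward API: the pod's own IP
        valueFrom:
          fieldRef:
            fieldPath: status.podIP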
We worked around this by using host networking for now. I'd have to look up the final combination of Ceph and rook versions we ended up using, we tried so many. Unfortunately I don't have detailed logs, because we were in a hurry to get it working, but I had the impression that the containers were picking up the wrong network address from somewhere. If I remember correctly there is a file created somewhere which is supposed to contain the correct mon and mds IPs, but this never corresponded to the real IPs. Either one of the services had a different IP or a daemon was trying to listen on the wrong IP, which led to an even weirder error message.
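For context, a sketch of how the host networking workaround is switched on in the CephCluster CR, with field names assumed from the rook v1.0 example cluster.yaml (later releases moved this toggle to network.provider); the values here are illustrative, not the reporter's actual manifest:

    apiVersion: ceph.rook.io/v1
    kind: CephCluster
    metadata:
      name: rook-ceph
      namespace: rook-ceph
    spec:
      cephVersion:
        image: ceph/ceph:v13.2.5       # matching the Ceph version from this report
      dataDirHostPath: /var/lib/rook
      mon:
        count: 3
        allowMultiplePerNode: false
      network:
        hostNetwork: true              # the workaround: Ceph daemons use the host network instead of pod IPs
      storage:
        useAllNodes: true
        useAllDevices: false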