rook: 1.11.0 broken with hostNetwork due to new ports on pods
- Bug Report
Deviation from expected behavior: With 1.11.0, every OSD pod declares port 6800 (and possibly 6801). If host networking is enabled, the OSDs run directly on the host network, where only one pod can bind a given port on a given interface; since the MGR pod already claims port 6800, only one OSD or MGR pod can start on any given node (see the sketch below).
Expected behavior: You should be able to run many OSDs on the same node without Kubernetes refusing to start them because of port unavailability.
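For illustration, here is a hedged sketch of the relevant part of an OSD pod spec under host networking in 1.11.0; the container name and exact fields are assumptions, not copied from an actual Rook-generated deployment. With `hostNetwork: true`, Kubernetes defaults each declared containerPort to a matching hostPort, so the scheduler will not place a second pod claiming the same port on the node.

```yaml
# Illustrative fragment only -- names and field selection are assumptions.
spec:
  hostNetwork: true
  containers:
    - name: osd
      ports:
        - containerPort: 6800   # clashes with the mgr, which also claims 6800
          protocol: TCP
        - containerPort: 6801
          protocol: TCP
# With hostNetwork: true, the declared ports are effectively host ports,
# so only one pod per node can own 6800.
```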
How to reproduce it (minimal and precise):
- Install Rook 1.11.0 with spec.network.provider: host and multiple OSDs on one host (a minimal spec sketch follows)
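A minimal CephCluster sketch of that configuration follows; everything other than `network.provider: host` (image tag, paths, storage selection) is a plausible placeholder, not taken from the reporter's cluster.

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v17.2.5   # matches the Ceph version reported below
  dataDirHostPath: /var/lib/rook
  network:
    provider: host        # host networking -- the setting that triggers the conflict
  storage:
    useAllNodes: true
    useAllDevices: true   # nodes with several devices get multiple OSD pods
```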
Cluster Status to submit:
    # ceph -s
      cluster:
        id:     cb82340a-2eaf-4597-b83e-cc0e62a9d019
        health: HEALTH_WARN
                no active mgr
                Degraded data redundancy: 19877/91515 objects degraded (21.720%), 69 pgs degraded

      services:
        mon: 3 daemons, quorum b,c,d (age 19h)
        mgr: no daemons active (since 18h)
        mds: 1/1 daemons up, 1 hot standby
        osd: 7 osds: 6 up (since 60m), 6 in (since 12h)
        rgw: 1 daemon active (1 hosts, 1 zones)

      data:
        volumes: 1/1 healthy
        pools:   12 pools, 193 pgs
        objects: 30.50k objects, 75 GiB
        usage:   190 GiB used, 5.1 TiB / 5.2 TiB avail
        pgs:     19877/91515 objects degraded (21.720%)
                 3090/91515 objects misplaced (3.376%)
                 66 active+undersized+degraded
                 54 active+clean
                 50 active+undersized
                 18 stale+active+clean
                 3  active+undersized+degraded+remapped+backfilling
                 1  stale+active+remapped+backfilling
                 1  active+remapped+backfilling
Environment:
- OS (e.g. from /etc/os-release): Ubuntu 22.04.2 LTS
- Kernel (e.g. `uname -a`): 5.15.0-60-generic #66-Ubuntu SMP
- Cloud provider or hardware configuration: 6 bare metal nodes
- Rook version (use `rook version` inside of a Rook Pod): rook: v1.11.0-14.gd70d8ad60
- Storage backend version (e.g. for ceph do `ceph -v`): ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
- Kubernetes version (use `kubectl version`): Server Version: v1.25.6+k3s1
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): k3s
- Storage backend status (e.g. for Ceph use `ceph health` in the Rook Ceph toolbox): HEALTH_WARN no active mgr; Degraded data redundancy: 19877/91515 objects degraded (21.720%), 69 pgs degraded
NOTE: Per #11792 I’m already using rook/ceph:v1.11.0-14.gd70d8ad60. Since that fix worked for the OP and his issue was closed, I’m starting this new issue.
About this issue
- State: closed
- Created a year ago
- Comments: 20 (10 by maintainers)
The `mgr-b` issue #11791 is the only outstanding issue I still have. When 1.11.1 comes out I’ll test again. I’ll close this issue now. Appreciate the support.
Yes, the issue in the title is already fixed by #11797 and is planned for release tomorrow in v1.11.1. Several different issues have been discussed here, such as the mgr readiness probe (tracked by #11791). @reefland Shall we close this issue, or what remains after those two specific issues are fixed?
This allowed both OSDs to come online and allowed a mgr to get started:
Some endpoints showed up:
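(The endpoint listing itself isn't reproduced here; assuming the default `rook-ceph` namespace, a check like the following would show them.)

```console
kubectl -n rook-ceph get endpoints
```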
The operator log had a stream of:
Then I updated the Deployment for OSD4 to remove the ports, allowing the backup mgr to come online:
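A hedged sketch of that workaround; the deployment name and container index are assumptions based on "OSD4", and the operator may re-add the ports on its next reconcile.

```console
kubectl -n rook-ceph patch deployment rook-ceph-osd-4 --type=json \
  -p='[{"op":"remove","path":"/spec/template/spec/containers/0/ports"}]'
```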
This allowed the operator to move forward; I’m not sure what is important to look for.
This no longer hangs:
OK, I have a repro now. I will fix the ports on the mgr pod as well so they are only added when the multi-cluster network option is enabled, similar to #11797 for the OSDs.
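For contrast, this is roughly what that multi-cluster (multus) network configuration looks like in the CephCluster spec, based on Rook's documented multus support; the NetworkAttachmentDefinition names are placeholders. Per the fix described above, this is the only case in which the explicit daemon ports should remain declared.

```yaml
# Sketch of Rook's multus configuration -- selector values are placeholder
# NetworkAttachmentDefinition names, not from a real cluster.
network:
  provider: multus
  selectors:
    public: public-net
    cluster: cluster-net
```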