rook: OSD Processor -- bind unable to bind to IP on any port in range 6800-7300: (99) Cannot assign requested address

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:

I’m running OpenShift v3.11.82 on 3 nodes using RHEL7, with FLEXVOLUME_DIR_PATH set to /usr/libexec/kubernetes/kubelet-plugins/volume/exec (no other changes to the *.yml files). I’m encountering the OSD error “Processor -- bind unable to bind to v2:IP:7300/2027 on any port in range 6800-7300: (99) Cannot assign requested address” on all 3 of my OSDs.

Expected behavior:

The OSD should bind to the address and start listening on a port in the range; Rook + Ceph should work as expected.

How to reproduce it (minimal and precise):

Environment:

  • OS (e.g. from /etc/os-release): RHEL7
  • Kernel (e.g. uname -a): Linux node2.system10.vlan124.mcp 3.10.0-957.10.1.el7.x86_64 #1 SMP Thu Feb 7 07:12:53 UTC 2019 x86_64 x86_64 x86_64 GN
  • Cloud provider or hardware configuration: On-prem (VMware-based)
  • Rook version (use rook version inside of a Rook Pod): 1.0.0
  • Kubernetes version (use kubectl version): OpenShift v3.11.82 (i.e. Kubernetes v1.11)
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): OpenShift
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):

LOGS

rook-ceph-osd-0

2019-05-08 17:01:38.770518 I | rookcmd: starting Rook v1.0.0-13.g05b0166 with arguments '/rook/rook ceph osd start -- --foreground --id 0 --osd-uuid c05a22dc-97e5-4463-b4e6-ebd9b00d0f2e --conf /var/lib/rook/osd0/rook-ceph.config --cluster ceph --default-log-to-file false'
2019-05-08 17:01:38.776349 I | rookcmd: flag values: --help=false, --log-flush-frequency=5s, --log-level=INFO, --osd-id=0, --osd-store-type=bluestore, --osd-uuid=c05a22dc-97e5-4463-b4e6-ebd9b00d0f2e
2019-05-08 17:01:38.776480 I | op-mon: parsing mon endpoints: 
2019-05-08 17:01:38.776577 W | op-mon: ignoring invalid monitor 
2019-05-08 17:01:38.784163 I | exec: Running command: stdbuf -oL ceph-volume lvm activate --no-systemd --bluestore 0 c05a22dc-97e5-4463-b4e6-ebd9b00d0f2e
2019-05-08 17:01:39.683863 I | Running command: /bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-0
2019-05-08 17:01:40.012259 I | Running command: /usr/sbin/restorecon /var/lib/ceph/osd/ceph-0
2019-05-08 17:01:40.312664 I | Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-0
2019-05-08 17:01:40.648273 I | Running command: /bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-aaa9dfa1-6146-4a9d-9b19-781bf0f71abe/osd-data-06f7e1bb-336c-4b7f-860f-39ce2217ef5c --path /var/lib/ceph/osd/ceph-0 --no-mon-config
2019-05-08 17:01:41.200711 I | Running command: /bin/ln -snf /dev/ceph-aaa9dfa1-6146-4a9d-9b19-781bf0f71abe/osd-data-06f7e1bb-336c-4b7f-860f-39ce2217ef5c /var/lib/ceph/osd/ceph-0/block
2019-05-08 17:01:41.508309 I | Running command: /bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-0/block
2019-05-08 17:01:41.813803 I | Running command: /bin/chown -R ceph:ceph /dev/mapper/ceph--aaa9dfa1--6146--4a9d--9b19--781bf0f71abe-osd--data--06f7e1bb--336c--4b7f--860f--39ce2217ef5c
2019-05-08 17:01:42.097564 I | Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-0
2019-05-08 17:01:42.388105 I | --> ceph-volume lvm activate successful for osd ID: 0
2019-05-08 17:01:42.400127 I | exec: Running command: ceph-osd --foreground --id 0 --osd-uuid c05a22dc-97e5-4463-b4e6-ebd9b00d0f2e --conf /var/lib/rook/osd0/rook-ceph.config --cluster ceph --default-log-to-file false
2019-05-08 17:01:43.178290 I | 2019-05-08 17:01:43.177 7fe1fd660d80 -1 Falling back to public interface
2019-05-08 17:01:43.829540 I | 2019-05-08 17:01:43.829 7fe1fd660d80 -1 osd.0 69 log_to_monitors {default=true}
2019-05-08 17:01:43.847126 I | 2019-05-08 17:01:43.846 7fe1efef6700 -1 osd.0 69 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
2019-05-08 17:01:54.275215 I | 2019-05-08 17:01:54.274 7fe1e5ee2700 -1  Processor -- bind unable to bind to v2:10.130.0.1:7300/2027 on any port in range 6800-7300: (99) Cannot assign requested address
2019-05-08 17:01:54.275251 I | 2019-05-08 17:01:54.274 7fe1e5ee2700 -1  Processor -- bind was unable to bind. Trying again in 5 seconds 
2019-05-08 17:01:59.291147 I | 2019-05-08 17:01:59.290 7fe1e5ee2700 -1  Processor -- bind unable to bind to v2:10.130.0.1:7300/2027 on any port in range 6800-7300: (99) Cannot assign requested address
2019-05-08 17:01:59.291193 I | 2019-05-08 17:01:59.290 7fe1e5ee2700 -1  Processor -- bind was unable to bind. Trying again in 5 seconds 
2019-05-08 17:02:04.308557 I | 2019-05-08 17:02:04.308 7fe1e5ee2700 -1  Processor -- bind unable to bind to v2:10.130.0.1:7300/2027 on any port in range 6800-7300: (99) Cannot assign requested address
2019-05-08 17:02:04.308596 I | 2019-05-08 17:02:04.308 7fe1e5ee2700 -1  Processor -- bind was unable to bind after 3 attempts: (99) Cannot assign requested address
2019-05-08 17:02:04.320883 I | 2019-05-08 17:02:04.320 7fe1e5ee2700 -1  Processor -- bind unable to bind to v2:10.130.0.1:7300/2027 on any port in range 6800-7300: (99) Cannot assign requested address
2019-05-08 17:02:04.320920 I | 2019-05-08 17:02:04.320 7fe1e5ee2700 -1  Processor -- bind was unable to bind. Trying again in 5 seconds 
2019-05-08 17:02:09.332874 I | 2019-05-08 17:02:09.332 7fe1e5ee2700 -1  Processor -- bind unable to bind to v2:10.130.0.1:7300/2027 on any port in range 6800-7300: (99) Cannot assign requested address
2019-05-08 17:02:09.332911 I | 2019-05-08 17:02:09.332 7fe1e5ee2700 -1  Processor -- bind was unable to bind. Trying again in 5 seconds 
2019-05-08 17:02:14.345173 I | 2019-05-08 17:02:14.344 7fe1e5ee2700 -1  Processor -- bind unable to bind to v2:10.130.0.1:7300/2027 on any port in range 6800-7300: (99) Cannot assign requested address
2019-05-08 17:02:14.345214 I | 2019-05-08 17:02:14.344 7fe1e5ee2700 -1  Processor -- bind was unable to bind after 3 attempts: (99) Cannot assign requested address
2019-05-08 17:02:14.359483 I | 2019-05-08 17:02:14.358 7fe1e5ee2700 -1  Processor -- bind unable to bind to v2:10.130.0.1:7300/2027 on any port in range 6800-7300: (99) Cannot assign requested address
2019-05-08 17:02:14.359524 I | 2019-05-08 17:02:14.358 7fe1e5ee2700 -1  Processor -- bind was unable to bind. Trying again in 5 seconds 
2019-05-08 17:02:19.372082 I | 2019-05-08 17:02:19.371 7fe1e5ee2700 -1  Processor -- bind unable to bind to v2:10.130.0.1:7300/2027 on any port in range 6800-7300: (99) Cannot assign requested address
2019-05-08 17:02:19.372114 I | 2019-05-08 17:02:19.371 7fe1e5ee2700 -1  Processor -- bind was unable to bind. Trying again in 5 seconds 
2019-05-08 17:02:24.385045 I | 2019-05-08 17:02:24.384 7fe1e5ee2700 -1  Processor -- bind unable to bind to v2:10.130.0.1:7300/2027 on any port in range 6800-7300: (99) Cannot assign requested address
2019-05-08 17:02:24.385079 I | 2019-05-08 17:02:24.384 7fe1e5ee2700 -1  Processor -- bind was unable to bind after 3 attempts: (99) Cannot assign requested address
2019-05-08 17:02:24.385338 I | 2019-05-08 17:02:24.384 7fe1f3786700 -1 received  signal: Interrupt from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
2019-05-08 17:02:24.385351 I | 2019-05-08 17:02:24.384 7fe1f3786700 -1 osd.0 74 *** Got signal Interrupt ***

osd-0 config

[global]
fsid                      = e9f14792-30db-427d-8537-73e5b73a91ac
run dir                   = /var/lib/rook/osd0
mon initial members       = c a b
mon host                  = v1:172.30.246.255:6789,v1:172.30.152.231:6789,v1:172.30.117.179:6789
public addr               = 10.130.0.19
cluster addr              = 10.130.0.19
mon keyvaluedb            = rocksdb
mon_allow_pool_delete     = true
mon_max_pg_per_osd        = 1000
debug default             = 0
debug rados               = 0
debug mon                 = 0
debug osd                 = 0
debug bluestore           = 0
debug filestore           = 0
debug journal             = 0
debug leveldb             = 0
filestore_omap_backend    = rocksdb
osd pg bits               = 11
osd pgp bits              = 11
osd pool default size     = 1
osd pool default min size = 1
osd pool default pg num   = 100
osd pool default pgp num  = 100
osd objectstore           = filestore
crush location            = root=default host=node2-system10-vlan124-mcp
rbd_default_features      = 3
fatal signal handlers     = false

[osd.0]
keyring              = /var/lib/ceph/osd/ceph-0/keyring
bluestore block path = /var/lib/ceph/osd/ceph-0/block
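
For reference, the address Ceph is trying to bind (10.130.0.1 in the log) can be compared with what is actually assigned inside the OSD pod with something like the following; the pod name is a placeholder, and this assumes the ip tool is present in the ceph image:

# Replace the pod name with the real rook-ceph-osd-0 pod from `kubectl -n rook-ceph get pods`.
kubectl -n rook-ceph exec rook-ceph-osd-0-<hash> -- ip -4 addr show
# Compare with the address from the failing bind and with the generated config:
kubectl -n rook-ceph exec rook-ceph-osd-0-<hash> -- grep 'addr' /var/lib/rook/osd0/rook-ceph.config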

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 2
  • Comments: 52 (23 by maintainers)

Most upvoted comments

@travisn, I’m still getting the error with 14.2.4, as in comment-563671883

For this container image and args:

  osd:
    Container ID: ____
    Image:         ___/ceph:v14.2.4-20190917
    Image ID:   ___
    Port:          <none>
    Host Port:     <none>
    Command:
      ceph-osd
    Args:
      --foreground
      --id
      1
      --conf
      /var/lib/docker/persistent/rook/osd1/kube-system.config
      --osd-data
      /var/lib/docker/persistent/rook/osd1
      --keyring
      /var/lib/docker/persistent/rook/osd1/keyring
      --cluster
      kube-system
      --osd-uuid
      ffda19d8-51ae-4b76-974f-6f0bdc64c1a4
      --default-log-to-file
      false
      --ms-learn-addr-from-peer=false

osd pod logs are:

2020-02-20 11:24:13.796 7f7faa09fdc0 -1 Falling back to public interface
2020-02-20 11:24:13.815 7f7faa09fdc0 -1  Processor -- bind unable to bind to v2:10.244.2.138:7300/0 on any port in range 6800-7300: (99) Cannot assign requested address
2020-02-20 11:24:13.815 7f7faa09fdc0 -1  Processor -- bind was unable to bind. Trying again in 5 seconds
2020-02-20 11:24:18.831 7f7faa09fdc0 -1  Processor -- bind unable to bind to v2:10.244.2.138:7300/0 on any port in range 6800-7300: (99) Cannot assign requested address
2020-02-20 11:24:18.831 7f7faa09fdc0 -1  Processor -- bind was unable to bind. Trying again in 5 seconds
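
(A quick sanity check that --ms-learn-addr-from-peer=false really made it into the pod spec; pod name and namespace below are placeholders:)

# Dump the args of the OSD container; adjust the container index if your pod has more than one.
kubectl -n <namespace> get pod <osd-pod> -o jsonpath='{.spec.containers[0].args}'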

@travisn It seems that the issue is not a port conflict, but the ceph-osd daemon picking the wrong IP address.

In my scenario above, the config file contained the right value for the pod IP:

+ kubectl -n rook-ceph exec rook-ceph-osd-2-86d8b7cbd9-brm6w -- cat /var/lib/rook/osd2/rook-ceph.config
[global]
...
public addr               = 10.34.0.9
cluster addr              = 10.34.0.9
...

but the OSD process, somewhere in pick_address(), decided to use 10.34.0.0 instead.
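
The address an OSD actually registered can be confirmed from the toolbox, e.g. (toolbox pod name is a placeholder):

# The v2:/v1: addresses listed for osd.2 should match the pod IP (10.34.0.9 here), not 10.34.0.0.
kubectl -n rook-ceph exec rook-ceph-tools-<hash> -- ceph osd dump | grep 'osd\.2 '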

Also, in pick_address.cc the code looks at other config values (public_network, for instance) that are not set in the config file passed to ceph-osd.

Perhaps sprinkling log statements in pick_address.cc and comparing the outputs of ceph-osd with hostNetwork: true (where it works) and hostNetwork: false could be of help …
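
For that comparison, host networking can be toggled on the CephCluster; a rough sketch, assuming the stock v1.0 example names (namespace rook-ceph, cluster rook-ceph) and noting that the operator may need to recreate the OSD deployments for the change to take effect:

# Enable host networking on the cluster spec (or set network.hostNetwork: true in cluster.yaml before deploying).
kubectl -n rook-ceph patch cephcluster rook-ceph --type merge \
  -p '{"spec":{"network":{"hostNetwork":true}}}'
# Then compare the ceph-osd logs of the hostNetwork pods with the failing ones above.
kubectl -n rook-ceph logs -l app=rook-ceph-osd --tail=50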

It works! Thanks a lot, @leseb. I’m able to create an object store, and to read/write data from/to it.

I’m running now with the following configuration (diff from rook v1.0.2):

$ git diff
diff --git a/cluster/examples/kubernetes/ceph/cluster.yaml b/cluster/examples/kubernetes/ceph/cluster.yaml
index cb130e88..aacad270 100644
--- a/cluster/examples/kubernetes/ceph/cluster.yaml
+++ b/cluster/examples/kubernetes/ceph/cluster.yaml
@@ -20,10 +20,10 @@ spec:
     # v12 is luminous, v13 is mimic, and v14 is nautilus.
     # RECOMMENDATION: In production, use a specific version tag instead of the general v14 flag, which pulls the latest release and could result in different
     # versions running within the cluster. See tags available at https://hub.docker.com/r/ceph/ceph/tags/.
-    image: ceph/ceph:v14.2.1-20190430
+    image: ceph/daemon:latest-master
     # Whether to allow unsupported versions of Ceph. Currently luminous, mimic and nautilus are supported, with the recommendation to upgrade to nautilus.
     # Do not set to true in production.
-    allowUnsupported: false
+    allowUnsupported: true
   # The path on the host where configuration files will be persisted. Must be specified.
   # Important: if you reinstall the cluster, make sure you delete this directory from each host or else the mons will fail to start on the new cluster.
   # In Minikube, the '/data' directory is configured to persist across reboots. Use "/data/rook" in Minikube environment.
@@ -77,6 +77,13 @@ spec:
 # If no mgr annotations are set, prometheus scrape annotations will be set by default.
 #   mgr:
   resources:
+    mgr:
+      limits:
+        cpu: "1"
+        memory: "1024Mi"
+      requests:
+        cpu: "1"
+        memory: "1024Mi"
 # The requests and limits set here, allow the mgr pod to use half of one CPU core and 1 gigabyte of memory
 #    mgr:
 #      limits:
@@ -90,8 +97,8 @@ spec:
 #    osd:
   storage: # cluster level storage configuration and selection
     useAllNodes: true
-    useAllDevices: true
-    deviceFilter:
+    useAllDevices: false
+    deviceFilter: vdb
     location:
     config:
       # The default and recommended storeType is dynamically set to bluestore for devices and filestore for directories.
carlos@ubuntu-vm:~/src/rook/rook$ 

The resources snippet for mgr is there to address a memory leak I reported earlier; it has nothing to do with this issue.

With this setup I see no errors in the logs, and Ceph looks healthy:

$ bash -x ./ceph-toolbox ceph status
++ kubectl -n rook-ceph get pods -lapp=rook-ceph-tools -o json
++ jq '.items[0].metadata.name'
++ tr -d '"'
+ pod=rook-ceph-tools-6488545f5b-dmf5q
+ '[' -n 'ceph status' ']'
+ exec kubectl -n rook-ceph exec rook-ceph-tools-6488545f5b-dmf5q -- ceph status
  cluster:
    id:     8051ed48-118a-458f-a438-2f9999241081
    health: HEALTH_WARN
            crush map has legacy tunables (require firefly, min is hammer)
 
  services:
    mon: 3 daemons, quorum a,b,c (age 27m)
    mgr: a(active, since 24m)
    osd: 4 osds: 4 up (since 25m), 4 in (since 25m)
    rgw: 1 daemon active (radosgw.gateway)
 
  data:
    pools:   6 pools, 600 pgs
    objects: 206 objects, 4.4 KiB
    usage:   4.0 GiB used, 392 GiB / 396 GiB avail
    pgs:     600 active+clean
 
$ bash -x ./ceph-toolbox ceph osd status
++ kubectl -n rook-ceph get pods -lapp=rook-ceph-tools -o json
++ jq '.items[0].metadata.name'
++ tr -d '"'
+ pod=rook-ceph-tools-6488545f5b-dmf5q
+ '[' -n 'ceph osd status' ']'
+ exec kubectl -n rook-ceph exec rook-ceph-tools-6488545f5b-dmf5q -- ceph osd status
+----+--------------+-------+-------+--------+---------+--------+---------+-----------+
| id |     host     |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
+----+--------------+-------+-------+--------+---------+--------+---------+-----------+
| 0  | k8s-worker-1 | 1030M | 97.9G |    0   |     0   |    0   |     0   | exists,up |
| 1  | k8s-worker-0 | 1029M | 97.9G |    0   |     0   |    0   |     0   | exists,up |
| 2  | k8s-worker-3 | 1030M | 97.9G |    0   |     0   |    0   |     0   | exists,up |
| 3  | k8s-worker-2 | 1030M | 97.9G |    0   |     0   |    0   |     0   | exists,up |
+----+--------------+-------+-------+--------+---------+--------+---------+-----------+
$

Notice the health warning bit “crush map has legacy tunables (require firefly, min is hammer)” — anything to worry about?
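
(For what it’s worth, the tunables can be inspected, and if desired bumped, from the toolbox; “hammer” matches the minimum the warning mentions, and changing the profile will trigger data movement. Toolbox pod name is a placeholder:)

kubectl -n rook-ceph exec rook-ceph-tools-<hash> -- ceph osd crush show-tunables
# Optional: raise the profile to clear the warning (expect rebalancing on a non-empty cluster).
kubectl -n rook-ceph exec rook-ceph-tools-<hash> -- ceph osd crush tunables hammer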

@Blaine @leseb Any ideas on this issue with OSDs having port conflicts? #3189 also seems to be the same issue. It smells like we have a configuration issue and we’ll need to backport a fix to 1.0.

  • We didn’t overhaul the osd config like the other daemons, but @Blaine can you think of config changes that might have affected the OSDs?
  • @leseb Are there osd settings that might be affecting the bindings?

@gtorre you should always upgrade if a new version is available 😃. Right now, the only Ceph version that has the fix is Ceph master (the Octopus dev branch); without it you don’t have any fix.

When ceph/ceph#28589 gets merged, ceph/daemon-base:latest-nautilus-devel will get the fix, and eventually (hopefully) so will Ceph 14.2.2, which is imminent.

In my deployed Rook cluster the Ceph version is 14.2.4, but the problem is still there:

core@master3 ~ $ /opt/bin/kube/kubectl -n rook-ceph logs po/rook-ceph-osd-0-6b685b5478-nzsnt
2019-12-09 15:11:58.179 7f3478524dc0 -1 Falling back to public interface
2019-12-09 15:11:58.194 7f3478524dc0 -1  Processor -- bind unable to bind to v2:10.244.70.5:7300/0 on any port in range 6800-7300: (99) Cannot assign requested address
2019-12-09 15:11:58.194 7f3478524dc0 -1  Processor -- bind was unable to bind. Trying again in 5 seconds
2019-12-09 15:12:03.209 7f3478524dc0 -1  Processor -- bind unable to bind to v2:10.244.70.5:7300/0 on any port in range 6800-7300: (99) Cannot assign requested address
2019-12-09 15:12:03.210 7f3478524dc0 -1  Processor -- bind was unable to bind. Trying again in 5 seconds
2019-12-09 15:12:08.225 7f3478524dc0 -1  Processor -- bind unable to bind to v2:10.244.70.5:7300/0 on any port in range 6800-7300: (99) Cannot assign requested address
2019-12-09 15:12:08.225 7f3478524dc0 -1  Processor -- bind was unable to bind after 3 attempts: (99) Cannot assign requested address

Set allowUnsupported: true in your CR.

and override this config option: ms_learn_addr_from_peer: false

Something like this?

apiVersion: v1
kind: ConfigMap
metadata:
  namespace: rook-ceph
  name: rook-config-override
data:
  config: |
    [global]
    ms_learn_addr_from_peer = false

While I’m at it, shall I raise the debug level?
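
For example, the same override could carry a higher messenger debug level; a sketch, where the exact level is a guess and the OSD pods have to be restarted to pick it up:

# Extend rook-config-override with messenger debugging and restart the OSDs.
cat <<'EOF' | kubectl -n rook-ceph apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: rook-ceph
  name: rook-config-override
data:
  config: |
    [global]
    ms_learn_addr_from_peer = false
    debug ms = 1
EOF
kubectl -n rook-ceph delete pod -l app=rook-ceph-osd   # pods are recreated by their deployments with the new config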

Perhaps sprinkling log statements in pick_address.cc and comparing the outputs of ceph-osd with hostNetwork: true (where it works) and hostNetwork: false could be of help …

I seem to be having the same problem as you, only I have just a single OSD that refuses to start and, as I verified with ceph osd dump, it also has an IP ending in .0; the rest start just fine. However, I’m completely lost on how to start debugging this. The barrier to entry for participating has become enormous, as I simply don’t know how to create a Docker image with a custom Ceph build.