openshift-ansible: failed installation - kernel: vxlan: Cannot bind port 4789, err=-97

Description

After attempting a multi-master installation, the nodes will not start: the origin-node service fails to come up.

Version

ansible 2.2.0.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = Default w/o overrides

openshift-ansible-3.6.9-1 (master branch)

Installation method: playbooks

Steps To Reproduce

ansible-playbook ~/openshift-ansible/playbooks/byo/openshift-cluster/config.yml
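
If more context is needed on the failing task, the same playbook can be rerun with Ansible's standard verbosity flag:

# rerun with verbose output to see the full module arguments and results
ansible-playbook -vvv ~/openshift-ansible/playbooks/byo/openshift-cluster/config.yml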

Expected Results

Successful installation.

Observed Results

FAILED - RETRYING: TASK: openshift_node : Start and enable node (1 retries left). fatal: [REPLACED_HOST_NAME]: FAILED! => {"attempts": 1, "changed": false, "failed": true, "msg": "Unable to start service origin-node: Job for origin-node.service failed because the control process exited with error code. See \"systemctl status origin-node.service\" and \"journalctl -xe\" for details.\n"}
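
To surface the detail the error message points to, run the commands it references on the failing node (standard systemd tooling; the journalctl excerpt under Additional Information was gathered this way):

systemctl status origin-node.service
journalctl -xe -u origin-node.service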

Additional Information

Excerpt from journalctl -xe:

Mar 25 11:29:52 MASTER-NODE-REPLACED-NAME origin-node[23418]: F0325 11:29:52.237522   23418 node.go:350] error: SDN node startup failed: Allocated ofport (-1) did not match request (1)
Mar 25 11:29:52 MASTER-NODE-REPLACED-NAME systemd[1]: origin-node.service: main process exited, code=exited, status=255/n/a
Mar 25 11:29:52 MASTER-NODE-REPLACED-NAME systemd[1]: Failed to start Origin Node.
Mar 25 11:29:52 MASTER-NODE-REPLACED-NAME systemd[1]: Unit origin-node.service entered failed state.
Mar 25 11:29:52 MASTER-NODE-REPLACED-NAME systemd[1]: origin-node.service failed.
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME systemd[1]: origin-node.service holdoff time over, scheduling restart.
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME systemd[1]: Cannot add dependency job for unit iptables.service, ignoring: Unit is masked.
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME systemd[1]: Starting Origin Node...
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME origin-node[23484]: I0325 11:29:57.762622   23484 node.go:61] Initializing SDN node of type "redhat/openshift-ovs-subnet" with configured hostname "MASTER-NODE-REPLACED-NAME" (IP ""), iptables sync period "30s"
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME origin-node[23484]: I0325 11:29:57.773608   23484 docker.go:418] Connecting to docker on unix:///var/run/docker.sock
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME origin-node[23484]: I0325 11:29:57.773638   23484 docker.go:438] Start docker client with request timeout=2m0s
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME origin-node[23484]: I0325 11:29:57.778087   23484 start_node.go:303] Starting node MASTER-NODE-REPLACED-NAME (v1.4.1)
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME origin-node[23484]: I0325 11:29:57.779433   23484 start_node.go:312] Connecting to API server https://MASTER-REPLACED-NAME:8443
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME origin-node[23484]: I0325 11:29:57.780783   23484 docker.go:418] Connecting to docker on unix:///var/run/docker.sock
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME origin-node[23484]: I0325 11:29:57.780855   23484 docker.go:438] Start docker client with request timeout=0s
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME dockerd-current[16015]: time="2017-03-25T11:29:57.781496605-04:00" level=info msg="{Action=_ping, LoginUID=4294967295, PID=23484}"
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME origin-node[23484]: I0325 11:29:57.782374   23484 node.go:141] Connecting to Docker at unix:///var/run/docker.sock
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME origin-node[23484]: I0325 11:29:57.880860   23484 manager.go:140] cAdvisor running in container: "/system.slice/origin-node.service"
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME origin-node[23484]: W0325 11:29:57.956063   23484 manager.go:148] unable to connect to Rkt api service: rkt: cannot tcp Dial rkt api service: dial tcp 127.0.0.1:15441: getsockopt: connection refused
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME ovs-vsctl[23532]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists del-br br0 -- add-br br0 -- set Bridge br0 fail-mode=secure protocols=OpenFlow13
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME ovs-vsctl[23533]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists del-port br0 vxlan0
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME ovs-vsctl[23534]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br0 vxlan0 -- set Interface vxlan0 ofport_request=1 type=vxlan "options:remote_ip=\"flow\"" "options:key=\"flow\""
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME kernel: vxlan: Cannot bind port 4789, err=-97
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME origin-node[23484]: F0325 11:29:57.996792   23484 node.go:350] error: SDN node startup failed: Allocated ofport (-1) did not match request (1)
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME systemd[1]: origin-node.service: main process exited, code=exited, status=255/n/a
Mar 25 11:29:58 MASTER-NODE-REPLACED-NAME systemd[1]: Failed to start Origin Node.
Mar 25 11:29:58 MASTER-NODE-REPLACED-NAME systemd[1]: Unit origin-node.service entered failed state.
Mar 25 11:29:58 MASTER-NODE-REPLACED-NAME systemd[1]: origin-node.service failed.
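
For reference, err=-97 is EAFNOSUPPORT ("address family not supported by protocol"), which is consistent with IPv6 being disabled on the host, the root cause worked out in the comments below. A quick check on an affected node (a sketch, assuming a stock RHEL 7 install):

# 1 means IPv6 is disabled via sysctl
cat /proc/sys/net/ipv6/conf/all/disable_ipv6
# prints a match if IPv6 was disabled on the kernel command line
grep -o 'ipv6.disable=1' /proc/cmdline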

Operating system: Red Hat Enterprise Linux Server release 7.3 (Maipo)

Inventory file:

# Create an OSEv3 group that contains the masters and nodes groups
[OSEv3:children]
masters
nodes
etcd
lb

# Set variables common for all OSEv3 hosts
[OSEv3:vars]
ansible_ssh_user=root
deployment_type=origin
enable_excluders=false

# uncomment the following to enable htpasswd authentication; defaults to DenyAllPasswordIdentityProvider
# openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider', 'filename': '/etc/openshift/openshift-passwd'}]

# master cluster ha variables using pacemaker or RHEL HA
openshift_master_cluster_hostname=cloud-master.REPLACED.DOMAIN
openshift_master_cluster_public_hostname=cloud-master.REPLACED.DOMAIN
openshift_master_cluster_method=native

# apply updated node defaults
openshift_node_kubelet_args={'pods-per-core': ['10'], 'max-pods': ['250'], 'image-gc-high-threshold': ['90'], 'image-gc-low-threshold': ['80']}

# override the default controller lease ttl
#osm_controller_lease_ttl=30

# enable ntp on masters to ensure proper failover
openshift_clock_enabled=true

# osm_default_subdomain=cloud.REPLACED.DOMAIN

# host group for masters
[masters]
cloud-m1e.REPLACED.DOMAIN
cloud-m2e.REPLACED.DOMAIN
cloud-m1w.REPLACED.DOMAIN

# host group for etcd
[etcd]
cloud-e1e.REPLACED.DOMAIN
cloud-e1w.REPLACED.DOMAIN
cloud-e2w.REPLACED.DOMAIN

# Specify load balancer host
[lb]
cloud-master.REPLACED.DOMAIN

# host group for nodes, includes region info
[nodes]
cloud-m1e.REPLACED.DOMAIN openshift_schedulable=False openshift_node_labels="{'region': 'infra', 'zone': 'default-east'}"
cloud-m2e.REPLACED.DOMAIN openshift_schedulable=False openshift_node_labels="{'region': 'infra', 'zone': 'default-east'}"
cloud-m1w.REPLACED.DOMAIN openshift_schedulable=False openshift_node_labels="{'region': 'infra', 'zone': 'default-west'}"
cloud-h1e.REPLACED.DOMAIN openshift_node_labels="{'region': 'primary', 'zone': 'east'}"
cloud-h2e.REPLACED.DOMAIN openshift_node_labels="{'region': 'primary', 'zone': 'east'}"
cloud-h1w.REPLACED.DOMAIN openshift_node_labels="{'region': 'primary', 'zone': 'west'}"
cloud-h2w.REPLACED.DOMAIN openshift_node_labels="{'region': 'primary', 'zone': 'west'}"

Most upvoted comments

Having the same issue (install fails, cannot bind to port 4789) while installing OpenShift 3.5. I previously got OpenShift 3.5 to work on an AWS RHEL 7.3 machine running the 3.10.0-514.10.2.el7.x86_64 kernel. However, the client I am working with is running RHEL 7.3 VMs with the 3.10.0-514.16.1.el7.x86_64 kernel, which is where the failing VMs landed after an upgrade from RHEL 7.2 to 7.3. I am running openvswitch-2.6.1-10.git20161206.el7fdp.x86_64 on both the working cluster and the failing ones.

UPDATE 1: Downgraded the kernel to 3.10.0-514.10.2.el7.x86_64 and still having the same problems.

UPDATE 2: The issue occurs because IPv6 is disabled. Edit /etc/default/grub and remove the ipv6.disable=1 entry from the GRUB_CMDLINE_LINUX line, run grub2-mkconfig -o /boot/grub2/grub.cfg, and reboot the VMs; that should fix the issue (see the sketch below). Red Hat's instructions for enabling IPv6 are at https://access.redhat.com/solutions/8709. This solution is thanks to this discussion: https://botbot.me/freenode/openshift-dev/2017-03-06/?page=5
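
A minimal sketch of those steps, assuming a BIOS-booted RHEL 7 host (on UEFI systems the grub.cfg lives elsewhere, e.g. under /boot/efi):

# drop ipv6.disable=1 from the kernel command line, regenerate the grub config, reboot
sed -i 's/\bipv6\.disable=1\b *//' /etc/default/grub
grub2-mkconfig -o /boot/grub2/grub.cfg
reboot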

Ok, bug is now public.

https://bugzilla.redhat.com/show_bug.cgi?id=1445054

There’s also a KCS article for Red Hat customers: https://access.redhat.com/solutions/3039771

Disabling IPv6 breaking vxlan is being tracked as a kernel bug; unfortunately, that bug is private right now. I’ve requested that it be made public, as there’s no sensitive data in it. If that happens, I’ll close this issue with a reference to that bug.

Assuming you’re using the master branch, try moving back to release-1.2
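
For example, reusing the checkout path from Steps To Reproduce (a sketch; substitute whichever release branch matches your target version):

cd ~/openshift-ansible
git checkout release-1.2
ansible-playbook playbooks/byo/openshift-cluster/config.yml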