openshift-ansible: failed installation - kernel: vxlan: Cannot bind port 4789, err=-97
Description
After attempting a multi-master installation, nodes will not start.
Version
ansible 2.2.0.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = Default w/o overrides
openshift-ansible-3.6.9-1
master branch
installing via playbooks
Steps To Reproduce
ansible-playbook ~/openshift-ansible/playbooks/byo/openshift-cluster/config.yml
Expected Results
Successful installation.
Observed Results
FAILED - RETRYING: TASK: openshift_node : Start and enable node (1 retries left).
fatal: [REPLACED_HOST_NAME]: FAILED! => {"attempts": 1, "changed": false, "failed": true, "msg": "Unable to start service origin-node: Job for origin-node.service failed because the control process exited with error code. See \"systemctl status origin-node.service\" and \"journalctl -xe\" for details.\n"}
Additional Information
Excerpt from journalctl -xe
Mar 25 11:29:52 MASTER-NODE-REPLACED-NAME origin-node[23418]: F0325 11:29:52.237522 23418 node.go:350] error: SDN node startup failed: Allocated ofport (-1) did not match request (1)
Mar 25 11:29:52 MASTER-NODE-REPLACED-NAME systemd[1]: origin-node.service: main process exited, code=exited, status=255/n/a
Mar 25 11:29:52 MASTER-NODE-REPLACED-NAME systemd[1]: Failed to start Origin Node.
Mar 25 11:29:52 MASTER-NODE-REPLACED-NAME systemd[1]: Unit origin-node.service entered failed state.
Mar 25 11:29:52 MASTER-NODE-REPLACED-NAME systemd[1]: origin-node.service failed.
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME systemd[1]: origin-node.service holdoff time over, scheduling restart.
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME systemd[1]: Cannot add dependency job for unit iptables.service, ignoring: Unit is masked.
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME systemd[1]: Starting Origin Node...
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME origin-node[23484]: I0325 11:29:57.762622 23484 node.go:61] Initializing SDN node of type "redhat/openshift-ovs-subnet" with configured hostname "MASTER-NODE-REPLACED-NAME" (IP ""), iptables sync period "30s"
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME origin-node[23484]: I0325 11:29:57.773608 23484 docker.go:418] Connecting to docker on unix:///var/run/docker.sock
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME origin-node[23484]: I0325 11:29:57.773638 23484 docker.go:438] Start docker client with request timeout=2m0s
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME origin-node[23484]: I0325 11:29:57.778087 23484 start_node.go:303] Starting node MASTER-NODE-REPLACED-NAME (v1.4.1)
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME origin-node[23484]: I0325 11:29:57.779433 23484 start_node.go:312] Connecting to API server https://MASTER-REPLACED-NAME:8443
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME origin-node[23484]: I0325 11:29:57.780783 23484 docker.go:418] Connecting to docker on unix:///var/run/docker.sock
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME origin-node[23484]: I0325 11:29:57.780855 23484 docker.go:438] Start docker client with request timeout=0s
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME dockerd-current[16015]: time="2017-03-25T11:29:57.781496605-04:00" level=info msg="{Action=_ping, LoginUID=4294967295, PID=23484}"
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME origin-node[23484]: I0325 11:29:57.782374 23484 node.go:141] Connecting to Docker at unix:///var/run/docker.sock
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME origin-node[23484]: I0325 11:29:57.880860 23484 manager.go:140] cAdvisor running in container: "/system.slice/origin-node.service"
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME origin-node[23484]: W0325 11:29:57.956063 23484 manager.go:148] unable to connect to Rkt api service: rkt: cannot tcp Dial rkt api service: dial tcp 127.0.0.1:15441: getsockopt: connection refused
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME ovs-vsctl[23532]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists del-br br0 -- add-br br0 -- set Bridge br0 fail-mode=secure protocols=OpenFlow13
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME ovs-vsctl[23533]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists del-port br0 vxlan0
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME ovs-vsctl[23534]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br0 vxlan0 -- set Interface vxlan0 ofport_request=1 type=vxlan "options:remote_ip=\"flow\"" "options:key=\"flow\""
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME kernel: vxlan: Cannot bind port 4789, err=-97
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME origin-node[23484]: F0325 11:29:57.996792 23484 node.go:350] error: SDN node startup failed: Allocated ofport (-1) did not match request (1)
Mar 25 11:29:57 MASTER-NODE-REPLACED-NAME systemd[1]: origin-node.service: main process exited, code=exited, status=255/n/a
Mar 25 11:29:58 MASTER-NODE-REPLACED-NAME systemd[1]: Failed to start Origin Node.
Mar 25 11:29:58 MASTER-NODE-REPLACED-NAME systemd[1]: Unit origin-node.service entered failed state.
Mar 25 11:29:58 MASTER-NODE-REPLACED-NAME systemd[1]: origin-node.service failed.
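For context on the kernel line above: err=-97 is -EAFNOSUPPORT ("address family not supported"), which the vxlan module returns when it cannot open its IPv6 socket on UDP port 4789, i.e. when IPv6 is disabled on the host. A quick diagnostic sketch (paths are the standard Linux sysctl locations, not taken from the original report):

```shell
# err=-97 is -EAFNOSUPPORT: the vxlan module could not open its IPv6
# socket on UDP port 4789. Check whether IPv6 is disabled on this host.
if [ -r /proc/sys/net/ipv6/conf/all/disable_ipv6 ]; then
    # 1 means IPv6 was disabled via sysctl; 0 means it is enabled.
    cat /proc/sys/net/ipv6/conf/all/disable_ipv6
else
    # The whole IPv6 sysctl tree is missing, e.g. the host was booted
    # with ipv6.disable=1 on the kernel command line.
    echo "IPv6 stack absent (e.g. booted with ipv6.disable=1)"
fi
```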
Red Hat Enterprise Linux Server release 7.3 (Maipo)
Inventory file:
# Create an OSEv3 group that contains the masters and nodes groups
[OSEv3:children]
masters
nodes
etcd
lb
# Set variables common for all OSEv3 hosts
[OSEv3:vars]
ansible_ssh_user=root
deployment_type=origin
enable_excluders=false
# uncomment the following to enable htpasswd authentication; defaults to DenyAllPasswordIdentityProvider
# openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider', 'filename': '/etc/openshift/openshift-passwd'}]
# master cluster ha variables using pacemaker or RHEL HA
openshift_master_cluster_hostname=cloud-master.REPLACED.DOMAIN
openshift_master_cluster_public_hostname=cloud-master.REPLACED.DOMAIN
openshift_master_cluster_method=native
# apply updated node defaults
openshift_node_kubelet_args={'pods-per-core': ['10'], 'max-pods': ['250'], 'image-gc-high-threshold': ['90'], 'image-gc-low-threshold': ['80']}
# override the default controller lease ttl
#osm_controller_lease_ttl=30
# enable ntp on masters to ensure proper failover
openshift_clock_enabled=true
# osm_default_subdomain=cloud.REPLACED.DOMAIN
# host group for masters
[masters]
cloud-m1e.REPLACED.DOMAIN
cloud-m2e.REPLACED.DOMAIN
cloud-m1w.REPLACED.DOMAIN
# host group for etcd
[etcd]
cloud-e1e.REPLACED.DOMAIN
cloud-e1w.REPLACED.DOMAIN
cloud-e2w.REPLACED.DOMAIN
# Specify load balancer host
[lb]
cloud-master.REPLACED.DOMAIN
# host group for nodes, includes region info
[nodes]
cloud-m1e.REPLACED.DOMAIN openshift_schedulable=False openshift_node_labels="{'region': 'infra', 'zone': 'default-east'}"
cloud-m2e.REPLACED.DOMAIN openshift_schedulable=False openshift_node_labels="{'region': 'infra', 'zone': 'default-east'}"
cloud-m1w.REPLACED.DOMAIN openshift_schedulable=False openshift_node_labels="{'region': 'infra', 'zone': 'default-west'}"
cloud-h1e.REPLACED.DOMAIN openshift_node_labels="{'region': 'primary', 'zone': 'east'}"
cloud-h2e.REPLACED.DOMAIN openshift_node_labels="{'region': 'primary', 'zone': 'east'}"
cloud-h1w.REPLACED.DOMAIN openshift_node_labels="{'region': 'primary', 'zone': 'west'}"
cloud-h2w.REPLACED.DOMAIN openshift_node_labels="{'region': 'primary', 'zone': 'west'}"
About this issue
- State: closed
- Created 7 years ago
- Comments: 28 (10 by maintainers)
Having the same issue as well (install fails, cannot bind to port 4789) while installing OpenShift 3.5. Previously, I got OpenShift 3.5 working on an AWS RHEL 7.3 machine running the 3.10.0-514.10.2.el7.x86_64 kernel. Now, however, the client I am working with is running RHEL 7.3 VMs with a 3.10.0-514.16.1.el7.x86_64 kernel, which is where the failing VMs landed after a 7.2-to-7.3 upgrade. I am running openvswitch-2.6.1-10.git20161206.el7fdp.x86_64 on both the good cluster and the failing ones.
UPDATE 1: Downgraded the kernel to 3.10.0-514.10.2.el7.x86_64 and am still having the same problem.
UPDATE 2: The issue is that IPv6 is disabled. Edit /etc/default/grub and remove the ipv6.disable=1 entry from the GRUB_CMDLINE_LINUX line, then run grub2-mkconfig -o /boot/grub2/grub.cfg and reboot the VMs. That should fix the issue. Instructions for enabling IPv6 are at: https://access.redhat.com/solutions/8709. This solution is thanks to this discussion: https://botbot.me/freenode/openshift-dev/2017-03-06/?page=5
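The steps above can be sketched as follows. The sed edit is demonstrated against a hypothetical sample copy of /etc/default/grub (your GRUB_CMDLINE_LINUX contents will differ); on a real host you would edit the file in place, regenerate the grub config, and reboot.

```shell
# Hypothetical sample of a GRUB_CMDLINE_LINUX line with IPv6 disabled;
# the actual contents of /etc/default/grub will differ per host.
sample=$(mktemp)
cat > "$sample" <<'EOF'
GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=rhel/root ipv6.disable=1 rhgb quiet"
EOF

# Strip the ipv6.disable=1 entry from the kernel command line.
sed -i 's/ ipv6\.disable=1//' "$sample"

cat "$sample"
# On the real host, you would then regenerate the grub config and reboot:
#   grub2-mkconfig -o /boot/grub2/grub.cfg
#   reboot
```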
Ok, bug is now public.
https://bugzilla.redhat.com/show_bug.cgi?id=1445054 There’s also a KCS article for Red Hat Customers https://access.redhat.com/solutions/3039771
Disabling IPv6 breaking vxlan is being tracked as a kernel bug; unfortunately, that bug is private right now. I've requested that it be made public, as there's no sensitive data in it. If that happens, I'll close this with a reference to that bug.
Assuming you're using the master branch, try moving back to release-1.2.
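Switching branches would look something like the following (the ~/openshift-ansible path is assumed from the reproduction step above). The checkout is demonstrated here on a throwaway repository with a release-1.2 branch, since the real clone is host-specific:

```shell
# On the real host, the commands would be:
#   cd ~/openshift-ansible
#   git fetch origin
#   git checkout release-1.2
#   ansible-playbook playbooks/byo/openshift-cluster/config.yml
#
# Demonstrated on a throwaway repo that has a release-1.2 branch:
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" -c user.email=t@example.com -c user.name=t \
    commit -q --allow-empty -m "initial commit"
git -C "$repo" branch release-1.2
git -C "$repo" checkout -q release-1.2
git -C "$repo" rev-parse --abbrev-ref HEAD   # prints the branch now checked out
```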