openshift-ansible: Installation fails on origin-master-api restart attempt

Description

The installation fails when the playbook attempts to restart origin-master-api.

Version

Ansible

ansible 2.4.2.0
  config file = None
  configured module search path = [u'/home/aizi/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /home/aizi/.local/lib/python2.7/site-packages/ansible
  executable location = /home/aizi/.local/bin/ansible
  python version = 2.7.13 (default, Nov 24 2017, 17:33:09) [GCC 6.3.0 20170516]

openshift-ansible-3.9.0-0.35.0-8-g1a58f7fc7

Steps To Reproduce
  1. ansible-playbook -i os-hosts openshift-ansible/playbooks/prerequisites.yml
  2. ansible-playbook -i os-hosts openshift-ansible/playbooks/deploy_cluster.yml
Failure summary:

  1. Hosts:    master.dom
     Play:     Configure masters
     Task:     restart master api
     Message:  Unable to restart service origin-master-api: Job for origin-master-api.service failed because the control process exited with error code. See "systemctl status origin-master-api.service" and "journalctl -xe" for details.
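
For reference, the two diagnostics the error message points to can be run directly on the master, filtering the journal on the failing unit:

  systemctl status origin-master-api.service
  journalctl -xe --unit=origin-master-api.service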

Inventory file

[OSEv3:children]
masters
nodes
etcd

[masters]
master.dom

[nodes]
master.dom
node1.dom openshift_node_labels="{'region': 'infra','zone': 'default'}"
node2.dom
#="{'region': 'primary', 'zone': 'default'}"

[etcd]
master.dom

#[masters:vars]
#ansible_become=true

#[nodes:vars]
#ansible_become=true


[OSEv3:vars]
ansible_user=vagrant
ansible_become=true


openshift_deployment_type=origin

openshift_enable_service_catalog=false
openshift_service_catalog_image_prefix=openshift/origin-
openshift_service_catalog_image_version=latest

# You must enable Network Time Protocol (NTP) to prevent masters and nodes in the cluster from going out of sync.
openshift_clock_enabled=true

# Let's relax these check values for now
openshift_disable_check=memory_availability,disk_availability
#docker_storage

prerequisites.log gist

deploy_cluster.log gist

Additional Information

As the host I’m using Debian Stretch, but I get the same error from a fresh CentOS install. As the VM provider I’m using VirtualBox, where I have three boxes (the official CentOS box) with 2 GB RAM and 2 vCPUs each.

I’ve tried using the release-3.7 branch, and the openshift_release=v3.7 variable on the master branch, but got the same error.

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Comments: 20 (5 by maintainers)

Most upvoted comments

I had exactly the same issue as @vrutkovs during the installation of OpenShift Origin 3.9.

The problem was that I used the wrong IP in the /etc/hosts file.

I had written this after the first two default config lines: 127.0.0.1 hostname hostname.domain

The correct way is to simply let DNS give you the right IP, or to use the LAN IP: 192.168.x.x hostname hostname.domain

If you use 127.0.0.1 in /etc/hosts, the origin-master-api container tries to access itself on port 2379 instead of the container host / master.
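
For illustration, a corrected /etc/hosts could look like the following sketch; master.dom is taken from the inventory above, and 192.168.0.10 is a placeholder for the box’s actual LAN address:

  127.0.0.1      localhost localhost.localdomain
  192.168.0.10   master.dom master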

Where should I inject this line? In deploy_cluster.yml?

In the inventory file, in the [OSEv3:vars] group.
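
For example, reusing the inventory from this issue and assuming “this line” refers to the os_firewall_enabled setting suggested further down in the thread, the variable would sit with the other cluster-wide settings:

  [OSEv3:vars]
  ansible_user=vagrant
  ansible_become=true
  os_firewall_enabled=false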

Hi, same problem here on CentOS 7. It seems like etcd is configured to listen only on a specific interface. Wouldn’t it be easiest to just listen on all interfaces, as is done for the other services? This could be done by setting the listen URL to 0.0.0.0.
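
A sketch of what that would look like with the stock etcd RPM packaging (the file path and variable name assume the standard systemd environment file; note the advertise URL must remain a routable address, not 0.0.0.0):

  # /etc/etcd/etcd.conf
  ETCD_LISTEN_CLIENT_URLS=https://0.0.0.0:2379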

I found the issue yesterday. The official Vagrant CentOS box contains this line in /etc/hosts; it’s the first line, by the way.

  127.0.0.1 node2.dom node2.dom # When you change hostname in /etc/hosts, you should normally rename hostname here as well.

It should be removed or commented out. If CentOS is installed from scratch, this line doesn’t exist and the installation works fine.

I think an additional check should be added to the playbook.
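
A minimal shell sketch of such a check, to be run on each box (node2.dom is an example hostname; adjust the pattern to your own nodes):

  # Warn if the node's own FQDN resolves to loopback, which makes
  # containers dial 127.0.0.1 instead of the real host
  if getent hosts "$(hostname -f)" | grep -q '^127\.'; then
    echo "$(hostname -f) resolves to a loopback address; fix /etc/hosts" >&2
  fi

  # Comment out the Vagrant-injected loopback line
  sudo sed -i '/^127\.0\.0\.1.*node2\.dom/s/^/#/' /etc/hosts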

Could I use, for example, origin/release-3.7?

All release-* branches are considered stable; master would install 3.9, which is not yet released, though.
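
To do that, check out the stable branch before rerunning the playbooks (assuming a plain git clone of openshift-ansible):

  cd openshift-ansible
  git fetch origin
  git checkout release-3.7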

Hmm, interesting.

So the master fails to start because it can’t connect to etcd:

  F0201 21:39:38.430245 1030 start_api.go:67] [could not reach etcd(v2): client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 127.0.0.1:2379: getsockopt: connection refused

The etcd service seems to be running, but I’ve noticed firewalld has opened 2380 instead of 2379, while iptables seems to allow both 2379 and 2380 there.

Could you try rerunning this with os_firewall_enabled: false? I’m not really familiar with the Vagrant setup, but it might be something else blocking the connection.
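
A few standard commands to narrow down where the block is (run these on the master):

  # Is etcd actually listening on the client port?
  ss -tlnp | grep 2379

  # What has firewalld opened, and what do the iptables rules allow?
  sudo firewall-cmd --list-all
  sudo iptables -L -n | grep -E '2379|2380'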

Could you also attach the output of journalctl -b -el --unit=origin-master-api.service from the master?