openshift-ansible: openshift_control_plane : pods failed to appear

Description

Hi, I’m trying to install an OpenShift v3.11 cluster on OpenStack using openshift-ansible. However, the playbook “deploy_cluster.yml” is failing with the error below:

TASK [openshift_control_plane : Wait for control plane pods to appear] ******************************************************************
FAILED - RETRYING: Wait for control plane pods to appear (60 retries left).
FAILED - RETRYING: Wait for control plane pods to appear (59 retries left).
FAILED - RETRYING: Wait for control plane pods to appear (58 retries left).
FAILED - RETRYING: Wait for control plane pods to appear (57 retries left).
FAILED - RETRYING: Wait for control plane pods to appear (56 retries left).
FAILED - RETRYING: Wait for control plane pods to appear (55 retries left).
Version
  • docker version
    Version: 1.13.1
    API version: 1.26
    Package version: docker-1.13.1-91.git07f3374.el7.centos.x86_64

  • ansible --version ansible 2.7.8

  • rpm -qa | grep openshift
    openshift-ansible-roles-3.11.37-1.git.0.3b8b341.el7.noarch
    openshift-ansible-3.11.37-1.git.0.3b8b341.el7.noarch
    centos-release-openshift-origin311-1-2.el7.centos.noarch
    openshift-ansible-playbooks-3.11.37-1.git.0.3b8b341.el7.noarch
    openshift-ansible-docs-3.11.37-1.git.0.3b8b341.el7.noarch

  • git describe openshift-ansible-3.11.90-1-12-g1ea6332

Steps To Reproduce
  1. Run playbooks/prerequisites.yml
  2. Run playbooks/deploy_cluster.yml
Expected Results

The cluster is deployed successfully.

Example command and output or error messages

#tailf /var/log/messages

master origin-node: E0320 03:40:04.028981   38205 reflector.go:136] k8s.io/kubernetes/pkg/kubelet/kubelet.go:464: Failed to list *v1.Node: Get https://master.lab.example.com:8443/api/v1/nodes?fieldSelector=metadata.name%3Dmaster.lab.example.com&limit=500&resourceVersion=0: dial tcp 192.168.1.5:8443: connect: connection refused
# docker logs --tail 10 b9b0cfc5f98a

E0320 14:27:08.073902       1 leaderelection.go:234] error retrieving resource lock kube-system/openshift-master-controllers: Get https://master.lab.example.com:8443/api/v1/namespaces/kube-system/configmaps/openshift-master-controllers: dial tcp 192.168.1.5:8443: connect: connection refused
Additional Information
[root@master ~]# telnet 192.168.1.5 8443
Trying 192.168.1.5...
telnet: connect to address 192.168.1.5: Connection refused

[root@master ~]# netstat -ntlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:10250           0.0.0.0:*               LISTEN      79549/hyperkube
tcp        0      0 192.168.1.5:2379        0.0.0.0:*               LISTEN      83046/etcd
tcp        0      0 192.168.1.5:2380        0.0.0.0:*               LISTEN      83046/etcd
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN      1/systemd
tcp        0      0 0.0.0.0:20048           0.0.0.0:*               LISTEN      3894/rpc.mountd
tcp        0      0 0.0.0.0:53682           0.0.0.0:*               LISTEN      3886/rpc.statd
tcp        0      0 172.17.0.1:53           0.0.0.0:*               LISTEN      3803/dnsmasq
tcp        0      0 192.168.1.5:53          0.0.0.0:*               LISTEN      3803/dnsmasq
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      3887/sshd
tcp        0      0 127.0.0.1:56921         0.0.0.0:*               LISTEN      79549/hyperkube
tcp        0      0 0.0.0.0:2049            0.0.0.0:*               LISTEN      -
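As an aside, the listing above can be checked mechanically. The helper below is purely illustrative (`listens_on` is a hypothetical name, not part of the playbook or any tool); it scans netstat/ss-style output for a LISTEN socket on a given port, and confirms what the output already shows: etcd (2379/2380) and the kubelet (10250) are up, but nothing serves the API on 8443.

```shell
#!/bin/sh
# Illustrative helper (hypothetical, not from openshift-ansible): succeed if
# the given port appears as a LISTEN socket in netstat/ss-style output on stdin.
listens_on() {
  awk -v p=":$1$" '$4 ~ p && $6 == "LISTEN" { found = 1 } END { exit !found }'
}

# On the node, this would report the missing apiserver listener:
#   netstat -ntlp | listens_on 8443 && echo "API up" || echo "API down on 8443"
```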

Any idea how to fix this, please? Thanks!

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 25

Most upvoted comments

nicely done!

Hi All,

I’m posting this solution here hoping it will help future me (and others) troubleshooting this. I’ve been struggling with this issue for almost a week on both OCP and OKD, and of all the tickets I’ve found, this one has the most complete list of troubleshooting steps for the issue.

The issue described here sometimes manifests as a TLS error (https://github.com/openshift/openshift-ansible/issues/11375#issuecomment-475866704), as Unable to connect to the server: forbidden (https://github.com/openshift/openshift-ansible/issues/11444, https://github.com/openshift/openshift-ansible/issues/10606), and as an empty /etc/cni/net.d (https://github.com/openshift/openshift-ansible/issues/7967#issue-314580503, https://bugzilla.redhat.com/show_bug.cgi?id=1592010 and https://bugzilla.redhat.com/show_bug.cgi?id=1635257).

In my investigation, I used CentOS 7.6 and RHEL 7.6 with OKD 3.11 and OCP 3.11.82 under Vagrant. To me, the fact that I was using a virtual machine with more than one working NIC has something to do with this issue. I am not sure how the whole orchestration works or what triggers what, but following this and this, this is the procedure I followed to overcome the issue:

  • Before running prerequisites.yml, ensure that:
    • /etc/dnsmasq.d/origin-upstream-dns.conf has a valid upstream DNS entry (very useful if you’re using Vagrant and have bridged/NAT’ed interfaces). Also test that the node can successfully resolve DNS names.
    • If you’re using a fake DNS (as example.com), create an entry on dnsmasq:
    $ cat /etc/dnsmasq.d/foo.example.com.conf
    address=/foo.example.com/192.168.1.30
    
    • Confirm that /etc/resolv.conf lists your node’s own IP as nameserver (in my case, since I had a bridged NIC, it had to be the bridged interface’s IP).
    • Finally, if you have more than one NIC, confirm that the right one carries the default route (also useful on Vagrant). You can change the route by running:
    $ ip route delete default
    $ ip route add default via <correct-gateway-ip>
    

If, after doing all that, you see that /etc/cni/net.d/80-openshift-network.conf is not created and you therefore hit any of the three issues above, create the file with the content below while you’re waiting for the control plane pods to appear, then restart the node service:

$ cat /etc/cni/net.d/80-openshift-network.conf
{
  "cniVersion": "0.2.0",
  "name": "openshift-sdn",
  "type": "openshift-sdn"
}

$ systemctl restart origin-node.service
# If OCP, then: systemctl restart atomic-openshift-node.service

Again, I don’t understand why, but creating the file before the playbook starts waiting for the control plane pods has no effect. Also, I noticed that sometimes the file vanishes after you restart the node service; if you recreate it and restart the service again, the SDN reappears.
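That recreate-and-restart cycle can be scripted. A minimal sketch, assuming an OKD 3.11 node (origin-node.service; on OCP substitute atomic-openshift-node.service) and using the exact CNI config from the workaround above; the function names are my own:

```shell
#!/bin/sh
# Sketch: recreate the SDN CNI config whenever it vanishes, then restart the
# node service. Path and service name are assumptions for an OKD 3.11 node.
CNI_CONF="${CNI_CONF:-/etc/cni/net.d/80-openshift-network.conf}"
NODE_SVC="${NODE_SVC:-origin-node.service}"   # atomic-openshift-node.service on OCP

write_cni_conf() {
  # Exact content from the workaround above
  cat > "$1" <<'EOF'
{
  "cniVersion": "0.2.0",
  "name": "openshift-sdn",
  "type": "openshift-sdn"
}
EOF
}

ensure_cni_conf() {
  if [ ! -s "$CNI_CONF" ]; then
    write_cni_conf "$CNI_CONF"
    systemctl restart "$NODE_SVC"
  fi
}

# While the playbook's "Wait for control plane pods to appear" task retries:
#   while :; do ensure_cni_conf; sleep 10; done
```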

I could reproduce this fix on both OCP and OKD. In one of my attempts, I bumped into this issue here, but then I restarted the server, ran deploy_cluster.yml again, and it succeeded (like it did here).