openshift-ansible: openshift_control_plane : pods failed to appear
Description
Hi, I’m trying to install an OpenShift v3.11 cluster on OpenStack using openshift-ansible. However, the playbook “deploy_cluster.yml” fails with the error below:
TASK [openshift_control_plane : Wait for control plane pods to appear] ******************************************************************
FAILED - RETRYING: Wait for control plane pods to appear (60 retries left).
FAILED - RETRYING: Wait for control plane pods to appear (59 retries left).
FAILED - RETRYING: Wait for control plane pods to appear (58 retries left).
FAILED - RETRYING: Wait for control plane pods to appear (57 retries left).
FAILED - RETRYING: Wait for control plane pods to appear (56 retries left).
FAILED - RETRYING: Wait for control plane pods to appear (55 retries left).
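For context, as far as I can tell this task simply polls oc on the master for the master-api, master-controllers and master-etcd static pods in the kube-system namespace, so when it keeps retrying the useful detail is on the master itself. A minimal check, assuming the default 3.11 paths and that the playbook already installed the master-logs helper:
[root@master ~]# ls /etc/origin/node/pods/                               # static pod manifests the kubelet should be running
[root@master ~]# /usr/local/bin/master-logs api api 2>&1 | tail -n 20    # recent API server log lines
[root@master ~]# /usr/local/bin/master-logs etcd etcd 2>&1 | tail -n 20  # recent etcd log lines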
Version
- docker version: Version 1.13.1, API version 1.26, Package version docker-1.13.1-91.git07f3374.el7.centos.x86_64
- ansible --version: ansible 2.7.8
- rpm -qa | grep openshift: openshift-ansible-roles-3.11.37-1.git.0.3b8b341.el7.noarch, openshift-ansible-3.11.37-1.git.0.3b8b341.el7.noarch, centos-release-openshift-origin311-1-2.el7.centos.noarch, openshift-ansible-playbooks-3.11.37-1.git.0.3b8b341.el7.noarch, openshift-ansible-docs-3.11.37-1.git.0.3b8b341.el7.noarch
- git describe: openshift-ansible-3.11.90-1-12-g1ea6332
Steps To Reproduce
- run playbooks/prerequisites.yml
- run playbooks/deploy_cluster.yml (typical invocations shown below)
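For reference, both playbooks live under playbooks/ in the openshift-ansible checkout and run against the same inventory; a typical invocation (the inventory path below is only an example, not taken from this report) looks like:
[root@master ~]# ansible-playbook -i /etc/ansible/hosts playbooks/prerequisites.yml
[root@master ~]# ansible-playbook -i /etc/ansible/hosts playbooks/deploy_cluster.yml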
Expected Results
The cluster deploys successfully.
Example command and output or error messages
# tailf /var/log/messages
master origin-node: E0320 03:40:04.028981 38205 reflector.go:136] k8s.io/kubernetes/pkg/kubelet/kubelet.go:464: Failed to list *v1.Node: Get https://master.lab.example.com:8443/api/v1/nodes?fieldSelector=metadata.name%3Dmaster.lab.example.com&limit=500&resourceVersion=0: dial tcp 192.168.1.5:8443: connect: connection refused
# docker logs --tail 10 b9b0cfc5f98a
E0320 14:27:08.073902 1 leaderelection.go:234] error retrieving resource lock kube-system/openshift-master-controllers: Get https://master.lab.example.com:8443/api/v1/namespaces/kube-system/configmaps/openshift-master-controllers: dial tcp 192.168.1.5:8443: connect: connection refused
Additional Information
[root@master ~]# telnet 192.168.1.5 8443
Trying 192.168.1.5...
telnet: connect to address 192.168.1.5: Connection refused
[root@master ~]# netstat -ntlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:10250 0.0.0.0:* LISTEN 79549/hyperkube
tcp 0 0 192.168.1.5:2379 0.0.0.0:* LISTEN 83046/etcd
tcp 0 0 192.168.1.5:2380 0.0.0.0:* LISTEN 83046/etcd
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN 1/systemd
tcp 0 0 0.0.0.0:20048 0.0.0.0:* LISTEN 3894/rpc.mountd
tcp 0 0 0.0.0.0:53682 0.0.0.0:* LISTEN 3886/rpc.statd
tcp 0 0 172.17.0.1:53 0.0.0.0:* LISTEN 3803/dnsmasq
tcp 0 0 192.168.1.5:53 0.0.0.0:* LISTEN 3803/dnsmasq
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 3887/sshd
tcp 0 0 127.0.0.1:56921 0.0.0.0:* LISTEN 79549/hyperkube
tcp 0 0 0.0.0.0:2049 0.0.0.0:* LISTEN -
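The netstat output above confirms nothing is listening on 8443, i.e. the API server static pod never came up or keeps exiting. Assuming the standard k8s_* container names the 3.11 kubelet creates, a quick way to find that container and its exit reason is:
[root@master ~]# docker ps -a --format '{{.Names}} {{.Status}}' | grep -E 'k8s_(api|controllers|etcd)'
[root@master ~]# docker logs --tail 50 $(docker ps -aq --filter name=k8s_api | head -n 1)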
Any idea how to fix this, please? Thanks!
About this issue
- State: closed
- Created 5 years ago
- Comments: 25
nicely done!
Hi All,
I’m posting this solution here hoping it helps future me (and others) troubleshooting this; I’d been struggling with this issue for almost a week on both OCP and OKD, and this ticket has, so far, the most complete list of troubleshooting steps for it.
The issue described here sometimes manifests as a TLS error (https://github.com/openshift/openshift-ansible/issues/11375#issuecomment-475866704), as "Unable to connect to the server: forbidden" (https://github.com/openshift/openshift-ansible/issues/11444, https://github.com/openshift/openshift-ansible/issues/10606), and as an empty /etc/cni/net.d (https://github.com/openshift/openshift-ansible/issues/7967#issue-314580503, https://bugzilla.redhat.com/show_bug.cgi?id=1592010, https://bugzilla.redhat.com/show_bug.cgi?id=1635257).
In my investigation I used CentOS 7.6 and RHEL 7.6, with OKD 3.11 and OCP 3.11.82, under Vagrant. To me, the fact that I was using a virtual machine with more than one working NIC has something to do with this issue. I am not sure how the whole orchestration works or what triggers what, but following this and this, this is the procedure I followed to overcome the issue:
If, after doing all that, you see that /etc/cni/net.d/80-openshift-network.conf is not created and you therefore hit any of the three issues above, create the file with the content below while you’re waiting for the control plane pods to appear, then restart the node service. Again, I don’t understand why, but creating the file before you reach the "Wait for control plane pods to appear" task has no effect. Also, I could see that sometimes the file vanishes after you restart the node service; if you recreate it and restart the service again, the SDN reappears.
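For reference, on a 3.11 node running the default openshift-sdn plugin that file is normally just the SDN CNI stub. The exact content below is an assumption on my part (compare it with a healthy node before relying on it), and the node service is origin-node on OKD or atomic-openshift-node on OCP:
[root@master ~]# cat > /etc/cni/net.d/80-openshift-network.conf <<'EOF'
{
  "cniVersion": "0.2.0",
  "name": "openshift-sdn",
  "type": "openshift-sdn"
}
EOF
[root@master ~]# systemctl restart origin-node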
I could reproduce this fix on both OCP and OKD. In one of my attempts I bumped into this issue here, but then I restarted the server, ran deploy_cluster.yml again, and it succeeded (like it did here).
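As a side note on the multi-NIC observation above (an assumption on my part, not something verified in this thread): openshift-ansible supports per-host inventory variables that pin which addresses it uses, which removes the guesswork about NIC ordering. For example, in the host entries of the inventory:
master.lab.example.com openshift_ip=192.168.1.5 openshift_public_ip=192.168.1.5 openshift_hostname=master.lab.example.com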