openshift-ansible: OKD 3.10 scale up playbook hang
Description
I recently installed an OKD 3.10 cluster with 1 master, 2 compute nodes, and 1 infra node. Today I tried to scale up by adding another compute node, and the playbook hangs at this task: TASK [Approve node certificates when bootstrapping]
It’s weird that oc get node already shows the new node in Ready status, and I can deploy services onto it, but I can’t see the logs of pods running there.
I get the error below when I run the diagnostics:
ERROR: [DNet2011 from diagnostic NetworkCheck@openshift/origin/pkg/oc/admin/diagnostics/diagnostics/cluster/network/run_pod.go:219] [Creating remote tar locally failed: error dialing backend: dial tcp: lookup openshiftnode5.aidoin.com on 10.18.32.131:53: no such host, , Deleting remote logdir "/tmp/openshift/nodes/openshiftnode5.aidoin.com" on node "openshiftnode5.aidoin.com" failed: error dialing backend: dial tcp: lookup openshiftnode5.aidoin.com on 10.18.32.131:53: no such host, ]
Version
openshift-ansible-openshift-ansible-3.10.36-1
Steps To Reproduce
- Install a cluster (1 master, 2 compute nodes, 1 infra node).
- Add a new compute node with the scale-up playbook.
The inventory file is attached: inventory.txt
Expected Results
The new node is up and running correctly.
Observed Results
The playbook hangs, and the new node does not work correctly.
Additional Information
[root@openshift openshift-ansible-openshift-ansible-3.10.36-1]# oc get csr
NAME                                                   AGE   REQUESTOR                                                 CONDITION
csr-td2cx                                              36m   system:node:openshiftnode5.aidoin.com                     Approved,Issued
node-csr-SavYdlvGiC1o1Rh6TqDsXvYWolrBKcf0jOxML2TN87E   36m   system:serviceaccount:openshift-infra:node-bootstrapper   Approved,Issued
[root@openshift openshift-ansible-openshift-ansible-3.10.36-1]#
[root@openshift openshift-ansible-openshift-ansible-3.10.36-1]# oc get node
NAME                        STATUS   ROLES     AGE   VERSION
openshift.aidoin.com        Ready    master    7d    v1.10.0+b81c8f8
openshiftnode2.aidoin.com   Ready    compute   7d    v1.10.0+b81c8f8
openshiftnode3.aidoin.com   Ready    compute   7d    v1.10.0+b81c8f8
openshiftnode4.aidoin.com   Ready    infra     7d    v1.10.0+b81c8f8
openshiftnode5.aidoin.com   Ready    compute   37m   v1.10.0+b81c8f8
[root@openshift openshift-ansible-openshift-ansible-3.10.36-1]#
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 1
- Comments: 27 (9 by maintainers)
@Mikedu1988 @aland-zhang @BrianHolsen
We don’t support /etc/hosts for nodes. You must have working DNS; the lack of it is likely the cause of these issues.
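One way to confirm this is to query the DNS server directly instead of going through the local resolver, because a lookup that also consults /etc/hosts can hide a missing DNS record. The following is only a minimal sketch: it assumes dig is installed (on CentOS/RHEL it is part of bind-utils), and the function name resolves_in_dns is made up for illustration. The hostname and server address in the example come from the error message in the issue description.

```shell
# Succeeds only if the given DNS server returns at least one record for the name.
# dig does not read /etc/hosts, so an entry there cannot mask a missing record.
resolves_in_dns() {
    host_name=$1
    dns_server=$2
    # A non-zero exit from dig (for example, no reply from the server) counts as failure.
    answer=$(dig +short "$host_name" "@$dns_server") || return 1
    # Ignore any comment lines (starting with ';') and require a non-empty answer.
    printf '%s\n' "$answer" | grep -v '^;' | grep -q .
}

# Example, with the values taken from the error message in this issue:
#   resolves_in_dns openshiftnode5.aidoin.com 10.18.32.131 \
#       && echo "resolves" || echo "no DNS record"
```

If the check fails for the new node but succeeds for the existing ones, adding the missing record on the DNS server is the first thing to try.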
@Mikedu1988 please create a gist/pastebin of verbose output (ansible-playbook -vvv) for task ‘Approve node certificates when bootstrapping’
Are there any pending CSRs (‘oc get csr’)? If so, please run ‘oc get csr -ojson’ and share the output.
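To pick out pending requests from a long list, the table printed by oc get csr can be filtered on its last column. This is a rough sketch, assuming the four-column layout (NAME, AGE, REQUESTOR, CONDITION) shown in the issue description; the helper name pending_csrs is hypothetical.

```shell
# Reads the table printed by `oc get csr` on stdin and prints the names of
# requests whose CONDITION (last column) is exactly "Pending".
pending_csrs() {
    awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Example (needs a logged-in oc client with sufficient rights):
#   oc get csr | pending_csrs
# Approving them by hand, one request at a time (xargs -r is a GNU extension):
#   oc get csr | pending_csrs | xargs -r -n 1 oc adm certificate approve
```

For the output posted in this issue the filter prints nothing, since both requests are already Approved,Issued.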
In either case, run: ‘oc get --raw /api/v1/nodes/openshiftnode5.aidoin.com/proxy/healthz’
We use that API endpoint to determine whether the node’s server certificate is valid and working. If the request fails and you don’t have any outstanding CSRs, you may be having network or firewall issues, or the node itself may be broken in some way. You’ll need to troubleshoot.
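Running the same request against a known-good node as well as the new one makes it easier to tell a node-specific problem from a cluster-wide one. A small sketch under that assumption; the function name check_node_healthz is invented here, and it relies only on the oc get --raw call quoted above.

```shell
# Prints one line per node: "<node> ok" when the health endpoint answers through
# the API server proxy, "<node> FAILED" when the request returns an error.
check_node_healthz() {
    for node in "$@"; do
        if oc get --raw "/api/v1/nodes/${node}/proxy/healthz" >/dev/null 2>&1; then
            echo "${node} ok"
        else
            echo "${node} FAILED"
        fi
    done
}

# Example, comparing an existing compute node with the newly added one:
#   check_node_healthz openshiftnode2.aidoin.com openshiftnode5.aidoin.com
```

Error output is discarded to keep the summary short; rerun the failing request by hand to see the actual message.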