openshift-ansible: OKD 3.10 scale up playbook hang

Description

I recently installed an OKD 3.10 cluster with 1 master, 2 compute nodes, and 1 infra node. Today I tried to scale up by adding another compute node, and the playbook hangs at the task: TASK [Approve node certificates when bootstrapping]

It’s weird that when I run oc get node, the new node is already in Ready status, and I can deploy services to it, but I can’t see the logs of its pods.

I get the error below when I run diagnostics:

ERROR: [DNet2011 from diagnostic NetworkCheck@openshift/origin/pkg/oc/admin/diagnostics/diagnostics/cluster/network/run_pod.go:219] [Creating remote tar locally failed: error dialing backend: dial tcp: lookup openshiftnode5.aidoin.com on 10.18.32.131:53: no such host, , Deleting remote logdir "/tmp/openshift/nodes/openshiftnode5.aidoin.com" on node "openshiftnode5.aidoin.com" failed: error dialing backend: dial tcp: lookup openshiftnode5.aidoin.com on 10.18.32.131:53: no such host, ]

Version

openshift-ansible-openshift-ansible-3.10.36-1

Steps To Reproduce

Install a cluster.
Add a new node.

The inventory file is attached: inventory.txt

Expected Results

The new node is up and running correctly.

Observed Results

The playbook hangs, and the new node is not running correctly.

Additional Information

[root@openshift openshift-ansible-openshift-ansible-3.10.36-1]# oc get csr
NAME                                                   AGE   REQUESTOR                                                 CONDITION
csr-td2cx                                              36m   system:node:openshiftnode5.aidoin.com                     Approved,Issued
node-csr-SavYdlvGiC1o1Rh6TqDsXvYWolrBKcf0jOxML2TN87E   36m   system:serviceaccount:openshift-infra:node-bootstrapper   Approved,Issued
[root@openshift openshift-ansible-openshift-ansible-3.10.36-1]#

[root@openshift openshift-ansible-openshift-ansible-3.10.36-1]# oc get node
NAME                        STATUS   ROLES     AGE   VERSION
openshift.aidoin.com        Ready    master    7d    v1.10.0+b81c8f8
openshiftnode2.aidoin.com   Ready    compute   7d    v1.10.0+b81c8f8
openshiftnode3.aidoin.com   Ready    compute   7d    v1.10.0+b81c8f8
openshiftnode4.aidoin.com   Ready    infra     7d    v1.10.0+b81c8f8
openshiftnode5.aidoin.com   Ready    compute   37m   v1.10.0+b81c8f8
[root@openshift openshift-ansible-openshift-ansible-3.10.36-1]#

About this issue

State: closed
Created 6 years ago
Reactions: 1
Comments: 27 (9 by maintainers)

Most upvoted comments

@Mikedu1988 @aland-zhang @BrianHolsen

We don’t support /etc/hosts for nodes. You must have working DNS; the lack of it is likely the cause of these issues.

michaelgugino on Sep 4, 2018
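
One way to check the DNS requirement above is to query the DNS server directly instead of going through the system resolver, which typically also consults /etc/hosts and can therefore hide a missing record. The following is only a minimal sketch, not something the maintainers posted: the function name resolves_via_dns is hypothetical, it assumes dig is installed on the host where it runs, and the hostname and server address in the example are the ones from the error message in the description.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if SERVER returns at least one record
# for NAME. The query goes straight to the DNS server, so /etc/hosts
# entries are ignored.
resolves_via_dns() {
    name=$1
    server=$2
    # `dig +short` prints one answer per line and nothing for a missing name.
    # Lines starting with ';' are diagnostics (e.g. timeouts), not answers.
    dig +short "$name" "@$server" | grep -v '^;' | grep -q .
}

# Example (values taken from the error message in this report):
#   resolves_via_dns openshiftnode5.aidoin.com 10.18.32.131 \
#       || echo "no DNS record for openshiftnode5.aidoin.com"
```

If the check fails for the new node but passes for the existing ones, that would be consistent with the "no such host" error reported by the diagnostic.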

@Mikedu1988 please create a gist/pastebin of verbose output (ansible-playbook -vvv) for task ‘Approve node certificates when bootstrapping’

Are there any pending CSRs (check with ‘oc get csr’)? If so, please run ‘oc get csr -ojson’ and share the output.
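
As a rough sketch of the "any pending CSRs?" check (not part of the maintainer's request), the table printed by oc get csr can be filtered on its CONDITION column. The helper name pending_csrs is hypothetical, and the filter assumes the four-column layout shown in the Additional Information section, with CONDITION as the last column.

```shell
#!/bin/sh
# Hypothetical helper: read the table printed by `oc get csr` on stdin and
# print the NAME of every request whose CONDITION column is exactly "Pending".
pending_csrs() {
    # NR > 1 skips the header row; $NF is the last column (CONDITION).
    awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Example:
#   oc get csr | pending_csrs
```

With the output from this report, where both requests are Approved,Issued, the filter would print nothing. Any name it does print could be inspected further or, if the request is trusted, approved with oc adm certificate approve.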

In either case, run: ‘oc get --raw /api/v1/nodes/openshiftnode5.aidoin.com/proxy/healthz’

We use that api endpoint to determine if the node’s server certificate is valid and working. If that fails and you don’t have any outstanding csrs, you may be having network or firewall issues, or the node itself may be broken in some way. You’ll need to troubleshoot.

michaelgugino on Aug 29, 2018
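
The healthz check described in the comment above can be repeated by hand while troubleshooting. The sketch below polls the same API path a few times and reports whether it ever answers. It is an illustration only and not the playbook's actual implementation: the function name wait_for_node_healthz and the default retry count and delay are assumptions, and it relies on oc get --raw exiting non-zero when the request fails.

```shell
#!/bin/sh
# Hypothetical helper: poll the node's healthz endpoint through the API
# server proxy until it answers, or give up after ATTEMPTS tries.
wait_for_node_healthz() {
    node=$1
    attempts=${2:-10}   # assumed default, not taken from the playbook
    delay=${3:-5}       # seconds between tries, also an assumption
    i=1
    while [ "$i" -le "$attempts" ]; do
        # A failed request makes `oc get --raw` exit non-zero.
        if oc get --raw "/api/v1/nodes/$node/proxy/healthz" >/dev/null 2>&1; then
            echo "healthz reachable for $node (attempt $i)"
            return 0
        fi
        # Wait before the next try, but not after the final one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    echo "healthz still failing for $node after $attempts attempts" >&2
    return 1
}

# Example:
#   wait_for_node_healthz openshiftnode5.aidoin.com 10 5
```

If this keeps failing while oc get csr shows no pending requests, the comment above points to network, firewall, or node problems as the next things to investigate.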