openshift-ansible: when installing on AWS, node hostname and checked pod names don't match

Description

openshift-ansible-3.11.0-0.16.0
ansible-2.6.2-1.el7.noarch (epel)
Steps To Reproduce
  1. run installer on aws-based hosts
Expected Results

successful installation

Observed Results


2018-08-17 12:37:20,201 p=19271 u=root |  failed: [master1.ceijaug.internal] (item=etcd) => {"attempts": 60, "changed": false, "item": "etcd", "results": {"cmd": "/bin/oc get pod master-etcd-ip-192-199-0-7.ec2.internal -o json -n kube-system", "results": [{}], "returncode": 0, "stderr": "Error from server (NotFound): pods \"master-etcd-ip-192-199-0-7.ec2.internal\" not found\n", "stdout": ""}, "state": "list"}
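
For context, the failing task is of roughly this shape. This is a simplified, hypothetical reconstruction rather than the actual openshift-ansible code, with l_node_hostname standing in for the gathered hostname fact:

# Simplified sketch of the failing wait loop: the expected static-pod name
# embeds the node hostname fact, so when that fact has been overridden with
# the EC2 metadata name, oc reports NotFound until the retries run out.
- name: Wait for control plane pods to appear
  command: /bin/oc get pod master-{{ item }}-{{ l_node_hostname }} -o json -n kube-system
  register: control_plane_pods
  until: control_plane_pods.rc == 0
  retries: 60
  delay: 5
  with_items:
    - etcd
    - api
    - controllers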
Additional Information

[root@master1 ~]# oc get node
NAME                       STATUS     ROLES     AGE       VERSION
master1.ceijaug.internal   NotReady   <none>    6m        v1.11.0+d4cacc0


Most upvoted comments

I'm just now catching up on this issue; I've dug through the environment with @thoraxe.

What's happening, and I'm sorry if this is already clear to everyone else, is that the hostname of the hosts was changed after provisioning and before the installer ran, so hostname != metadata/hostname. That would not be valid if we were configuring the cloud provider; however, we're not configuring cloud provider integration, so that part is fine. The problem is that even when the cloud provider is not configured, we still override facts['common']['hostname'] with the name from the metadata API. I think we should disable this metadata inspection whenever we're not configuring the provider (see the sketch below).

I'm trying to work through the implications of this during an upgrade, however.
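
A minimal sketch of that proposal, assuming a guard on the openshift_cloudprovider_kind inventory variable (illustrative tasks, not the actual fact-gathering code):

# Illustrative only: consult the EC2 metadata service for the hostname
# solely when AWS cloud provider integration is configured, so hosts
# without provider integration keep the hostname they already have.
- name: Fetch the internal DNS name from EC2 metadata
  uri:
    url: http://169.254.169.254/latest/meta-data/local-hostname
    return_content: yes
  register: l_ec2_hostname
  when: openshift_cloudprovider_kind | default('') == 'aws'

- name: Override the hostname fact for cloud-provider installs only
  set_fact:
    l_node_hostname: "{{ l_ec2_hostname.content }}"
  when: openshift_cloudprovider_kind | default('') == 'aws'

- name: Keep the host's own FQDN otherwise
  set_fact:
    l_node_hostname: "{{ ansible_fqdn }}"
  when: openshift_cloudprovider_kind | default('') != 'aws'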

#9956 should fix that

In short: an install on AWS would use the AWS metadata service to override hostnames, and if the hostnames don't match it would fail to install. Once this fix lands in release-3.10, only the configured hostname will be used.

Have you looked through the AWS metadata to see if that hostname is there? Do any of the below commands return that name?

curl http://169.254.169.254/latest/meta-data/local-hostname
curl http://169.254.169.254/latest/meta-data/public-hostname
curl http://169.254.169.254/latest/meta-data/hostname

It's possibly getting populated around here, and here; line 273 looks like where the values are being set.

The Ansible playbooks reach out to the AWS metadata service to fetch the internal DNS name and replace whatever is set in the Ansible inventory, provided the metadata endpoint can be reached. Conceptually it behaves like the sketch below.
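
Again an illustrative sketch rather than the actual playbook code; a failed probe is tolerated so that hosts which cannot reach the metadata endpoint keep their inventory name:

# Probe the metadata endpoint; tolerate failure so off-cloud hosts fall
# back to the name Ansible already gathered for them.
- name: Probe the EC2 metadata service
  uri:
    url: http://169.254.169.254/latest/meta-data/local-hostname
    return_content: yes
    timeout: 2
  register: l_metadata_probe
  failed_when: false

- name: Replace the inventory hostname when the probe succeeded
  set_fact:
    l_node_hostname: "{{ l_metadata_probe.content if l_metadata_probe.status | default(-1) == 200 else ansible_fqdn }}"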

How were the node1.ceijaug.internal etc. hostnames set?

Note that in order to override master-etcd-ip-192-199-0-7.ec2.internal you'd need Route53 configuration so that the cloud provider would know about the name; a simple hostnamectl set-hostname won't work.
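
For illustration, publishing the custom name in a private hosted zone with the stock route53 module might look like the task below; the zone, record, and IP are placeholders taken from the logs above:

# Hypothetical example: register the custom hostname in the VPC's private
# Route53 zone so the AWS cloud provider can resolve it.
- name: Publish master1.ceijaug.internal in the private zone
  route53:
    state: present
    private_zone: true
    zone: ceijaug.internal
    record: master1.ceijaug.internal
    type: A
    ttl: 300
    value: 192.199.0.7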