openshift-ansible: Could not install cluster in AWS ap-southeast-2: failed to run Kubelet: could not init cloud provider "aws"

Description

Trying to spin up a fresh cluster on aws region ap-southeast-2.

$ ansible-playbook -v -i playbooks/aws/provisioning-inventory.yml playbooks/aws/openshift-cluster/prerequisites.yml -e @playbooks/aws/provisioning_vars.yml
"Succeeds"

$ ansible-playbook -v -i playbooks/aws/provisioning-inventory.yml playbooks/aws/openshift-cluster/build_ami.yml -e @playbooks/aws/provisioning_vars.yml
"Succeeds"

$ ansible-playbook -v -i playbooks/aws/provisioning-inventory.yml playbooks/aws/openshift-cluster/provision.yml -e @playbooks/aws/provisioning_vars.yml
"Succeeds"

$ ansible-playbook -v -i playbooks/aws/provisioning-inventory.yml playbooks/aws/openshift-cluster/install.yml -e @playbooks/aws/provisioning_vars.yml
"FAILS"
Version
$ansible --version
ansible 2.6.2
  config file = /home/josha/proj/openshift-ansible/ansible.cfg
  configured module search path = ['/home/josha/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /home/josha/.local/lib/python3.6/site-packages/ansible
  executable location = /usr/local/bin/ansible
  python version = 3.6.5 (default, Apr  1 2018, 05:46:30) [GCC 7.3.0]

$ git describe
openshift-ansible-3.10.43-1-10-g189969f82

Steps To Reproduce
  1. ansible-playbook -v -i playbooks/aws/provisioning-inventory.yml playbooks/aws/openshift-cluster/prerequisites.yml -e @playbooks/aws/provisioning_vars.yml

  2. ansible-playbook -v -i playbooks/aws/provisioning-inventory.yml playbooks/aws/openshift-cluster/build_ami.yml -e @playbooks/aws/provisioning_vars.yml

  3. ansible-playbook -v -i playbooks/aws/provisioning-inventory.yml playbooks/aws/openshift-cluster/provision.yml -e @playbooks/aws/provisioning_vars.yml

  4. ansible-playbook -v -i playbooks/aws/provisioning-inventory.yml playbooks/aws/openshift-cluster/install.yml -e @playbooks/aws/provisioning_vars.yml

Expected Results

I expect playbooks/aws/openshift-cluster/install.yml to succeed

Observed Results
TASK [openshift_control_plane : fail] *********************************************************************************************************************************************************************************************************************************************************************************************************************
Friday 07 September 2018  09:45:06 +1000 (0:00:00.151)       0:03:05.139 ****** 
fatal: [ec2-xxx-xxx-xxx-172.ap-southeast-2.compute.amazonaws.com]: FAILED! => {"changed": false, "msg": "Node start failed."}
fatal: [ec2-xxx-xxx-xxx-85.ap-southeast-2.compute.amazonaws.com]: FAILED! => {"changed": false, "msg": "Node start failed."}
fatal: [ec2-xxx-xxx-xxx-238.ap-southeast-2.compute.amazonaws.com]: FAILED! => {"changed": false, "msg": "Node start failed."}

NO MORE HOSTS LEFT ****************************************************************************************************************************************************************************************************************************************************************************************************************************************

PLAY RECAP ************************************************************************************************************************************************************************************************************************************************************************************************************************************************
ec2-xxx-xxx-xxx-238.ap-southeast-2.compute.amazonaws.com : ok=154  changed=22   unreachable=0    failed=1   
ec2-xxx-xxx-xxx-172.ap-southeast-2.compute.amazonaws.com : ok=207  changed=27   unreachable=0    failed=1   
ec2-xxx-xxx-xxx-85.ap-southeast-2.compute.amazonaws.com : ok=154  changed=22   unreachable=0    failed=1   
localhost                  : ok=18   changed=1    unreachable=0    failed=0   

Running journalctl -xe on one of the nodes, the following error shows up:

        "Sep 06 23:45:05 ip-xxx-xxx-xxx-160.ap-southeast-2.compute.internal origin-node[32115]: I0906 23:45:05.811802   32115 aws.go:1033] Building AWS cloudprovider",
        "Sep 06 23:45:05 ip-xxx-xxx-xxx-160.ap-southeast-2.compute.internal systemd[1]: Failed to start OpenShift Node.",
        "Sep 06 23:45:05 ip-xxx-xxx-xxx-160.ap-southeast-2.compute.internal origin-node[32115]: F0906 23:45:05.894033   32115 server.go:233] failed to run Kubelet: could not init cloud provider \"aws\": error finding instance i-085ddf3267cc5f2ce: \"error listing AWS instances: \\\"InvalidInstanceID.NotFound: The instance ID 'i-085ddf3267cc5f2ce' does not exist\\\\n\\\\tstatus code: 400, request id: 4cc327a9-45df-47da-9b81-2a38347a9ab9\\\"\"",

However, i-085ddf3267cc5f2ce definitely does exist and is one of the nodes. From some searching, I suspect the Kubelet is querying the wrong AWS region when it lists instances, but I'm not sure how to fix this.
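
One way to check the wrong-region theory (a sketch, assuming the AWS CLI is available somewhere with the same credentials; /etc/origin/cloudprovider/aws.conf is where the 3.x docs say the cloud provider config lives):

# From any machine with the same credentials: confirm the instance is visible in ap-southeast-2
$ aws ec2 describe-instances --instance-ids i-085ddf3267cc5f2ce --region ap-southeast-2

# On the failing node: the availability zone from instance metadata should match the cloud provider config
$ curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone
$ cat /etc/origin/cloudprovider/aws.conf

If aws.conf names a different zone/region than the metadata reports, that would explain the InvalidInstanceID.NotFound error above.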

Additional Information
  • Inventory file: playbooks/aws/provisioning-inventory.yml
[OSEv3:children]
masters
nodes
etcd

[OSEv3:vars]
################################################################################
# Ensure these variables are set for bootstrap
################################################################################
ansible_ssh_common_args='-o StrictHostKeyChecking=no'
ansible_ssh_user=centos

openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider'}]
openshift_master_default_subdomain=apps.openshift.<my-domain>.com
ansible_become=true

openshift_cloudprovider_kind=aws
openshift_cloudprovider_aws_access_key=<my-access-key>
openshift_cloudprovider_aws_secret_key=<my-secret-key>

# openshift_deployment_type is required for installation
openshift_deployment_type=origin

openshift_master_api_port=443

openshift_hosted_router_wait=False
openshift_hosted_registry_wait=False

openshift_clusterid=testing

################################################################################
# cluster-specific settings may be placed here

[masters]

[etcd]

[nodes]

  • Provisioning vars: playbooks/aws/provisioning_vars.yml

---
openshift_deployment_type: 'origin'
openshift_release: '3.10'
openshift_pkg_version:  '-3.10.0'

openshift_aws_clusterid: 'oc-test'

openshift_aws_region: ap-southeast-2

openshift_aws_create_launch_config: true
openshift_aws_create_scale_group: true

openshift_aws_create_vpc: true

openshift_aws_vpc:
  name: "{{ openshift_aws_vpc_name }}"
  cidr: 172.31.0.0/16
  subnets:
    ap-southeast-2:
    - cidr: 172.31.48.0/20
      az: "ap-southeast-2a"
      default_az: true
    - cidr: 172.31.32.0/20
      az: "ap-southeast-2b"
    - cidr: 172.31.16.0/20
      az: "ap-southeast-2c"

openshift_aws_create_security_groups: true

openshift_aws_ssh_key_name: joshainglis_key

openshift_aws_users:
- key_name: joshainglis_key
  username: centos
  pub_key: |
         <my-pub-key>

openshift_aws_build_ami_ssh_user: centos

container_runtime_docker_storage_type: overlay2
container_runtime_docker_storage_setup_device: /dev/xvdb

# ap-southeast-2 Official Centos AMI
openshift_aws_base_ami: ami-d8c21dba

openshift_aws_create_s3: True

openshift_aws_elb_cert_arn: 'arn:aws:acm:ap-southeast-2:<my-aws-account>:certificate/<cert-id>'

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 15 (4 by maintainers)

Most upvoted comments

I injected the credentials manually into /etc/sysconfig/origin-node and got a bit further in the process. I have no idea why the credentials were missing there; I believe it must be a bug in the Ansible scripts somewhere. I have no prior experience with Ansible, so unfortunately I'm not able to figure it out.
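
For reference, the manual injection described above amounts to roughly the following (file names per the OpenShift 3.x AWS docs; the key values are placeholders, and the service is atomic-openshift-node rather than origin-node on OCP):

# /etc/sysconfig/origin-node (and the equivalent master sysconfig files on masters)
AWS_ACCESS_KEY_ID=<my-access-key>
AWS_SECRET_ACCESS_KEY=<my-secret-key>

# then restart the node service
$ systemctl restart origin-node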