aws-parallelcluster: ComputeFleet:Create fails in CloudFormation, but Compute Auto Scaling group is still created?

I am setting up a cluster with pcluster, and I repeatedly get this error when I create the cluster:

AWS::CloudFormation::Stack parallelcluster-galaxy-HPC The following resource(s) failed to create: [ComputeFleet].
AWS::AutoScaling::AutoScalingGroup ComputeFleet Received 2 FAILURE signal(s) out of 2. Unable to satisfy 100% MinSuccessfulInstancesPercent requirement

This error is reproducible across changes to private and public subnets, OS, schedulers, cluster types, and groups. Sometimes I get 1 FAILURE signal and sometimes 2; I cannot predict which. Regardless, the resource is still created, and I can see it and its activity in the console.

Below is the environment I’m trying to use:

Environment:

AWS ParallelCluster version 2.4

[aws]
aws_region_name = us-east-1

[cluster HPC]
key_name = *****
vpc_settings = vpc
base_os: alinux
scheduler: slurm
initial_queue_size = 2
maintain_initial_size = true
placement_group = DYNAMIC
placement = compute
master instance type = c5.large
compute instance type = c5.large

[vpc vpc]
vpc_id: vpc-*****
master_subnet_id = subnet-*****
vpc_security_group_id = sg-*****

[global]
cluster_template = galaxy-HPC
update_check = true
sanity_check = true

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}
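
One thing that can be checked mechanically in a config like the one above: in ParallelCluster 2.x, cluster_template in [global] names the [cluster ...] section to use, and as posted it says galaxy-HPC while the only cluster section is [cluster HPC]. That may simply be an artifact of retyping the config for this post. A minimal sketch of such a check follows; the function name check_cluster_template is made up for illustration, and the path in the usage comment assumes the default config location.

```shell
#!/bin/bash
# Hypothetical helper: verify that the section named by cluster_template exists.
check_cluster_template() {
    local config_file=$1
    local template
    # Accept both "key = value" and "key: value" spellings.
    template=$(sed -n 's/^cluster_template[[:space:]]*[=:][[:space:]]*//p' "$config_file" | head -n 1)
    # -F: fixed string, -x: the whole line must match the section header.
    if grep -qFx "[cluster ${template}]" "$config_file"; then
        echo "ok: [cluster ${template}] found"
    else
        echo "missing: [cluster ${template}]"
        return 1
    fi
}

# Usage (assumed default location; adjust if yours differs):
# check_cluster_template ~/.parallelcluster/config
```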

Additional context:

Currently I am only trying to get the stack to create successfully; there are no pre- or post-install scripts.
My security group allows inbound traffic on ports 22, 80, 443, 8080, 8081, and 8443. It also allows all traffic on any port from EC2 instances that share the same security group.
As of right now, my master node is created successfully; it is not terminated and reinitialized, so I can SSH in. In the compute fleet, however, all my EC2 instances are created, fail a health check, are terminated, and are replaced by new instances, over and over. Thus I cannot SSH into my compute nodes at the moment.

Attached are my log files as requested, as well as my jobwatcher, slurmctld, and sqswatcher logs. I would greatly appreciate any help!

cloud-init.txt cloud-init-output.txt cfn-init.txt jobwatcher.txt sqswatcher.txt slurmctld.txt

About this issue

State: closed
Created 5 years ago
Comments: 15 (7 by maintainers)

Most upvoted comments

Here is my unbound setup. First install unbound, then add the config file below:

apt-get install -y unbound

/etc/unbound/unbound.conf.d/CUSTOMDNS.conf
server:
        verbosity: 1
        ## Specify the interface address to listen on:
        interface: 127.0.0.1
        do-ip4: yes
        do-ip6: yes
        do-udp: yes
        do-tcp: yes
        do-daemonize: yes
        access-control: 127.0.0.0/8 allow
remote-control:
        control-enable: no
forward-zone:
        name: "."
        ## This is a special IP in AWS for DNS resolution
        # https://docs.aws.amazon.com/vpc/latest/userguide/vpc-dns.html
        # https://docs.aws.amazon.com/vpc/latest/userguide/VPC_DHCP_Options.html
        forward-addr: 169.254.169.253
forward-zone:
        name: "[INSERT CUSTOM DOMAIN]"
        ## Your custom IP addresses, and custom DNS
        forward-addr: [CUSTOM IP]
        forward-addr: [CUSTOM IP]

Here is my resolved.conf:

/etc/systemd/resolved.conf
#  This file is part of systemd.
#
#  systemd is free software; you can redistribute it and/or modify it
#  under the terms of the GNU Lesser General Public License as published by
#  the Free Software Foundation; either version 2.1 of the License, or
#  (at your option) any later version.
#
# Entries in this file show the compile time defaults.
# You can change settings by editing this file.
# Defaults can be restored by simply deleting this file.
#
# See resolved.conf(5) for details

[Resolve]
DNS=127.0.0.1
Domains=ec2.internal
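
Presumably the point of DNS=127.0.0.1 and Domains=ec2.internal is that nodes can resolve each other's private hostnames through the local unbound forwarder. A rough sketch of how one might check that, assuming the default IP-based private DNS names that EC2 assigns in us-east-1 (ip-a-b-c-d.ec2.internal). The helper name and the address 10.0.1.23 are placeholders, not values from this cluster.

```shell
#!/bin/bash
# Hypothetical helper: build the default us-east-1 private DNS name for an IPv4 address.
private_dns_name() {
    printf 'ip-%s.ec2.internal\n' "$(printf '%s' "$1" | tr '.' '-')"
}

name=$(private_dns_name 10.0.1.23)   # placeholder address
echo "$name"

# On a node, both the full and the short name should then resolve, e.g.:
#   getent hosts "$name"
#   getent hosts "${name%%.*}"
```

If the short name fails while the full name works, the search domain is the likely gap; if both fail, the forwarder itself is worth checking first.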

And finally I ran this script, provided by #597, to change /etc/resolv.conf:

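#!/bin/bash
# Prepend ec2.internal to the search domains in /etc/resolv.conf.

RESOLV=/etc/resolv.conf
RESOLV_ORIG=/etc/resolv.conf.orig

# Keep a copy of the original file, then rewrite it from that copy.
/bin/cp "$RESOLV" "$RESOLV_ORIG"

/bin/sed 's#search#search ec2.internal#' "$RESOLV_ORIG" > "$RESOLV"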
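
For reference, this is what the sed substitution does to a typical file. It is a small sketch on a made-up resolv.conf (the domain example.corp and the nameserver address are placeholders), so the real file is not touched:

```shell
#!/bin/bash
# Made-up resolv.conf contents; not taken from the cluster in this issue.
sample='search example.corp
nameserver 10.0.0.2'

# Same substitution as the script, applied to the sample instead of /etc/resolv.conf.
patched=$(printf '%s\n' "$sample" | sed 's#search#search ec2.internal#')
printf '%s\n' "$patched"
```

ec2.internal ends up first in the search list, ahead of the existing domains. The pattern is not anchored to the start of the line, so it would also rewrite any other line that contains the word search.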

jcpasion on Jul 22, 2019

@jcpasion If you want to write up your setup with unbound, we can post it on the GitHub wiki so everyone can benefit. Thanks!

sean-smith on Jul 22, 2019