aws-parallelcluster: ComputeFleet:Create fails in CloudFormation, but the compute Auto Scaling group is still created?

I am setting up a pcluster, and I have repeatedly gotten this error when I run create cluster:

  • AWS::CloudFormation::Stack parallelcluster-galaxy-HPC The following resource(s) failed to create: [ComputeFleet].
  • AWS::AutoScaling::AutoScalingGroup ComputeFleet Received 2 FAILURE signal(s) out of 2. Unable to satisfy 100% MinSuccessfulInstancesPercent requirement

This error is reproducible across changes to private and public subnets, OS, schedulers, cluster types, and groups. Sometimes I get 1 FAILURE signal and sometimes 2; I cannot predict which. Regardless, the resource is still created, and I can see it and its activity in the console.

Below is the environment I’m trying to use:

Environment:

AWS ParallelCluster version 2.4

[aws]
aws_region_name = us-east-1

[cluster HPC]
key_name = *****
vpc_settings = vpc
base_os: alinux
scheduler: slurm
initial_queue_size = 2
maintain_initial_size = true
placement_group = DYNAMIC
placement = compute
master_instance_type = c5.large
compute_instance_type = c5.large

[vpc vpc]
vpc_id: vpc-*****
master_subnet_id = subnet-*****
vpc_security_group_id = sg-*****

[global]
cluster_template = galaxy-HPC
update_check = true
sanity_check = true

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

Additional context:

  • Currently I am only trying to get the stack to create successfully; there are no pre- or post-install scripts.
  • My security group allows inbound traffic on ports 22, 80, 443, 8080, 8081, and 8443. It also allows all traffic on any port from EC2 instances sharing the same security group.
  • As of right now, my master node creates successfully; it is not terminated and reinitialized, so I can SSH in. In the compute fleet, however, all my EC2 instances are created, fail a health check, and are terminated, and new instances are created, over and over. Thus I cannot SSH into my compute nodes at the moment.

Attached are my log files as requested, as well as my jobwatcher, slurmctld, and sqswatcher logs. I would greatly appreciate any help!

cloud-init.txt cloud-init-output.txt cfn-init.txt jobwatcher.txt sqswatcher.txt slurmctld.txt

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 15 (7 by maintainers)

Most upvoted comments

Here is my unbound.conf:

apt-get install -y unbound

/etc/unbound/unbound.conf.d/CUSTOMDNS.conf
server:
        verbosity: 1
        ## Specify the interface address to listen on:
        interface: 127.0.0.1
        do-ip4: yes
        do-ip6: yes
        do-udp: yes
        do-tcp: yes
        do-daemonize: yes
        access-control: 127.0.0.0/8 allow
remote-control:
        control-enable: no
forward-zone:
        name: "."
        ## This is a special IP in AWS for DNS resolution
        # https://docs.aws.amazon.com/vpc/latest/userguide/vpc-dns.html
        # https://docs.aws.amazon.com/vpc/latest/userguide/VPC_DHCP_Options.html
        forward-addr: 169.254.169.253
forward-zone:
        name: "[INSERT CUSTOM DOMAIN]"
        ## Your custom IP addresses, and custom DNS
        forward-addr: [CUSTOM IP]
        forward-addr: [CUSTOM IP]
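
The bracketed values in the second forward-zone are placeholders that must be replaced before unbound is started. As a small sanity check, the sketch below lists any line that still contains one. `find_placeholders` is a hypothetical helper name, not part of unbound, and the pattern assumes placeholders are written in upper case inside square brackets, as they are above.

```shell
# find_placeholders FILE
# Print (with line numbers) every line that still holds an upper-case
# bracketed placeholder such as [CUSTOM IP]; return 1 if any is found.
find_placeholders() {
    # LC_ALL=C keeps [A-Z] limited to ASCII upper-case letters.
    if LC_ALL=C grep -n '\[[A-Z][A-Z ]*]' "$1"; then
        return 1
    fi
    return 0
}
```

For example, `find_placeholders /etc/unbound/unbound.conf.d/CUSTOMDNS.conf` should print nothing once the real domain and IP addresses are filled in. The `unbound-checkconf` tool that ships with unbound can then be used to check the syntax of the file.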

Here is my resolved.conf:

/etc/systemd/resolved.conf
#  This file is part of systemd.
#
#  systemd is free software; you can redistribute it and/or modify it
#  under the terms of the GNU Lesser General Public License as published by
#  the Free Software Foundation; either version 2.1 of the License, or
#  (at your option) any later version.
#
# Entries in this file show the compile time defaults.
# You can change settings by editing this file.
# Defaults can be restored by simply deleting this file.
#
# See resolved.conf(5) for details

[Resolve]
DNS=127.0.0.1
Domains=ec2.internal

And finally I ran this script, provided by #597, to change /etc/resolv.conf:

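#!/bin/bash
# Prepend ec2.internal to the search domains in /etc/resolv.conf,
# keeping a copy of the previous file as /etc/resolv.conf.orig.

RESOLV=/etc/resolv.conf
RESOLV_ORIG=/etc/resolv.conf.orig

# Back up the current file, then rewrite it from the backup.
/bin/cp "$RESOLV" "$RESOLV_ORIG"

/bin/sed 's#search#search ec2.internal#' "$RESOLV_ORIG" > "$RESOLV"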
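
Note that running the script above a second time would overwrite the backup with the already-modified file and add ec2.internal to the search line again. A minimal idempotent sketch follows, assuming a POSIX-style shell with awk, sed, and mktemp available. The function name `add_search_domain` and its arguments are hypothetical, and like the original it only handles a file that already has a search line.

```shell
# add_search_domain FILE DOMAIN
# Prepend DOMAIN to the "search" line of FILE unless it is already listed.
# The untouched file is saved once as FILE.orig and never overwritten.
add_search_domain() {
    local file domain tmp status
    file=$1
    domain=$2

    # Nothing to do if the search line already lists the domain.
    if awk -v d="$domain" '
        $1 == "search" { for (i = 2; i <= NF; i++) if ($i == d) found = 1 }
        END { exit !found }
    ' "$file"; then
        return 0
    fi

    # Keep the first backup only, so a rerun cannot clobber the original.
    [ -e "$file.orig" ] || cp "$file" "$file.orig"

    # Build the new content in a temporary file, then write it back through
    # FILE itself so that a symlinked resolv.conf stays a symlink.
    tmp=$(mktemp) || return 1
    sed "s#^search#search ${domain}#" "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

Run as root, `add_search_domain /etc/resolv.conf ec2.internal` would have the same effect as the script on the first call and do nothing on later calls.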

@jcpasion If you want to write up your setup with unbound, we can post it on the GitHub wiki so everyone can benefit. Thanks!