aws-parallelcluster: ComputeFleet:Create fails in CloudFormation, but Compute Auto Scaling Group is still created?
I am setting up a cluster with ParallelCluster, and I have repeatedly gotten this error when I run pcluster create:
- AWS::CloudFormation::Stack parallelcluster-galaxy-HPC The following resource(s) failed to create: [ComputeFleet].
- AWS::AutoScaling::AutoScalingGroup ComputeFleet Received 2 FAILURE signal(s) out of 2. Unable to satisfy 100% MinSuccessfulInstancesPercent requirement
The error is reproducible across various changes to private and public subnets, OS, scheduler, cluster type, and groups. Sometimes I get 1 FAILURE signal and sometimes 2; I cannot predict which case will happen. Regardless, the resource is still created and I can see the resource, and its activity, in the console.
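In case it helps with diagnosis, here is a minimal sketch (assuming boto3 and the stack name shown in the error above) for pulling the failure events out of CloudFormation, which is where the "Received N FAILURE signal(s)" reason for ComputeFleet shows up:

```python
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

# List recent events for the cluster stack and print only the failed ones,
# including the ResourceStatusReason that explains why ComputeFleet failed.
events = cfn.describe_stack_events(StackName="parallelcluster-galaxy-HPC")["StackEvents"]
for event in events:
    if "FAILED" in event["ResourceStatus"]:
        print(event["Timestamp"], event["LogicalResourceId"],
              event["ResourceStatus"], event.get("ResourceStatusReason", ""))
```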
Below is the environment I’m trying to use:
Environment:
AWS ParallelCluster version 2.4
[aws]
aws_region_name = us-east-1
[cluster HPC]
key_name = *****
vpc_settings = vpc
base_os = alinux
scheduler = slurm
initial_queue_size = 2
maintain_initial_size = true
placement_group = DYNAMIC
placement = compute
master_instance_type = c5.large
compute_instance_type = c5.large
[vpc vpc]
vpc_id = vpc-*****
master_subnet_id = subnet-*****
vpc_security_group_id = sg-*****
[global]
cluster_template = galaxy-HPC
update_check = true
sanity_check = true
[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}
Additional context:
- Currently I am only trying to get the stack to create successfully; there are no pre- or post-install scripts.
- My security group allows inbound traffic on ports 22, 80, 443, 8080, 8081, and 8443. It also allows all traffic on any port coming from EC2 instances sharing the same security group.
- As of right now, my Master node creates successfully; it is not terminated and re-initialized, so I can SSH in. In the compute fleet, however, my EC2 instances are created, fail a health check, are terminated, and new instances are launched, over and over, so I cannot SSH into my compute nodes at the moment (see the sketch after this list).
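To watch that launch/terminate churn, here is a minimal sketch (again assuming boto3; the Auto Scaling group name is looked up from the ComputeFleet resource of the stack named above) that lists recent scaling activities for the compute fleet:

```python
import boto3

region = "us-east-1"
cfn = boto3.client("cloudformation", region_name=region)
asg = boto3.client("autoscaling", region_name=region)

# Resolve the ComputeFleet logical resource to the actual Auto Scaling group name.
detail = cfn.describe_stack_resource(
    StackName="parallelcluster-galaxy-HPC", LogicalResourceId="ComputeFleet"
)["StackResourceDetail"]
group_name = detail["PhysicalResourceId"]

# Recent scaling activities show the launch -> unhealthy -> terminate cycle.
activities = asg.describe_scaling_activities(AutoScalingGroupName=group_name)["Activities"]
for activity in activities[:10]:
    print(activity["StartTime"], activity["StatusCode"], activity["Description"])
```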
Attached are my cloud-init and cfn-init logs as requested, as well as my jobwatcher, slurmctld, and sqswatcher logs. I would greatly appreciate any help!
cloud-init.txt cloud-init-output.txt cfn-init.txt jobwatcher.txt sqswatcher.txt slurmctld.txt
Here is my unbound.conf:
Here is my resolved.conf:
And finally I ran this script, provided by #597, to change /etc/resolv.conf.
@jcpasion If you want to write up your setup with unbound, we can post it on the GitHub wiki so everyone can benefit. Thanks!