aws-parallelcluster: compute instances fail health check in endless loop
Environment: aws-parallelcluster-2.4.1, centos7, sge; master: c5.9xlarge, compute: c5n.18xlarge
The compute nodes never come online: they repeatedly fail the start-up health check, are terminated, and the autoscaling group launches replacements in an endless loop. Here is the output from /var/log/sqswatcher on the master node:
2019-10-25 02:56:16,247 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 02:56:18,259 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 02:56:48,289 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 02:56:50,324 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 02:57:20,354 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 02:57:22,363 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 02:57:52,393 INFO [sqswatcher:_poll_queue] Refreshing cluster properties
2019-10-25 02:57:52,499 INFO [utils:get_asg_settings] min/desired/max 0/1/6
2019-10-25 02:57:52,564 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 02:57:54,574 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 02:58:24,604 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 02:58:26,613 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 02:58:56,643 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 02:58:58,703 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 1 messages from SQS queue
2019-10-25 02:58:58,739 ERROR [sqswatcher:_process_instance_terminate_event] Instance i-0012f6c570f00bcd9 not found in the database.
2019-10-25 02:58:58,739 WARNING [sqswatcher:_parse_sqs_messages] Discarding message sqs.Message(queue_url='https://queue.amazonaws.com/684353139040/parallelcluster-meredithk-test-efa-nohyper1-intel3-SQS-1Q2QW8X745LEM', receipt_handle='AQEBc+SA48ZuhUmx1xVpZQipj8SM8xXztziZmkdk1lQjLwNB+F2rGHWrbG2ZKDtvMG4VsI1ek2PgC9fcw/aY6+Q/Tt+0jEMzYZhrDtwqycJpKYdFJzWjY5/blVSNbuc1ZQTqi7QhxlKkySEZ/igX4uFTGgVoZxGw6SFrDzq9IWjn7yJ54ZyJN8rPIthi57QmkU5inlSwPV5pcj6oAftOMPzGxcxv56KoMlqmgof6RIIW66esYzm89d4zWewk+iAolrmtkzD4eJoZQS/jbQT0HMRTFMlt5ufT48WEaKt5WUyL86i6UgCwKtuINdyqi3e/CeUEtxdU+n9oPvpn2im8+vth8dzg1JughlcSsJAJCKohHSamTpSaOhd4DWOW9DOnvGpjl/KBBodTAsg/6073UEr2mE2B8Qbjir3Nt7hwVKJD8iED8YVsMp3SdAfdcyg7naR94n1sZdcvi/PTx7/3K3WY7g==')
2019-10-25 02:59:28,779 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 02:59:30,787 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 03:00:00,818 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 03:00:02,827 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 03:00:32,857 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 03:00:34,865 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 03:01:04,895 INFO [sqswatcher:_poll_queue] Refreshing cluster properties
2019-10-25 03:01:04,988 INFO [utils:get_asg_settings] min/desired/max 0/1/6
2019-10-25 03:01:05,057 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 03:01:07,065 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 03:01:37,095 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 03:01:39,183 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 1 messages from SQS queue
2019-10-25 03:01:39,218 ERROR [sqswatcher:_process_instance_terminate_event] Instance i-07a689ee5610c8c26 not found in the database.
2019-10-25 03:01:39,218 WARNING [sqswatcher:_parse_sqs_messages] Discarding message sqs.Message(queue_url='https://queue.amazonaws.com/684353139040/parallelcluster-meredithk-test-efa-nohyper1-intel3-SQS-1Q2QW8X745LEM', receipt_handle='AQEByn1myDFouo6xqD1TvU+X7fZdqAVZhrqalZ47BunjAEM4egT7VKB1bGmRvyhzge+1CdZTJUyL6iFp5e/2HeAqCmeObPsaxQkFIUr7IdoplLIEqhuufuCdo/k2Z2BwJlW5naxLgrkHdPXqXl0t/xx06fN3lEnsiC3e1mSwxRyPqk1vxtFGInr8zMLk4Y8FSok91AYXmfQ+sBwcL4xASfBoz9AU9tqqhQA2KzHZltOA891GAi/HIp+lAvYvqWqiG9g03m7iAMzNEtq4beBeqhb4jkTAi8MuziLh/7ggezcLH2H6C9W4En/pEKK98zPqQwKdBHrP4anvelEBas9AvsoBKkTotnbH1bdIrljIe9sJmZLXZTGeWh/26b1AgITyJ5W5anZtPoh4t/t2L+Q5P9yH4Y3n2pLtQxwXVzjrCdJ0txy8KcjglI3vSmsznkg3iVqV7N+dXY5RhXhrA+k+csEWkA==')
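For anyone triaging a similar loop: the "not found in the database" errors above carry the IDs of the instances that were terminated before they could register with the scheduler. A minimal sketch for pulling those IDs out of the log (assumes the default /var/log/sqswatcher path on the master):

```shell
# Collect the instance IDs that sqswatcher discarded as "not found in the
# database", i.e. compute nodes terminated before they could register.
# Usage: discarded_ids /var/log/sqswatcher
discarded_ids() {
  grep 'not found in the database' "$1" \
    | grep -oE 'i-[0-9a-f]+' \
    | sort -u
}
```

Feeding each of those IDs to `aws ec2 describe-instances` (or checking the ASG activity history in the console) should show why they were terminated.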
Here is the config file:
[aws]
aws_access_key_id = ###
aws_secret_access_key = ###
#aws_region_name = us-east-1
[cluster default]
key_name = fire
master_instance_type = c5.9xlarge
compute_instance_type = c5n.18xlarge
base_os = centos7
#cluster_type = spot
spot_price = 5
initial_queue_size = 0
maintain_initial_size = true
max_queue_size = 6
vpc_settings = poc_vpn
tags = {"user" : "meredithk"}
fsx_settings = custom_fs
ebs_settings = shared
efs_settings = customefs
placement_group = DYNAMIC
enable_efa = compute
# centos7
#post_install = s3://postinstallfmg/parallelcluster-postinstall-centos7-v1.sh
# alinux
#post_install = s3://postinstallfmg/parallelcluster-postinstall-v1.sh
extra_json = { "cfncluster" : { "cfn_scheduler_slots" : "cores" } }
master_root_volume_size = 50
[vpc poc_vpn]
vpc_id = vpc-b53f65d1
use_public_ips = false
# useast1a
#master_subnet_id = subnet-a6a304fe
# useast1b
master_subnet_id = subnet-0f2f6a50145d03c60
# additional_sg necessary for efs mounting
additional_sg = sg-0b39ee73
[global]
sanity_check = true
update_check = true
cluster_template = default
[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}
[fsx custom_fs]
shared_dir = /fsx
storage_capacity = 3600
imported_file_chunk_size = 1024
import_path = s3://fmglobal-virtual-fire-scenarios
[ebs shared]
shared_dir = /shared
volume_size = 2000
[efs customefs]
shared_dir = /efs
efs_fs_id = fs-4b70dd00
I’ve attached a screenshot of the autoscaling group from the AWS console.

About this issue
- State: closed
- Created 5 years ago
- Comments: 27 (13 by maintainers)
VPC with multiple CIDR blocks is now supported as part of v2.5.1: https://github.com/aws/aws-parallelcluster/releases/tag/v2.5.1
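Before upgrading, you can confirm that the VPC from the config above actually has multiple CIDR blocks associated. A quick check (assumes the AWS CLI is installed and credentialed; guarded so it no-ops where `aws` is unavailable):

```shell
# List every CIDR block associated with the cluster's VPC.
# vpc-b53f65d1 is the vpc_id from the config in this issue.
if command -v aws >/dev/null 2>&1; then
  aws ec2 describe-vpcs --vpc-ids vpc-b53f65d1 \
    --query 'Vpcs[0].CidrBlockAssociationSet[].CidrBlock' \
    --output text
fi
```

More than one CIDR in the output would mean the cluster hits the limitation that v2.5.1 lifts.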
This worked! Just FYI, on centos7 the restart command is
sudo systemctl restart nfs
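To verify the shares come back after the restart, one option is to query the export list on the master (assumes `showmount` from nfs-utils is installed; guarded so it no-ops elsewhere):

```shell
# Confirm the NFS server is exporting the shared directories again
# after "systemctl restart nfs" (run on the master node).
if command -v showmount >/dev/null 2>&1; then
  showmount -e localhost
fi
```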
Could we keep this issue open so that when the bug is officially fixed I get notified?