aws-parallelcluster: compute instances fail health check in endless loop
Environment: aws-parallelcluster-2.4.1, centos7, sge; master: c5.9xlarge, compute: c5n.18xlarge
The compute nodes never come online: they repeatedly fail the start-up health check, are terminated, and the autoscaling group launches replacements in an endless loop. Here is the output from /var/log/sqswatcher on the master node:
2019-10-25 02:56:16,247 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 02:56:18,259 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 02:56:48,289 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 02:56:50,324 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 02:57:20,354 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 02:57:22,363 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 02:57:52,393 INFO [sqswatcher:_poll_queue] Refreshing cluster properties
2019-10-25 02:57:52,499 INFO [utils:get_asg_settings] min/desired/max 0/1/6
2019-10-25 02:57:52,564 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 02:57:54,574 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 02:58:24,604 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 02:58:26,613 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 02:58:56,643 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 02:58:58,703 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 1 messages from SQS queue
2019-10-25 02:58:58,739 ERROR [sqswatcher:_process_instance_terminate_event] Instance i-0012f6c570f00bcd9 not found in the database.
2019-10-25 02:58:58,739 WARNING [sqswatcher:_parse_sqs_messages] Discarding message sqs.Message(queue_url='https://queue.amazonaws.com/684353139040/parallelcluster-meredithk-test-efa-nohyper1-intel3-SQS-1Q2QW8X745LEM', receipt_handle='AQEBc+SA48ZuhUmx1xVpZQipj8SM8xXztziZmkdk1lQjLwNB+F2rGHWrbG2ZKDtvMG4VsI1ek2PgC9fcw/aY6+Q/Tt+0jEMzYZhrDtwqycJpKYdFJzWjY5/blVSNbuc1ZQTqi7QhxlKkySEZ/igX4uFTGgVoZxGw6SFrDzq9IWjn7yJ54ZyJN8rPIthi57QmkU5inlSwPV5pcj6oAftOMPzGxcxv56KoMlqmgof6RIIW66esYzm89d4zWewk+iAolrmtkzD4eJoZQS/jbQT0HMRTFMlt5ufT48WEaKt5WUyL86i6UgCwKtuINdyqi3e/CeUEtxdU+n9oPvpn2im8+vth8dzg1JughlcSsJAJCKohHSamTpSaOhd4DWOW9DOnvGpjl/KBBodTAsg/6073UEr2mE2B8Qbjir3Nt7hwVKJD8iED8YVsMp3SdAfdcyg7naR94n1sZdcvi/PTx7/3K3WY7g==')
2019-10-25 02:59:28,779 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 02:59:30,787 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 03:00:00,818 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 03:00:02,827 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 03:00:32,857 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 03:00:34,865 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 03:01:04,895 INFO [sqswatcher:_poll_queue] Refreshing cluster properties
2019-10-25 03:01:04,988 INFO [utils:get_asg_settings] min/desired/max 0/1/6
2019-10-25 03:01:05,057 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 03:01:07,065 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 03:01:37,095 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 03:01:39,183 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 1 messages from SQS queue
2019-10-25 03:01:39,218 ERROR [sqswatcher:_process_instance_terminate_event] Instance i-07a689ee5610c8c26 not found in the database.
2019-10-25 03:01:39,218 WARNING [sqswatcher:_parse_sqs_messages] Discarding message sqs.Message(queue_url='https://queue.amazonaws.com/684353139040/parallelcluster-meredithk-test-efa-nohyper1-intel3-SQS-1Q2QW8X745LEM', receipt_handle='AQEByn1myDFouo6xqD1TvU+X7fZdqAVZhrqalZ47BunjAEM4egT7VKB1bGmRvyhzge+1CdZTJUyL6iFp5e/2HeAqCmeObPsaxQkFIUr7IdoplLIEqhuufuCdo/k2Z2BwJlW5naxLgrkHdPXqXl0t/xx06fN3lEnsiC3e1mSwxRyPqk1vxtFGInr8zMLk4Y8FSok91AYXmfQ+sBwcL4xASfBoz9AU9tqqhQA2KzHZltOA891GAi/HIp+lAvYvqWqiG9g03m7iAMzNEtq4beBeqhb4jkTAi8MuziLh/7ggezcLH2H6C9W4En/pEKK98zPqQwKdBHrP4anvelEBas9AvsoBKkTotnbH1bdIrljIe9sJmZLXZTGeWh/26b1AgITyJ5W5anZtPoh4t/t2L+Q5P9yH4Y3n2pLtQxwXVzjrCdJ0txy8KcjglI3vSmsznkg3iVqV7N+dXY5RhXhrA+k+csEWkA==')
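For anyone triaging a similar loop: the "not found in the database" errors above carry the IDs of the instances that were terminated before they could register with the scheduler. A minimal sketch for pulling those IDs out of the log (assumes the default /var/log/sqswatcher path on the master):

```shell
# Collect the instance IDs that sqswatcher discarded as "not found in the
# database", i.e. compute nodes terminated before they could register.
# Usage: discarded_ids /var/log/sqswatcher
discarded_ids() {
  grep 'not found in the database' "$1" \
    | grep -oE 'i-[0-9a-f]+' \
    | sort -u
}
```

Feeding each of those IDs to `aws ec2 describe-instances` (or checking the ASG activity history in the console) should show why they were terminated.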
Here is the config file:
[aws]
aws_access_key_id = ###
aws_secret_access_key = ###
#aws_region_name = us-east-1
[cluster default]
key_name = fire
master_instance_type = c5.9xlarge
compute_instance_type = c5n.18xlarge
base_os = centos7
#cluster_type = spot
spot_price = 5
initial_queue_size = 0
maintain_initial_size = true
max_queue_size = 6
vpc_settings = poc_vpn
tags = {"user" : "meredithk"}
fsx_settings = custom_fs
ebs_settings = shared
efs_settings = customefs
placement_group = DYNAMIC
enable_efa = compute
# centos7
#post_install = s3://postinstallfmg/parallelcluster-postinstall-centos7-v1.sh
# alinux
#post_install = s3://postinstallfmg/parallelcluster-postinstall-v1.sh
extra_json = { "cfncluster" : { "cfn_scheduler_slots" : "cores" } }
master_root_volume_size = 50
[vpc poc_vpn]
vpc_id = vpc-b53f65d1
use_public_ips = false
# useast1a
#master_subnet_id = subnet-a6a304fe
# useast1b
master_subnet_id = subnet-0f2f6a50145d03c60
# additional_sg necessary for efs mounting
additional_sg = sg-0b39ee73
[global]
sanity_check = true
update_check = true
cluster_template = default
[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}
[fsx custom_fs]
shared_dir = /fsx
storage_capacity = 3600
imported_file_chunk_size = 1024
import_path = s3://fmglobal-virtual-fire-scenarios
[ebs shared]
shared_dir = /shared
volume_size = 2000
[efs customefs]
shared_dir = /efs
efs_fs_id = fs-4b70dd00
I’ve attached a screenshot of the autoscaling group from the AWS console.

About this issue
- State: closed
- Created 5 years ago
- Comments: 27 (13 by maintainers)
VPC with multiple CIDR blocks is now supported as part of v2.5.1: https://github.com/aws/aws-parallelcluster/releases/tag/v2.5.1
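Before upgrading, you can confirm that the VPC from the config above actually has multiple CIDR blocks associated. A quick check (assumes the AWS CLI is installed and credentialed; guarded so it no-ops where `aws` is unavailable):

```shell
# List every CIDR block associated with the cluster's VPC.
# vpc-b53f65d1 is the vpc_id from the config in this issue.
if command -v aws >/dev/null 2>&1; then
  aws ec2 describe-vpcs --vpc-ids vpc-b53f65d1 \
    --query 'Vpcs[0].CidrBlockAssociationSet[].CidrBlock' \
    --output text
fi
```

More than one CIDR in the output would mean the cluster hits the limitation that v2.5.1 lifts.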
This worked! Just FYI, on centos7 the restart command is
sudo systemctl restart nfs
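To verify the shares come back after the restart, one option is to query the export list on the master (assumes `showmount` from nfs-utils is installed; guarded so it no-ops elsewhere):

```shell
# Confirm the NFS server is exporting the shared directories again
# after "systemctl restart nfs" (run on the master node).
if command -v showmount >/dev/null 2>&1; then
  showmount -e localhost
fi
```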
Could we keep this issue open so that when the bug is officially fixed I get notified?