aws-parallelcluster: Node failing to bootstrap when encrypted_ephemeral is set to true on Alinux2 and CentOS8

Required Info:

  • AWS ParallelCluster version: 2.10.1 (from the AMI used: amazon/aws-parallelcluster-2.10.1-amzn2-hvm-x86_64-202012221234)
  • Cluster name: parallelcluster-test-cluster12

Bug description and how to reproduce: The cluster uses a multi-queue, multi-instance-type setup (see the config below). I am able to use m4.xlarge and p3.2xlarge spot instances, but I am unable to use g4dn.xlarge instances (either on-demand or spot).

From the master node, I get these errors:

$ srun -p g4dn-nxlarge --constraint g4dn.xlarge --pty /bin/bash -l
srun: error: Node failure on g4dn-nxlarge-dy-g4dnxlarge-1
srun: Force Terminated job 19
srun: error: Job allocation 19 has been revoked

$ srun -p ondemand-g4dn-p3-m4 --constraint g4dn-xlarge-ondemand --pty /bin/bash -l
srun: error: Node failure on ondemand-g4dn-p3-m4-dy-g4dnxlarge-1
srun: Force Terminated job 20
srun: error: Job allocation 20 has been revoked
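
For reference, a minimal sketch of commands that can be used from the master node to see why Slurm marked the node down and to pull the bootstrap logs from the compute instance while it is still running (node name and IP are placeholders; log paths assume the standard ParallelCluster 2.x / cloud-init locations):

$ sinfo -R                                           # reason Slurm recorded for the down node
$ scontrol show node g4dn-nxlarge-dy-g4dnxlarge-1    # node state details
$ ssh <compute-node-private-ip> \
    'sudo tail -n 100 /var/log/cloud-init-output.log /var/log/cfn-init.log'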

The last (and relevant) lines of /var/log/parallelcluster/slurm_resume.log are below:

2021-01-29 14:13:24,952 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x1) ['g4dn-nxlarge-dy-g4dn2xlarge-1']
2021-01-29 14:13:24,953 - [slurm_plugin.resume:main] - INFO - ResumeProgram finished.
2021-01-29 14:24:08,784 - [slurm_plugin.resume:main] - INFO - ResumeProgram startup.
2021-01-29 14:24:08,785 - [slurm_plugin.resume:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf
2021-01-29 14:24:08,786 - [slurm_plugin.resume:_get_config] - INFO - SlurmResumeConfig(region='ca-central-1', cluster_name='test-cluster12', dynamodb_table='parallelcluster-test-cluster12', hosted_zone='Z079224910MUW6KTZJ62V', dns_domain='test-cluster12.pcluster', use_private_hostname=False, head_node_private_ip='MASKED', head_node_hostname='ip-MASKED.ca-central-1.compute.internal', max_batch_size=500, update_node_address=True, all_or_nothing_batch=False, instance_name_type_mapping={'c42xlarge': 'c4.2xlarge', 'c44xlarge': 'c4.4xlarge', 'c4large': 'c4.large', 'g4dn2xlarge': 'g4dn.2xlarge', 'g4dn8xlarge': 'g4dn.8xlarge', 'g4dnxlarge': 'g4dn.xlarge', 'm410xlarge': 'm4.10xlarge', 'm44xlarge': 'm4.4xlarge', 'm4xlarge': 'm4.xlarge', 'p32xlarge': 'p3.2xlarge', 'p316xlarge': 'p3.16xlarge', 'p38xlarge': 'p3.8xlarge'}, clustermgtd_timeout=300, clustermgtd_heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, boto3_config=<botocore.config.Config object at 0x154d9bf35668>, logging_config='/opt/parallelcluster/pyenv/versions/3.6.9/envs/node_virtualenv/lib/python3.6/site-packages/slurm_plugin/logging/parallelcluster_resume_logging.conf')
2021-01-29 14:24:08,788 - [slurm_plugin.resume:main] - INFO - ResumeProgram config: SlurmResumeConfig(region='ca-central-1', cluster_name='test-cluster12', dynamodb_table='parallelcluster-test-cluster12', hosted_zone='Z079224910MUW6KTZJ62V', dns_domain='test-cluster12.pcluster', use_private_hostname=False, head_node_private_ip='MASKED', head_node_hostname='ip-MASKED.ca-central-1.compute.internal', max_batch_size=500, update_node_address=True, all_or_nothing_batch=False, instance_name_type_mapping={'c42xlarge': 'c4.2xlarge', 'c44xlarge': 'c4.4xlarge', 'c4large': 'c4.large', 'g4dn2xlarge': 'g4dn.2xlarge', 'g4dn8xlarge': 'g4dn.8xlarge', 'g4dnxlarge': 'g4dn.xlarge', 'm410xlarge': 'm4.10xlarge', 'm44xlarge': 'm4.4xlarge', 'm4xlarge': 'm4.xlarge', 'p32xlarge': 'p3.2xlarge', 'p316xlarge': 'p3.16xlarge', 'p38xlarge': 'p3.8xlarge'}, clustermgtd_timeout=300, clustermgtd_heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, boto3_config=<botocore.config.Config object at 0x154d9bf35668>, logging_config='/opt/parallelcluster/pyenv/versions/3.6.9/envs/node_virtualenv/lib/python3.6/site-packages/slurm_plugin/logging/parallelcluster_resume_logging.conf')
2021-01-29 14:24:08,791 - [slurm_plugin.common:is_clustermgtd_heartbeat_valid] - INFO - Latest heartbeat from clustermgtd: 2021-01-29 14:23:29.605271+00:00
2021-01-29 14:24:08,791 - [slurm_plugin.resume:_resume] - INFO - Launching EC2 instances for the following Slurm nodes: m4-nxlarge-dy-m4xlarge-1
2021-01-29 14:24:08,926 - [slurm_plugin.common:add_instances_for_nodes] - INFO - Launching instances for slurm nodes (x1) ['m4-nxlarge-dy-m4xlarge-1']
2021-01-29 14:24:10,824 - [slurm_plugin.common:_update_slurm_node_addrs] - INFO - Nodes are now configured with instances: (x1) ["('m4-nxlarge-dy-m4xlarge-1', EC2Instance(id='i-MASKED', private_ip='MASKED', hostname='ip-MASKED', launch_time=datetime.datetime(2021, 1, 29, 14, 24, 10, tzinfo=tzlocal())))"]
2021-01-29 14:24:10,825 - [slurm_plugin.common:_store_assigned_hostnames] - INFO - Saving assigned hostnames in DynamoDB
2021-01-29 14:24:10,859 - [slurm_plugin.common:_store_assigned_hostnames] - INFO - Database update: COMPLETED
2021-01-29 14:24:10,859 - [slurm_plugin.common:_update_dns_hostnames] - INFO - Updating DNS records for Z079224910MUW6KTZJ62V - test-cluster12.pcluster
2021-01-29 14:24:11,133 - [slurm_plugin.common:_update_dns_hostnames] - INFO - DNS records update: COMPLETED
2021-01-29 14:24:11,134 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x1) ['m4-nxlarge-dy-m4xlarge-1']
2021-01-29 14:24:11,135 - [slurm_plugin.resume:main] - INFO - ResumeProgram finished.
2021-01-29 14:29:20,823 - [slurm_plugin.resume:main] - INFO - ResumeProgram startup.
2021-01-29 14:29:20,825 - [slurm_plugin.resume:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf
2021-01-29 14:29:20,826 - [slurm_plugin.resume:_get_config] - INFO - SlurmResumeConfig(region='ca-central-1', cluster_name='test-cluster12', dynamodb_table='parallelcluster-test-cluster12', hosted_zone='Z079224910MUW6KTZJ62V', dns_domain='test-cluster12.pcluster', use_private_hostname=False, head_node_private_ip='MASKED', head_node_hostname='ip-MASKED.ca-central-1.compute.internal', max_batch_size=500, update_node_address=True, all_or_nothing_batch=False, instance_name_type_mapping={'c42xlarge': 'c4.2xlarge', 'c44xlarge': 'c4.4xlarge', 'c4large': 'c4.large', 'g4dn2xlarge': 'g4dn.2xlarge', 'g4dn8xlarge': 'g4dn.8xlarge', 'g4dnxlarge': 'g4dn.xlarge', 'm410xlarge': 'm4.10xlarge', 'm44xlarge': 'm4.4xlarge', 'm4xlarge': 'm4.xlarge', 'p32xlarge': 'p3.2xlarge', 'p316xlarge': 'p3.16xlarge', 'p38xlarge': 'p3.8xlarge'}, clustermgtd_timeout=300, clustermgtd_heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, boto3_config=<botocore.config.Config object at 0x14a9ac828c50>, logging_config='/opt/parallelcluster/pyenv/versions/3.6.9/envs/node_virtualenv/lib/python3.6/site-packages/slurm_plugin/logging/parallelcluster_resume_logging.conf')
2021-01-29 14:29:20,827 - [slurm_plugin.resume:main] - INFO - ResumeProgram config: SlurmResumeConfig(region='ca-central-1', cluster_name='test-cluster12', dynamodb_table='parallelcluster-test-cluster12', hosted_zone='Z079224910MUW6KTZJ62V', dns_domain='test-cluster12.pcluster', use_private_hostname=False, head_node_private_ip='MASKED', head_node_hostname='ip-MASKED.ca-central-1.compute.internal', max_batch_size=500, update_node_address=True, all_or_nothing_batch=False, instance_name_type_mapping={'c42xlarge': 'c4.2xlarge', 'c44xlarge': 'c4.4xlarge', 'c4large': 'c4.large', 'g4dn2xlarge': 'g4dn.2xlarge', 'g4dn8xlarge': 'g4dn.8xlarge', 'g4dnxlarge': 'g4dn.xlarge', 'm410xlarge': 'm4.10xlarge', 'm44xlarge': 'm4.4xlarge', 'm4xlarge': 'm4.xlarge', 'p32xlarge': 'p3.2xlarge', 'p316xlarge': 'p3.16xlarge', 'p38xlarge': 'p3.8xlarge'}, clustermgtd_timeout=300, clustermgtd_heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, boto3_config=<botocore.config.Config object at 0x14a9ac828c50>, logging_config='/opt/parallelcluster/pyenv/versions/3.6.9/envs/node_virtualenv/lib/python3.6/site-packages/slurm_plugin/logging/parallelcluster_resume_logging.conf')
2021-01-29 14:29:20,830 - [slurm_plugin.common:is_clustermgtd_heartbeat_valid] - INFO - Latest heartbeat from clustermgtd: 2021-01-29 14:28:29.904754+00:00                                                                                                                                                                                                 
2021-01-29 14:29:20,830 - [slurm_plugin.resume:_resume] - INFO - Launching EC2 instances for the following Slurm nodes: g4dn-nxlarge-dy-g4dnxlarge-1
2021-01-29 14:29:20,916 - [slurm_plugin.common:add_instances_for_nodes] - INFO - Launching instances for slurm nodes (x1) ['g4dn-nxlarge-dy-g4dnxlarge-1']
2021-01-29 14:29:22,666 - [slurm_plugin.common:_update_slurm_node_addrs] - INFO - Nodes are now configured with instances: (x1) ["('g4dn-nxlarge-dy-g4dnxlarge-1', EC2Instance(id='i-008cda499d32e00f6', private_ip='MASKED', hostname='ip-MASKED', launch_time=datetime.datetime(2021, 1, 29, 14, 29, 22, tzinfo=tzlocal())))"]
2021-01-29 14:29:22,666 - [slurm_plugin.common:_store_assigned_hostnames] - INFO - Saving assigned hostnames in DynamoDB
2021-01-29 14:29:22,704 - [slurm_plugin.common:_store_assigned_hostnames] - INFO - Database update: COMPLETED
2021-01-29 14:29:22,704 - [slurm_plugin.common:_update_dns_hostnames] - INFO - Updating DNS records for Z079224910MUW6KTZJ62V - test-cluster12.pcluster
2021-01-29 14:29:22,992 - [slurm_plugin.common:_update_dns_hostnames] - INFO - DNS records update: COMPLETED
2021-01-29 14:29:22,994 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x1) ['g4dn-nxlarge-dy-g4dnxlarge-1']
2021-01-29 14:29:22,995 - [slurm_plugin.resume:main] - INFO - ResumeProgram finished.                                                                                                                                                                                                                                                                       
2021-01-29 14:39:21,905 - [slurm_plugin.resume:main] - INFO - ResumeProgram startup.                                                                                                                                                                                                                                                                        
2021-01-29 14:39:21,906 - [slurm_plugin.resume:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf                                                                                                                                                                                                            
2021-01-29 14:39:21,907 - [slurm_plugin.resume:_get_config] - INFO - SlurmResumeConfig(region='ca-central-1', cluster_name='test-cluster12', dynamodb_table='parallelcluster-test-cluster12', hosted_zone='Z079224910MUW6KTZJ62V', dns_domain='test-cluster12.pcluster', use_private_hostname=False, head_node_private_ip='MASKED', head_node_hostname='ip-MASKED.ca-central-1.compute.internal', max_batch_size=500, update_node_address=True, all_or_nothing_batch=False, instance_name_type_mapping={'c42xlarge': 'c4.2xlarge', 'c44xlarge': 'c4.4xlarge', 'c4large': 'c4.large', 'g4dn2xlarge': 'g4dn.2xlarge', 'g4dn8xlarge': 'g4dn.8xlarge', 'g4dnxlarge': 'g4dn.xlarge', 'm410xlarge': 'm4.10xlarge', 'm44xlarge': 'm4.4xlarge', 'm4xlarge': 'm4.xlarge', 'p32xlarge': 'p3.2xlarge', 'p316xlarge': 'p3.16xlarge', 'p38xlarge': 'p3.8xlarge'}, clustermgtd_timeout=300, clustermgtd_heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, boto3_config=<botocore.config.Config object at 0x14a3e0804c50>, logging_config='/opt/parallelcluster/pyenv/versions/3.6.9/envs/node_virtualenv/lib/python3.6/site-packages/slurm_plugin/logging/parallelcluster_resume_logging.conf')                                                                                                                                  
2021-01-29 14:39:21,909 - [slurm_plugin.resume:main] - INFO - ResumeProgram config: SlurmResumeConfig(region='ca-central-1', cluster_name='test-cluster12', dynamodb_table='parallelcluster-test-cluster12', hosted_zone='Z079224910MUW6KTZJ62V', dns_domain='test-cluster12.pcluster', use_private_hostname=False, head_node_private_ip='MASKED', head_node_hostname='MASKED.ca-central-1.compute.internal', max_batch_size=500, update_node_address=True, all_or_nothing_batch=False, instance_name_type_mapping={'c42xlarge': 'c4.2xlarge', 'c44xlarge': 'c4.4xlarge', 'c4large': 'c4.large', 'g4dn2xlarge': 'g4dn.2xlarge', 'g4dn8xlarge': 'g4dn.8xlarge', 'g4dnxlarge': 'g4dn.xlarge', 'm410xlarge': 'm4.10xlarge', 'm44xlarge': 'm4.4xlarge', 'm4xlarge': 'm4.xlarge', 'p32xlarge': 'p3.2xlarge', 'p316xlarge': 'p3.16xlarge', 'p38xlarge': 'p3.8xlarge'}, clustermgtd_timeout=300, clustermgtd_heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, boto3_config=<botocore.config.Config object at 0x14a3e0804c50>, logging_config='/opt/parallelcluster/pyenv/versions/3.6.9/envs/node_virtualenv/lib/python3.6/site-packages/slurm_plugin/logging/parallelcluster_resume_logging.conf')                                                                                                                   
2021-01-29 14:39:21,911 - [slurm_plugin.common:is_clustermgtd_heartbeat_valid] - INFO - Latest heartbeat from clustermgtd: 2021-01-29 14:38:30.272105+00:00                                                                                                                                                                                                 
2021-01-29 14:39:21,912 - [slurm_plugin.resume:_resume] - INFO - Launching EC2 instances for the following Slurm nodes: ondemand-g4dn-p3-m4-dy-g4dnxlarge-1                                                                                                                                                                                                 
2021-01-29 14:39:21,996 - [slurm_plugin.common:add_instances_for_nodes] - INFO - Launching instances for slurm nodes (x1) ['ondemand-g4dn-p3-m4-dy-g4dnxlarge-1']                                                                                                                                                                                           
2021-01-29 14:39:23,570 - [slurm_plugin.common:_update_slurm_node_addrs] - INFO - Nodes are now configured with instances: (x1) ["('ondemand-g4dn-p3-m4-dy-g4dnxlarge-1', EC2Instance(id='i-MASKED', private_ip='MASKED', hostname='MASKED', launch_time=datetime.datetime(2021, 1, 29, 14, 39, 23, tzinfo=tzlocal())))"]           
2021-01-29 14:39:23,570 - [slurm_plugin.common:_store_assigned_hostnames] - INFO - Saving assigned hostnames in DynamoDB                                                                                                                                                                                                                                    
2021-01-29 14:39:23,604 - [slurm_plugin.common:_store_assigned_hostnames] - INFO - Database update: COMPLETED                                                                                                                                                                                                                                               
2021-01-29 14:39:23,604 - [slurm_plugin.common:_update_dns_hostnames] - INFO - Updating DNS records for Z079224910MUW6KTZJ62V - test-cluster12.pcluster                                                                                                                                                                                                     
2021-01-29 14:39:23,954 - [slurm_plugin.common:_update_dns_hostnames] - INFO - DNS records update: COMPLETED                                                                                                                                                                                                                                                
2021-01-29 14:39:23,956 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x1) ['ondemand-g4dn-p3-m4-dy-g4dnxlarge-1']                                                                                                                                                                                                                   
2021-01-29 14:39:23,957 - [slurm_plugin.resume:main] - INFO - ResumeProgram finished.                                                                                                                                                                                                                                                                       
  • Logs collected from the head node: /var/log/parallelcluster/clustermgtd.log, /var/log/parallelcluster/slurm_resume.log, /var/log/parallelcluster/slurm_suspend.log, and /var/log/slurmctld.log

The config:

[aws]
aws_region_name = ca-central-1

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

[global]
cluster_template = default
update_check = true
sanity_check = true

[cluster default]
base_os = alinux2
key_name = parallel-cluster-MASKED
vpc_settings = default
efs_settings = awselasticfs
fsx_settings = lustrefs
s3_read_resource = *
s3_read_write_resource = *
scheduler = slurm
master_instance_type = m4.xlarge
encrypted_ephemeral = true
master_root_volume_size = 25   # GB
compute_root_volume_size = 25  # GB
queue_settings = m4-nxlarge, c4-nxlarge, g4dn-nxlarge, p3-nxlarge, ondemand-g4dn-p3-m4



[vpc default]
vpc_id = vpc-MASKED
master_subnet_id = subnet-MASKED
compute_subnet_id = subnet-MASKED

##########################
# Instance and Queue Setup
##########################
# Useful docs:
# instance types: https://aws.amazon.com/ec2/instance-types/
# spot instance pricing: https://aws.amazon.com/ec2/spot/pricing/

[scaling custom]
scaledown_idletime = 5

[queue ondemand-g4dn-p3-m4]
compute_resource_settings = g4dn-xlarge-ondemand, p3-2xlarge-ondemand, m4-4xlarge-ondemand
compute_type = ondemand

[queue m4-nxlarge]
compute_resource_settings = m4-xlarge, m4-4xlarge, m4-10xlarge
compute_type = spot

[queue c4-nxlarge]
compute_resource_settings = c4-large, c4-2xlarge, c4-4xlarge
compute_type = spot

[queue g4dn-nxlarge]
compute_resource_settings = g4dn-xlarge, g4dn-2xlarge, g4dn-8xlarge
compute_type = spot

[queue p3-nxlarge]
compute_resource_settings = p3-2xlarge, p3-8xlarge, p3-16xlarge
compute_type = spot

[queue m4-xlarge-spot]
compute_resource_settings = m4-xlarge-initzero-spot
compute_type = spot

# Compute Resources
[compute_resource m4-xlarge]
instance_type = m4.xlarge   # 4 cpu 16 GB
initial_count = 0
max_count = 20

[compute_resource m4-4xlarge-ondemand]
instance_type = m4.4xlarge  # 16 cpu 64 GB
initial_count = 0
max_count = 1 # use only for debugging

[compute_resource m4-4xlarge]
instance_type = m4.4xlarge  # 16 cpu 64 GB
initial_count = 0
max_count = 10

[compute_resource m4-10xlarge]
instance_type = m4.10xlarge
initial_count = 0
max_count = 4

[compute_resource c4-large]
instance_type = c4.large
initial_count = 0
max_count = 20

[compute_resource c4-2xlarge]
instance_type = c4.2xlarge
initial_count = 0
max_count = 10

[compute_resource c4-4xlarge]
instance_type = c4.4xlarge
initial_count = 0
max_count = 5

[compute_resource g4dn-xlarge-ondemand]
instance_type = g4dn.xlarge
initial_count = 0
max_count = 1


[compute_resource g4dn-xlarge]
instance_type = g4dn.xlarge
initial_count = 0
max_count = 20

[compute_resource g4dn-2xlarge]
instance_type = g4dn.2xlarge
initial_count = 0
max_count = 10

[compute_resource g4dn-4xlarge]
instance_type = g4dn.4xlarge
initial_count = 0
max_count = 5

[compute_resource g4dn-8xlarge]
instance_type = g4dn.8xlarge
initial_count = 0
max_count = 5


[compute_resource p3-2xlarge-ondemand]
instance_type = p3.2xlarge
initial_count = 0
max_count = 1

[compute_resource p3-2xlarge]
instance_type = p3.2xlarge
initial_count = 0
max_count = 5

[compute_resource p3-8xlarge]
instance_type = p3.8xlarge
initial_count = 0
max_count = 5

[compute_resource p3-16xlarge]
instance_type = p3.16xlarge
initial_count = 0
max_count = 5


#############
# FileSystems
#############
[efs awselasticfs]
shared_dir = /workspace
encrypted = true
efs_fs_id = fs-MASKED
performance_mode = generalPurpose

[fsx lustrefs]
shared_dir = /fsx
fsx_fs_id = fs-MASKED

Some lines from /var/log/parallelcluster/clustermgtd.log:

2021-01-29 14:38:24,745 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2021-01-29 14:38:29,824 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster
2021-01-29 14:38:29,923 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2021-01-29 14:38:30,128 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2021-01-29 14:38:30,128 - [slurm_plugin.clustermgtd:_handle_powering_down_nodes] - INFO - Resetting powering down nodes: (x1) ['m4-nxlarge-dy-m4xlarge-1(10.98.3.36)']
2021-01-29 14:38:30,139 - [slurm_plugin.clustermgtd:_handle_powering_down_nodes] - INFO - Terminating instances that are backing powering down nodes
2021-01-29 14:38:30,149 - [slurm_plugin.common:delete_instances] - INFO - Terminating instances (x1) ['i-068f928953d0b84ba']
2021-01-29 14:38:30,270 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2021-01-29 14:38:30,271 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance
2021-01-29 14:39:24,792 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2021-01-29 14:39:24,795 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2021-01-29 14:39:24,800 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING
2021-01-29 14:39:24,800 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2021-01-29 14:39:29,885 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster
2021-01-29 14:39:30,070 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2021-01-29 14:39:30,289 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2021-01-29 14:39:30,289 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2021-01-29 14:39:30,290 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance
2021-01-29 14:40:24,847 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2021-01-29 14:40:24,850 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2021-01-29 14:40:24,856 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING
2021-01-29 14:40:24,856 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2021-01-29 14:40:29,938 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster
2021-01-29 14:40:30,065 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2021-01-29 14:40:30,307 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2021-01-29 14:40:30,307 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2021-01-29 14:40:30,308 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance
2021-01-29 14:41:24,902 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2021-01-29 14:41:24,904 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2021-01-29 14:41:24,910 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING                                                                                                                                                                                                                
2021-01-29 14:41:24,910 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2021-01-29 14:41:29,993 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster
2021-01-29 14:41:30,068 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2021-01-29 14:41:30,248 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2021-01-29 14:41:30,248 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2021-01-29 14:41:30,249 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance
2021-01-29 14:42:24,953 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2021-01-29 14:42:24,955 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2021-01-29 14:42:24,960 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING                                                                                                                                                                                                                
2021-01-29 14:42:24,960 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler                                                                                                                                                                                                                                       
2021-01-29 14:42:30,041 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster                                                                                                                                                                                                                   
2021-01-29 14:42:30,136 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2021-01-29 14:42:30,318 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2021-01-29 14:42:30,319 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2021-01-29 14:42:30,319 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance
2021-01-29 14:43:24,978 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2021-01-29 14:43:24,980 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2021-01-29 14:43:24,985 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING
2021-01-29 14:43:24,985 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2021-01-29 14:43:30,065 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster                                                                                                                                                                                                                   
2021-01-29 14:43:30,188 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions                                                                                                                                                                                                                        
2021-01-29 14:43:30,448 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions                                                                                                                                                                                                                                           
2021-01-29 14:43:30,448 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []                                                                                                                                                                                                                         
2021-01-29 14:43:30,449 - [slurm_plugin.clustermgtd:_is_backing_instance_valid] - WARNING - Node state check: no corresponding instance in EC2 for node ondemand-g4dn-p3-m4-dy-g4dnxlarge-1(10.98.3.162)                                                                                                                                                    
2021-01-29 14:43:30,449 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Found the following unhealthy dynamic nodes: (x1) ['ondemand-g4dn-p3-m4-dy-g4dnxlarge-1(10.98.3.162)']                                                                                                                                                                        
2021-01-29 14:43:30,449 - [slurm_plugin.clustermgtd:_handle_unhealthy_dynamic_nodes] - INFO - Setting unhealthy dynamic nodes to down and power_down.                                                                                                                                                                                                       
2021-01-29 14:43:30,469 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance                                                                                                                                                                                                                                  
2021-01-29 14:44:25,021 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf                                                                                                                                                                                                        
2021-01-29 14:44:25,024 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...                                                                                                                                                                                                                                                            
2021-01-29 14:44:25,029 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING                                                                                                                                                                                                                
2021-01-29 14:44:25,029 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler                                                                                                                                                                                                                                       
2021-01-29 14:44:30,109 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster                                                                                                                                                                                                                   
2021-01-29 14:44:30,196 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions                                                                                                                                                                                                                        
2021-01-29 14:44:30,418 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions                                                                                                                                                                                                                                           
2021-01-29 14:44:30,418 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []                                                                                                                                                                                                                         
2021-01-29 14:44:30,419 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance                                                                                                                                                                                                                                  
2021-01-29 14:45:25,043 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf                                                                                                                                                                                                        
2021-01-29 14:45:25,046 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...                                                                                                                                                                                                                                                            
2021-01-29 14:45:25,053 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING         
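
Since clustermgtd later reports "no corresponding instance in EC2" for the node, it can also help to check from the AWS CLI what happened to the backing instance, assuming it has not been cleaned up yet (the instance ID below is a placeholder for the masked one in the logs):

$ aws ec2 describe-instances --region ca-central-1 --instance-ids i-XXXXXXXXXXXXXXXXX \
    --query 'Reservations[].Instances[].[InstanceId,State.Name,StateTransitionReason]'
$ aws ec2 get-console-output --region ca-central-1 --instance-id i-XXXXXXXXXXXXXXXXX --output text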

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 16 (8 by maintainers)

Most upvoted comments

There is a problem with g4dn and encrypted_ephemeral set to true. While I continue looking into this, if encrypted ephemeral storage isn’t a strict requirement for you, you can set encrypted_ephemeral to false as a workaround.
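
For anyone hitting the same issue, a minimal sketch of the workaround against the config shown above (whether the change can be applied in place with pcluster update or requires recreating the cluster may depend on your setup):

[cluster default]
# ... other settings unchanged ...
# Workaround: disable ephemeral-drive encryption until the g4dn bootstrap issue is fixed
encrypted_ephemeral = false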

I’m marking this as a bug. I confirm that encrypted_ephemeral = false is the best option to move forward at the moment. Please also note that g4dn instance store volumes are hardware-encrypted by default, so you should be OK.
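
As a quick sanity check after the change, one can confirm on a running g4dn compute node that the NVMe instance store is visible and mounted; a sketch assuming the default ephemeral_dir of /scratch (device name is illustrative):

$ lsblk            # the instance store appears as an extra NVMe device, e.g. nvme1n1
$ df -h /scratch   # default ephemeral_dir mount point in ParallelCluster 2.x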