aws-parallelcluster: Node failing to bootstrap when encrypted_ephemeral is set to true on Alinux2 and CentOS8

Required Info:

  • AWS ParallelCluster version: 2.10.1 (from the AMI used: amazon/aws-parallelcluster-2.10.1-amzn2-hvm-x86_64-202012221234)
  • Cluster name: parallelcluster-test-cluster12

Bug description and how to reproduce: The cluster uses a multi-queue, multi-instance-type setup (see the config below). I am able to use m4.xlarge and p3.2xlarge spot instances, but I am unable to use g4dn.xlarge instances (either on-demand or spot).

From the master node, I get these errors:

$ srun -p g4dn-nxlarge --constraint g4dn.xlarge --pty /bin/bash -l
srun: error: Node failure on g4dn-nxlarge-dy-g4dnxlarge-1
srun: Force Terminated job 19
srun: error: Job allocation 19 has been revoked

$ srun -p ondemand-g4dn-p3-m4 --constraint g4dn-xlarge-ondemand --pty /bin/bash -l
srun: error: Node failure on ondemand-g4dn-p3-m4-dy-g4dnxlarge-1
srun: Force Terminated job 20
srun: error: Job allocation 20 has been revoked
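
For reference, a minimal sketch of commands that can be used from the master node to see why Slurm marked the node down and to pull the bootstrap logs from the compute instance while it is still running (node name and IP are placeholders; log paths assume the standard ParallelCluster 2.x / cloud-init locations):

$ sinfo -R                                           # reason Slurm recorded for the down node
$ scontrol show node g4dn-nxlarge-dy-g4dnxlarge-1    # node state details
$ ssh <compute-node-private-ip> \
    'sudo tail -n 100 /var/log/cloud-init-output.log /var/log/cfn-init.log'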

The last (and relevant) lines of /var/log/parallelcluster/slurm_resume.log are below:

2021-01-29 14:13:24,952 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x1) ['g4dn-nxlarge-dy-g4dn2xlarge-1']
2021-01-29 14:13:24,953 - [slurm_plugin.resume:main] - INFO - ResumeProgram finished.
2021-01-29 14:24:08,784 - [slurm_plugin.resume:main] - INFO - ResumeProgram startup.
2021-01-29 14:24:08,785 - [slurm_plugin.resume:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf
2021-01-29 14:24:08,786 - [slurm_plugin.resume:_get_config] - INFO - SlurmResumeConfig(region='ca-central-1', cluster_name='test-cluster12', dynamodb_table='parallelcluster-test-cluster12', hosted_zone='Z079224910MUW6KTZJ62V', dns_domain='test-cluster12.pcluster', use_private_hostname=False, head_node_private_ip='MASKED', head_node_hostname='ip-MASKED.ca-central-1.compute.internal', max_batch_size=500, update_node_address=True, all_or_nothing_batch=False, instance_name_type_mapping={'c42xlarge': 'c4.2xlarge', 'c44xlarge': 'c4.4xlarge', 'c4large': 'c4.large', 'g4dn2xlarge': 'g4dn.2xlarge', 'g4dn8xlarge': 'g4dn.8xlarge', 'g4dnxlarge': 'g4dn.xlarge', 'm410xlarge': 'm4.10xlarge', 'm44xlarge': 'm4.4xlarge', 'm4xlarge': 'm4.xlarge', 'p32xlarge': 'p3.2xlarge', 'p316xlarge': 'p3.16xlarge', 'p38xlarge': 'p3.8xlarge'}, clustermgtd_timeout=300, clustermgtd_heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, boto3_config=<botocore.config.Config object at 0x154d9bf35668>, logging_config='/opt/parallelcluster/pyenv/versions/3.6.9/envs/node_virtualenv/lib/python3.6/site-packages/slurm_plugin/logging/parallelcluster_resume_logging.conf')
2021-01-29 14:24:08,788 - [slurm_plugin.resume:main] - INFO - ResumeProgram config: SlurmResumeConfig(region='ca-central-1', cluster_name='test-cluster12', dynamodb_table='parallelcluster-test-cluster12', hosted_zone='Z079224910MUW6KTZJ62V', dns_domain='test-cluster12.pcluster', use_private_hostname=False, head_node_private_ip='MASKED', head_node_hostname='ip-MASKED.ca-central-1.compute.internal', max_batch_size=500, update_node_address=True, all_or_nothing_batch=False, instance_name_type_mapping={'c42xlarge': 'c4.2xlarge', 'c44xlarge': 'c4.4xlarge', 'c4large': 'c4.large', 'g4dn2xlarge': 'g4dn.2xlarge', 'g4dn8xlarge': 'g4dn.8xlarge', 'g4dnxlarge': 'g4dn.xlarge', 'm410xlarge': 'm4.10xlarge', 'm44xlarge': 'm4.4xlarge', 'm4xlarge': 'm4.xlarge', 'p32xlarge': 'p3.2xlarge', 'p316xlarge': 'p3.16xlarge', 'p38xlarge': 'p3.8xlarge'}, clustermgtd_timeout=300, clustermgtd_heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, boto3_config=<botocore.config.Config object at 0x154d9bf35668>, logging_config='/opt/parallelcluster/pyenv/versions/3.6.9/envs/node_virtualenv/lib/python3.6/site-packages/slurm_plugin/logging/parallelcluster_resume_logging.conf')
2021-01-29 14:24:08,791 - [slurm_plugin.common:is_clustermgtd_heartbeat_valid] - INFO - Latest heartbeat from clustermgtd: 2021-01-29 14:23:29.605271+00:00
2021-01-29 14:24:08,791 - [slurm_plugin.resume:_resume] - INFO - Launching EC2 instances for the following Slurm nodes: m4-nxlarge-dy-m4xlarge-1
2021-01-29 14:24:08,926 - [slurm_plugin.common:add_instances_for_nodes] - INFO - Launching instances for slurm nodes (x1) ['m4-nxlarge-dy-m4xlarge-1']
2021-01-29 14:24:10,824 - [slurm_plugin.common:_update_slurm_node_addrs] - INFO - Nodes are now configured with instances: (x1) ["('m4-nxlarge-dy-m4xlarge-1', EC2Instance(id='i-MASKED', private_ip='MASKED', hostname='ip-MASKED', launch_time=datetime.datetime(2021, 1, 29, 14, 24, 10, tzinfo=tzlocal())))"]
2021-01-29 14:24:10,825 - [slurm_plugin.common:_store_assigned_hostnames] - INFO - Saving assigned hostnames in DynamoDB
2021-01-29 14:24:10,859 - [slurm_plugin.common:_store_assigned_hostnames] - INFO - Database update: COMPLETED
2021-01-29 14:24:10,859 - [slurm_plugin.common:_update_dns_hostnames] - INFO - Updating DNS records for Z079224910MUW6KTZJ62V - test-cluster12.pcluster
2021-01-29 14:24:11,133 - [slurm_plugin.common:_update_dns_hostnames] - INFO - DNS records update: COMPLETED
2021-01-29 14:24:11,134 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x1) ['m4-nxlarge-dy-m4xlarge-1']
2021-01-29 14:24:11,135 - [slurm_plugin.resume:main] - INFO - ResumeProgram finished.
2021-01-29 14:29:20,823 - [slurm_plugin.resume:main] - INFO - ResumeProgram startup.
2021-01-29 14:29:20,825 - [slurm_plugin.resume:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf
2021-01-29 14:29:20,826 - [slurm_plugin.resume:_get_config] - INFO - SlurmResumeConfig(region='ca-central-1', cluster_name='test-cluster12', dynamodb_table='parallelcluster-test-cluster12', hosted_zone='Z079224910MUW6KTZJ62V', dns_domain='test-cluster12.pcluster', use_private_hostname=False, head_node_private_ip='MASKED', head_node_hostname='ip-MASKED.ca-central-1.compute.internal', max_batch_size=500, update_node_address=True, all_or_nothing_batch=False, instance_name_type_mapping={'c42xlarge': 'c4.2xlarge', 'c44xlarge': 'c4.4xlarge', 'c4large': 'c4.large', 'g4dn2xlarge': 'g4dn.2xlarge', 'g4dn8xlarge': 'g4dn.8xlarge', 'g4dnxlarge': 'g4dn.xlarge', 'm410xlarge': 'm4.10xlarge', 'm44xlarge': 'm4.4xlarge', 'm4xlarge': 'm4.xlarge', 'p32xlarge': 'p3.2xlarge', 'p316xlarge': 'p3.16xlarge', 'p38xlarge': 'p3.8xlarge'}, clustermgtd_timeout=300, clustermgtd_heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, boto3_config=<botocore.config.Config object at 0x14a9ac828c50>, logging_config='/opt/parallelcluster/pyenv/versions/3.6.9/envs/node_virtualenv/lib/python3.6/site-packages/slurm_plugin/logging/parallelcluster_resume_logging.conf')
2021-01-29 14:29:20,827 - [slurm_plugin.resume:main] - INFO - ResumeProgram config: SlurmResumeConfig(region='ca-central-1', cluster_name='test-cluster12', dynamodb_table='parallelcluster-test-cluster12', hosted_zone='Z079224910MUW6KTZJ62V', dns_domain='test-cluster12.pcluster', use_private_hostname=False, head_node_private_ip='MASKED', head_node_hostname='ip-MASKED.ca-central-1.compute.internal', max_batch_size=500, update_node_address=True, all_or_nothing_batch=False, instance_name_type_mapping={'c42xlarge': 'c4.2xlarge', 'c44xlarge': 'c4.4xlarge', 'c4large': 'c4.large', 'g4dn2xlarge': 'g4dn.2xlarge', 'g4dn8xlarge': 'g4dn.8xlarge', 'g4dnxlarge': 'g4dn.xlarge', 'm410xlarge': 'm4.10xlarge', 'm44xlarge': 'm4.4xlarge', 'm4xlarge': 'm4.xlarge', 'p32xlarge': 'p3.2xlarge', 'p316xlarge': 'p3.16xlarge', 'p38xlarge': 'p3.8xlarge'}, clustermgtd_timeout=300, clustermgtd_heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, boto3_config=<botocore.config.Config object at 0x14a9ac828c50>, logging_config='/opt/parallelcluster/pyenv/versions/3.6.9/envs/node_virtualenv/lib/python3.6/site-packages/slurm_plugin/logging/parallelcluster_resume_logging.conf')
2021-01-29 14:29:20,830 - [slurm_plugin.common:is_clustermgtd_heartbeat_valid] - INFO - Latest heartbeat from clustermgtd: 2021-01-29 14:28:29.904754+00:00                                                                                                                                                                                                 
2021-01-29 14:29:20,830 - [slurm_plugin.resume:_resume] - INFO - Launching EC2 instances for the following Slurm nodes: g4dn-nxlarge-dy-g4dnxlarge-1
2021-01-29 14:29:20,916 - [slurm_plugin.common:add_instances_for_nodes] - INFO - Launching instances for slurm nodes (x1) ['g4dn-nxlarge-dy-g4dnxlarge-1']
2021-01-29 14:29:22,666 - [slurm_plugin.common:_update_slurm_node_addrs] - INFO - Nodes are now configured with instances: (x1) ["('g4dn-nxlarge-dy-g4dnxlarge-1', EC2Instance(id='i-008cda499d32e00f6', private_ip='MASKED', hostname='ip-MASKED', launch_time=datetime.datetime(2021, 1, 29, 14, 29, 22, tzinfo=tzlocal())))"]
2021-01-29 14:29:22,666 - [slurm_plugin.common:_store_assigned_hostnames] - INFO - Saving assigned hostnames in DynamoDB
2021-01-29 14:29:22,704 - [slurm_plugin.common:_store_assigned_hostnames] - INFO - Database update: COMPLETED
2021-01-29 14:29:22,704 - [slurm_plugin.common:_update_dns_hostnames] - INFO - Updating DNS records for Z079224910MUW6KTZJ62V - test-cluster12.pcluster
2021-01-29 14:29:22,992 - [slurm_plugin.common:_update_dns_hostnames] - INFO - DNS records update: COMPLETED
2021-01-29 14:29:22,994 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x1) ['g4dn-nxlarge-dy-g4dnxlarge-1']
2021-01-29 14:29:22,995 - [slurm_plugin.resume:main] - INFO - ResumeProgram finished.                                                                                                                                                                                                                                                                       
2021-01-29 14:39:21,905 - [slurm_plugin.resume:main] - INFO - ResumeProgram startup.                                                                                                                                                                                                                                                                        
2021-01-29 14:39:21,906 - [slurm_plugin.resume:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf                                                                                                                                                                                                            
2021-01-29 14:39:21,907 - [slurm_plugin.resume:_get_config] - INFO - SlurmResumeConfig(region='ca-central-1', cluster_name='test-cluster12', dynamodb_table='parallelcluster-test-cluster12', hosted_zone='Z079224910MUW6KTZJ62V', dns_domain='test-cluster12.pcluster', use_private_hostname=False, head_node_private_ip='MASKED', head_node_hostname='ip-MASKED.ca-central-1.compute.internal', max_batch_size=500, update_node_address=True, all_or_nothing_batch=False, instance_name_type_mapping={'c42xlarge': 'c4.2xlarge', 'c44xlarge': 'c4.4xlarge', 'c4large': 'c4.large', 'g4dn2xlarge': 'g4dn.2xlarge', 'g4dn8xlarge': 'g4dn.8xlarge', 'g4dnxlarge': 'g4dn.xlarge', 'm410xlarge': 'm4.10xlarge', 'm44xlarge': 'm4.4xlarge', 'm4xlarge': 'm4.xlarge', 'p32xlarge': 'p3.2xlarge', 'p316xlarge': 'p3.16xlarge', 'p38xlarge': 'p3.8xlarge'}, clustermgtd_timeout=300, clustermgtd_heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, boto3_config=<botocore.config.Config object at 0x14a3e0804c50>, logging_config='/opt/parallelcluster/pyenv/versions/3.6.9/envs/node_virtualenv/lib/python3.6/site-packages/slurm_plugin/logging/parallelcluster_resume_logging.conf')                                                                                                                                  
2021-01-29 14:39:21,909 - [slurm_plugin.resume:main] - INFO - ResumeProgram config: SlurmResumeConfig(region='ca-central-1', cluster_name='test-cluster12', dynamodb_table='parallelcluster-test-cluster12', hosted_zone='Z079224910MUW6KTZJ62V', dns_domain='test-cluster12.pcluster', use_private_hostname=False, head_node_private_ip='MASKED', head_node_hostname='MASKED.ca-central-1.compute.internal', max_batch_size=500, update_node_address=True, all_or_nothing_batch=False, instance_name_type_mapping={'c42xlarge': 'c4.2xlarge', 'c44xlarge': 'c4.4xlarge', 'c4large': 'c4.large', 'g4dn2xlarge': 'g4dn.2xlarge', 'g4dn8xlarge': 'g4dn.8xlarge', 'g4dnxlarge': 'g4dn.xlarge', 'm410xlarge': 'm4.10xlarge', 'm44xlarge': 'm4.4xlarge', 'm4xlarge': 'm4.xlarge', 'p32xlarge': 'p3.2xlarge', 'p316xlarge': 'p3.16xlarge', 'p38xlarge': 'p3.8xlarge'}, clustermgtd_timeout=300, clustermgtd_heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, boto3_config=<botocore.config.Config object at 0x14a3e0804c50>, logging_config='/opt/parallelcluster/pyenv/versions/3.6.9/envs/node_virtualenv/lib/python3.6/site-packages/slurm_plugin/logging/parallelcluster_resume_logging.conf')                                                                                                                   
2021-01-29 14:39:21,911 - [slurm_plugin.common:is_clustermgtd_heartbeat_valid] - INFO - Latest heartbeat from clustermgtd: 2021-01-29 14:38:30.272105+00:00                                                                                                                                                                                                 
2021-01-29 14:39:21,912 - [slurm_plugin.resume:_resume] - INFO - Launching EC2 instances for the following Slurm nodes: ondemand-g4dn-p3-m4-dy-g4dnxlarge-1                                                                                                                                                                                                 
2021-01-29 14:39:21,996 - [slurm_plugin.common:add_instances_for_nodes] - INFO - Launching instances for slurm nodes (x1) ['ondemand-g4dn-p3-m4-dy-g4dnxlarge-1']                                                                                                                                                                                           
2021-01-29 14:39:23,570 - [slurm_plugin.common:_update_slurm_node_addrs] - INFO - Nodes are now configured with instances: (x1) ["('ondemand-g4dn-p3-m4-dy-g4dnxlarge-1', EC2Instance(id='i-MASKED', private_ip='MASKED', hostname='MASKED', launch_time=datetime.datetime(2021, 1, 29, 14, 39, 23, tzinfo=tzlocal())))"]           
2021-01-29 14:39:23,570 - [slurm_plugin.common:_store_assigned_hostnames] - INFO - Saving assigned hostnames in DynamoDB                                                                                                                                                                                                                                    
2021-01-29 14:39:23,604 - [slurm_plugin.common:_store_assigned_hostnames] - INFO - Database update: COMPLETED                                                                                                                                                                                                                                               
2021-01-29 14:39:23,604 - [slurm_plugin.common:_update_dns_hostnames] - INFO - Updating DNS records for Z079224910MUW6KTZJ62V - test-cluster12.pcluster                                                                                                                                                                                                     
2021-01-29 14:39:23,954 - [slurm_plugin.common:_update_dns_hostnames] - INFO - DNS records update: COMPLETED                                                                                                                                                                                                                                                
2021-01-29 14:39:23,956 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x1) ['ondemand-g4dn-p3-m4-dy-g4dnxlarge-1']                                                                                                                                                                                                                   
2021-01-29 14:39:23,957 - [slurm_plugin.resume:main] - INFO - ResumeProgram finished.                                                                                                                                                                                                                                                                       
  • Logs collected from the head node: /var/log/parallelcluster/clustermgtd.log, /var/log/parallelcluster/slurm_resume.log, /var/log/parallelcluster/slurm_suspend.log, and /var/log/slurmctld.log

The config:

[aws]
aws_region_name = ca-central-1

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

[global]
cluster_template = default
update_check = true
sanity_check = true

[cluster default]
base_os = alinux2
key_name = parallel-cluster-MASKED
vpc_settings = default
efs_settings = awselasticfs
fsx_settings = lustrefs
s3_read_resource = *
s3_read_write_resource = *
scheduler = slurm
master_instance_type = m4.xlarge
encrypted_ephemeral = true
master_root_volume_size = 25   # GB
compute_root_volume_size = 25  # GB
queue_settings = m4-nxlarge, c4-nxlarge, g4dn-nxlarge, p3-nxlarge, ondemand-g4dn-p3-m4



[vpc default]
vpc_id = vpc-MASKED
master_subnet_id = subnet-MASKED
compute_subnet_id = subnet-MASKED

##########################
# Instance and Queue Setup
##########################
# Useful docs:
# instance types: https://aws.amazon.com/ec2/instance-types/
# spot instance pricing: https://aws.amazon.com/ec2/spot/pricing/

[scaling custom]
scaledown_idletime = 5

[queue ondemand-g4dn-p3-m4]
compute_resource_settings = g4dn-xlarge-ondemand, p3-2xlarge-ondemand, m4-4xlarge-ondemand
compute_type = ondemand

[queue m4-nxlarge]
compute_resource_settings = m4-xlarge, m4-4xlarge, m4-10xlarge
compute_type = spot

[queue c4-nxlarge]
compute_resource_settings = c4-large, c4-2xlarge, c4-4xlarge
compute_type = spot

[queue g4dn-nxlarge]
compute_resource_settings = g4dn-xlarge, g4dn-2xlarge, g4dn-8xlarge
compute_type = spot

[queue p3-nxlarge]
compute_resource_settings = p3-2xlarge, p3-8xlarge, p3-16xlarge
compute_type = spot

[queue m4-xlarge-spot]
compute_resource_settings = m4-xlarge-initzero-spot
compute_type = spot

# Compute Resources
[compute_resource m4-xlarge]
instance_type = m4.xlarge   # 4 cpu 16 GB
initial_count = 0
max_count = 20

[compute_resource m4-4xlarge-ondemand]
instance_type = m4.4xlarge  # 16 cpu 64 GB
initial_count = 0
max_count = 1 # use only for debugging

[compute_resource m4-4xlarge]
instance_type = m4.4xlarge  # 16 cpu 64 GB
initial_count = 0
max_count = 10

[compute_resource m4-10xlarge]
instance_type = m4.10xlarge
initial_count = 0
max_count = 4

[compute_resource c4-large]
instance_type = c4.large
initial_count = 0
max_count = 20

[compute_resource c4-2xlarge]
instance_type = c4.2xlarge
initial_count = 0
max_count = 10

[compute_resource c4-4xlarge]
instance_type = c4.4xlarge
initial_count = 0
max_count = 5

[compute_resource g4dn-xlarge-ondemand]
instance_type = g4dn.xlarge
initial_count = 0
max_count = 1


[compute_resource g4dn-xlarge]
instance_type = g4dn.xlarge
initial_count = 0
max_count = 20

[compute_resource g4dn-2xlarge]
instance_type = g4dn.2xlarge
initial_count = 0
max_count = 10

[compute_resource g4dn-4xlarge]
instance_type = g4dn.4xlarge
initial_count = 0
max_count = 5

[compute_resource g4dn-8xlarge]
instance_type = g4dn.8xlarge
initial_count = 0
max_count = 5


[compute_resource p3-2xlarge-ondemand]
instance_type = p3.2xlarge
initial_count = 0
max_count = 1

[compute_resource p3-2xlarge]
instance_type = p3.2xlarge
initial_count = 0
max_count = 5

[compute_resource p3-8xlarge]
instance_type = p3.8xlarge
initial_count = 0
max_count = 5

[compute_resource p3-16xlarge]
instance_type = p3.16xlarge
initial_count = 0
max_count = 5


#############
# FileSystems
#############
[efs awselasticfs]
shared_dir = /workspace
encrypted = true
efs_fs_id = fs-MASKED
performance_mode = generalPurpose

[fsx lustrefs]
shared_dir = /fsx
fsx_fs_id = fs-MASKED

Some lines from /var/log/parallelcluster/clustermgtd.log:

2021-01-29 14:38:24,745 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2021-01-29 14:38:29,824 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster
2021-01-29 14:38:29,923 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2021-01-29 14:38:30,128 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2021-01-29 14:38:30,128 - [slurm_plugin.clustermgtd:_handle_powering_down_nodes] - INFO - Resetting powering down nodes: (x1) ['m4-nxlarge-dy-m4xlarge-1(10.98.3.36)']
2021-01-29 14:38:30,139 - [slurm_plugin.clustermgtd:_handle_powering_down_nodes] - INFO - Terminating instances that are backing powering down nodes
2021-01-29 14:38:30,149 - [slurm_plugin.common:delete_instances] - INFO - Terminating instances (x1) ['i-068f928953d0b84ba']
2021-01-29 14:38:30,270 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2021-01-29 14:38:30,271 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance
2021-01-29 14:39:24,792 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2021-01-29 14:39:24,795 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2021-01-29 14:39:24,800 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING
2021-01-29 14:39:24,800 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2021-01-29 14:39:29,885 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster
2021-01-29 14:39:30,070 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2021-01-29 14:39:30,289 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2021-01-29 14:39:30,289 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2021-01-29 14:39:30,290 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance
2021-01-29 14:40:24,847 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2021-01-29 14:40:24,850 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2021-01-29 14:40:24,856 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING
2021-01-29 14:40:24,856 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2021-01-29 14:40:29,938 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster
2021-01-29 14:40:30,065 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2021-01-29 14:40:30,307 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2021-01-29 14:40:30,307 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2021-01-29 14:40:30,308 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance
2021-01-29 14:41:24,902 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2021-01-29 14:41:24,904 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2021-01-29 14:41:24,910 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING                                                                                                                                                                                                                
2021-01-29 14:41:24,910 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2021-01-29 14:41:29,993 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster
2021-01-29 14:41:30,068 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2021-01-29 14:41:30,248 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2021-01-29 14:41:30,248 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2021-01-29 14:41:30,249 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance
2021-01-29 14:42:24,953 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2021-01-29 14:42:24,955 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2021-01-29 14:42:24,960 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING                                                                                                                                                                                                                
2021-01-29 14:42:24,960 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler                                                                                                                                                                                                                                       
2021-01-29 14:42:30,041 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster                                                                                                                                                                                                                   
2021-01-29 14:42:30,136 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2021-01-29 14:42:30,318 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2021-01-29 14:42:30,319 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2021-01-29 14:42:30,319 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance
2021-01-29 14:43:24,978 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2021-01-29 14:43:24,980 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2021-01-29 14:43:24,985 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING
2021-01-29 14:43:24,985 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2021-01-29 14:43:30,065 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster                                                                                                                                                                                                                   
2021-01-29 14:43:30,188 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions                                                                                                                                                                                                                        
2021-01-29 14:43:30,448 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions                                                                                                                                                                                                                                           
2021-01-29 14:43:30,448 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []                                                                                                                                                                                                                         
2021-01-29 14:43:30,449 - [slurm_plugin.clustermgtd:_is_backing_instance_valid] - WARNING - Node state check: no corresponding instance in EC2 for node ondemand-g4dn-p3-m4-dy-g4dnxlarge-1(10.98.3.162)                                                                                                                                                    
2021-01-29 14:43:30,449 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Found the following unhealthy dynamic nodes: (x1) ['ondemand-g4dn-p3-m4-dy-g4dnxlarge-1(10.98.3.162)']                                                                                                                                                                        
2021-01-29 14:43:30,449 - [slurm_plugin.clustermgtd:_handle_unhealthy_dynamic_nodes] - INFO - Setting unhealthy dynamic nodes to down and power_down.                                                                                                                                                                                                       
2021-01-29 14:43:30,469 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance                                                                                                                                                                                                                                  
2021-01-29 14:44:25,021 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf                                                                                                                                                                                                        
2021-01-29 14:44:25,024 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...                                                                                                                                                                                                                                                            
2021-01-29 14:44:25,029 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING                                                                                                                                                                                                                
2021-01-29 14:44:25,029 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler                                                                                                                                                                                                                                       
2021-01-29 14:44:30,109 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster                                                                                                                                                                                                                   
2021-01-29 14:44:30,196 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions                                                                                                                                                                                                                        
2021-01-29 14:44:30,418 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions                                                                                                                                                                                                                                           
2021-01-29 14:44:30,418 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []                                                                                                                                                                                                                         
2021-01-29 14:44:30,419 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance                                                                                                                                                                                                                                  
2021-01-29 14:45:25,043 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf                                                                                                                                                                                                        
2021-01-29 14:45:25,046 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...                                                                                                                                                                                                                                                            
2021-01-29 14:45:25,053 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING         
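
Since clustermgtd later reports "no corresponding instance in EC2" for the node, it can also help to check from the AWS CLI what happened to the backing instance, assuming it has not been cleaned up yet (the instance ID below is a placeholder for the masked one in the logs):

$ aws ec2 describe-instances --region ca-central-1 --instance-ids i-XXXXXXXXXXXXXXXXX \
    --query 'Reservations[].Instances[].[InstanceId,State.Name,StateTransitionReason]'
$ aws ec2 get-console-output --region ca-central-1 --instance-id i-XXXXXXXXXXXXXXXXX --output text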

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 16 (8 by maintainers)

Most upvoted comments

There is a problem with g4dn and encrypted_ephemeral set to true. While I continue looking into this, if encrypted ephemeral storage isn’t a strict requirement for you, you can set encrypted_ephemeral to false as a workaround.
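
For anyone hitting the same issue, a minimal sketch of the workaround against the config shown above (whether the change can be applied in place with pcluster update or requires recreating the cluster may depend on your setup):

[cluster default]
# ... other settings unchanged ...
# Workaround: disable ephemeral-drive encryption until the g4dn bootstrap issue is fixed
encrypted_ephemeral = false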

I’m marking this as a bug. I confirm that encrypted_ephemeral = false is the best option to move forward at the moment. Please also note that g4dn instance store volumes are hardware-encrypted by default, so you should be OK.
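
As a quick sanity check after the change, one can confirm on a running g4dn compute node that the NVMe instance store is visible and mounted; a sketch assuming the default ephemeral_dir of /scratch (device name is illustrative):

$ lsblk            # the instance store appears as an extra NVMe device, e.g. nvme1n1
$ df -h /scratch   # default ephemeral_dir mount point in ParallelCluster 2.x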