aws-parallelcluster: Node failing to bootstrap when encrypted_ephemeral is set to true on Alinux2 and CentOS8
Required Info:
- AWS ParallelCluster version: amazon/aws-parallelcluster-2.10.1-amzn2-hvm-x86_64-202012221234 (from the AMI used)
- Cluster name: parallelcluster-test-cluster12
Bug description and how to reproduce: The cluster uses a multi-queue, multi-node setup (see the config below). I am able to use m4.xlarge and p3.2xlarge spot instances, but unable to use g4dn.xlarge instances (on-demand or spot).
From the master node, I get these errors:
$ srun -p g4dn-nxlarge --constraint g4dn.xlarge --pty /bin/bash -l
srun: error: Node failure on g4dn-nxlarge-dy-g4dnxlarge-1
srun: Force Terminated job 19
srun: error: Job allocation 19 has been revoked
$ srun -p ondemand-g4dn-p3-m4 --constraint g4dn-xlarge-ondemand --pty /bin/bash -l
srun: error: Node failure on ondemand-g4dn-p3-m4-dy-g4dnxlarge-1
srun: Force Terminated job 20
srun: error: Job allocation 20 has been revoked
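Right after these failures, the scheduler's view of the failed node can be checked from the head node with standard Slurm commands (generic Slurm tooling, not ParallelCluster-specific):
```
$ sinfo -p g4dn-nxlarge   # node-state overview for the affected partition
$ sinfo -R                # down/drained nodes together with the recorded reason
```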
The last and relevant lines of /var/log/parallelcluster/slurm_resume.log are below:
2021-01-29 14:13:24,952 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x1) ['g4dn-nxlarge-dy-g4dn2xlarge-1']
2021-01-29 14:13:24,953 - [slurm_plugin.resume:main] - INFO - ResumeProgram finished.
2021-01-29 14:24:08,784 - [slurm_plugin.resume:main] - INFO - ResumeProgram startup.
2021-01-29 14:24:08,785 - [slurm_plugin.resume:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf
2021-01-29 14:24:08,786 - [slurm_plugin.resume:_get_config] - INFO - SlurmResumeConfig(region='ca-central-1', cluster_name='test-cluster12', dynamodb_table='parallelcluster-test-cluster12', hosted_zone='Z079224910MUW6KTZJ62V', dns_domain='test-cluster12.pcluster', use_private_hostname=False, head_node_private_ip='MASKED', head_node_hostname='ip-MASKED.ca-central-1.compute.internal', max_batch_size=500, update_node_address=True, all_or_nothing_batch=False, instance_name_type_mapping={'c42xlarge': 'c4.2xlarge', 'c44xlarge': 'c4.4xlarge', 'c4large': 'c4.large', 'g4dn2xlarge': 'g4dn.2xlarge', 'g4dn8xlarge': 'g4dn.8xlarge', 'g4dnxlarge': 'g4dn.xlarge', 'm410xlarge': 'm4.10xlarge', 'm44xlarge': 'm4.4xlarge', 'm4xlarge': 'm4.xlarge', 'p32xlarge': 'p3.2xlarge', 'p316xlarge': 'p3.16xlarge', 'p38xlarge': 'p3.8xlarge'}, clustermgtd_timeout=300, clustermgtd_heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, boto3_config=<botocore.config.Config object at 0x154d9bf35668>, logging_config='/opt/parallelcluster/pyenv/versions/3.6.9/envs/node_virtualenv/lib/python3.6/site-packages/slurm_plugin/logging/parallelcluster_resume_logging.conf')
2021-01-29 14:24:08,788 - [slurm_plugin.resume:main] - INFO - ResumeProgram config: SlurmResumeConfig(region='ca-central-1', cluster_name='test-cluster12', dynamodb_table='parallelcluster-test-cluster12', hosted_zone='Z079224910MUW6KTZJ62V', dns_domain='test-cluster12.pcluster', use_private_hostname=False, head_node_private_ip='MASKED', head_node_hostname='ip-MASKED.ca-central-1.compute.internal', max_batch_size=500, update_node_address=True, all_or_nothing_batch=False, instance_name_type_mapping={'c42xlarge': 'c4.2xlarge', 'c44xlarge': 'c4.4xlarge', 'c4large': 'c4.large', 'g4dn2xlarge': 'g4dn.2xlarge', 'g4dn8xlarge': 'g4dn.8xlarge', 'g4dnxlarge': 'g4dn.xlarge', 'm410xlarge': 'm4.10xlarge', 'm44xlarge': 'm4.4xlarge', 'm4xlarge': 'm4.xlarge', 'p32xlarge': 'p3.2xlarge', 'p316xlarge': 'p3.16xlarge', 'p38xlarge': 'p3.8xlarge'}, clustermgtd_timeout=300, clustermgtd_heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, boto3_config=<botocore.config.Config object at 0x154d9bf35668>, logging_config='/opt/parallelcluster/pyenv/versions/3.6.9/envs/node_virtualenv/lib/python3.6/site-packages/slurm_plugin/logging/parallelcluster_resume_logging.conf')
2021-01-29 14:24:08,791 - [slurm_plugin.common:is_clustermgtd_heartbeat_valid] - INFO - Latest heartbeat from clustermgtd: 2021-01-29 14:23:29.605271+00:00
2021-01-29 14:24:08,791 - [slurm_plugin.resume:_resume] - INFO - Launching EC2 instances for the following Slurm nodes: m4-nxlarge-dy-m4xlarge-1
2021-01-29 14:24:08,926 - [slurm_plugin.common:add_instances_for_nodes] - INFO - Launching instances for slurm nodes (x1) ['m4-nxlarge-dy-m4xlarge-1']
2021-01-29 14:24:10,824 - [slurm_plugin.common:_update_slurm_node_addrs] - INFO - Nodes are now configured with instances: (x1) ["('m4-nxlarge-dy-m4xlarge-1', EC2Instance(id='i-MASKED', private_ip='MASKED', hostname='ip-MASKED', launch_time=datetime.datetime(2021, 1, 29, 14, 24, 10, tzinfo=tzlocal())))"]
2021-01-29 14:24:10,825 - [slurm_plugin.common:_store_assigned_hostnames] - INFO - Saving assigned hostnames in DynamoDB
2021-01-29 14:24:10,859 - [slurm_plugin.common:_store_assigned_hostnames] - INFO - Database update: COMPLETED
2021-01-29 14:24:10,859 - [slurm_plugin.common:_update_dns_hostnames] - INFO - Updating DNS records for Z079224910MUW6KTZJ62V - test-cluster12.pcluster
2021-01-29 14:24:11,133 - [slurm_plugin.common:_update_dns_hostnames] - INFO - DNS records update: COMPLETED
2021-01-29 14:24:11,134 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x1) ['m4-nxlarge-dy-m4xlarge-1']
2021-01-29 14:24:11,135 - [slurm_plugin.resume:main] - INFO - ResumeProgram finished.
2021-01-29 14:29:20,823 - [slurm_plugin.resume:main] - INFO - ResumeProgram startup.
2021-01-29 14:29:20,825 - [slurm_plugin.resume:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf
2021-01-29 14:29:20,826 - [slurm_plugin.resume:_get_config] - INFO - SlurmResumeConfig(region='ca-central-1', cluster_name='test-cluster12', dynamodb_table='parallelcluster-test-cluster12', hosted_zone='Z079224910MUW6KTZJ62V', dns_domain='test-cluster12.pcluster', use_private_hostname=False, head_node_private_ip='MASKED', head_node_hostname='ip-MASKED.ca-central-1.compute.internal', max_batch_size=500, update_node_address=True, all_or_nothing_batch=False, instance_name_type_mapping={'c42xlarge': 'c4.2xlarge', 'c44xlarge': 'c4.4xlarge', 'c4large': 'c4.large', 'g4dn2xlarge': 'g4dn.2xlarge', 'g4dn8xlarge': 'g4dn.8xlarge', 'g4dnxlarge': 'g4dn.xlarge', 'm410xlarge': 'm4.10xlarge', 'm44xlarge': 'm4.4xlarge', 'm4xlarge': 'm4.xlarge', 'p32xlarge': 'p3.2xlarge', 'p316xlarge': 'p3.16xlarge', 'p38xlarge': 'p3.8xlarge'}, clustermgtd_timeout=300, clustermgtd_heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, boto3_config=<botocore.config.Config object at 0x14a9ac828c50>, logging_config='/opt/parallelcluster/pyenv/versions/3.6.9/envs/node_virtualenv/lib/python3.6/site-packages/slurm_plugin/logging/parallelcluster_resume_logging.conf')
2021-01-29 14:29:20,827 - [slurm_plugin.resume:main] - INFO - ResumeProgram config: SlurmResumeConfig(region='ca-central-1', cluster_name='test-cluster12', dynamodb_table='parallelcluster-test-cluster12', hosted_zone='Z079224910MUW6KTZJ62V', dns_domain='test-cluster12.pcluster', use_private_hostname=False, head_node_private_ip='MASKED', head_node_hostname='ip-MASKED.ca-central-1.compute.internal', max_batch_size=500, update_node_address=True, all_or_nothing_batch=False, instance_name_type_mapping={'c42xlarge': 'c4.2xlarge', 'c44xlarge': 'c4.4xlarge', 'c4large': 'c4.large', 'g4dn2xlarge': 'g4dn.2xlarge', 'g4dn8xlarge': 'g4dn.8xlarge', 'g4dnxlarge': 'g4dn.xlarge', 'm410xlarge': 'm4.10xlarge', 'm44xlarge': 'm4.4xlarge', 'm4xlarge': 'm4.xlarge', 'p32xlarge': 'p3.2xlarge', 'p316xlarge': 'p3.16xlarge', 'p38xlarge': 'p3.8xlarge'}, clustermgtd_timeout=300, clustermgtd_heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, boto3_config=<botocore.config.Config object at 0x14a9ac828c50>, logging_config='/opt/parallelcluster/pyenv/versions/3.6.9/envs/node_virtualenv/lib/python3.6/site-packages/slurm_plugin/logging/parallelcluster_resume_logging.conf')
2021-01-29 14:29:20,830 - [slurm_plugin.common:is_clustermgtd_heartbeat_valid] - INFO - Latest heartbeat from clustermgtd: 2021-01-29 14:28:29.904754+00:00
2021-01-29 14:29:20,830 - [slurm_plugin.resume:_resume] - INFO - Launching EC2 instances for the following Slurm nodes: g4dn-nxlarge-dy-g4dnxlarge-1
2021-01-29 14:29:20,916 - [slurm_plugin.common:add_instances_for_nodes] - INFO - Launching instances for slurm nodes (x1) ['g4dn-nxlarge-dy-g4dnxlarge-1']
2021-01-29 14:29:22,666 - [slurm_plugin.common:_update_slurm_node_addrs] - INFO - Nodes are now configured with instances: (x1) ["('g4dn-nxlarge-dy-g4dnxlarge-1', EC2Instance(id='i-008cda499d32e00f6', private_ip='MASKED', hostname='ip-MASKED', launch_time=datetime.datetime(2021, 1, 29, 14, 29, 22, tzinfo=tzlocal())))"]
2021-01-29 14:29:22,666 - [slurm_plugin.common:_store_assigned_hostnames] - INFO - Saving assigned hostnames in DynamoDB
2021-01-29 14:29:22,704 - [slurm_plugin.common:_store_assigned_hostnames] - INFO - Database update: COMPLETED
2021-01-29 14:29:22,704 - [slurm_plugin.common:_update_dns_hostnames] - INFO - Updating DNS records for Z079224910MUW6KTZJ62V - test-cluster12.pcluster
2021-01-29 14:29:22,992 - [slurm_plugin.common:_update_dns_hostnames] - INFO - DNS records update: COMPLETED
2021-01-29 14:29:22,994 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x1) ['g4dn-nxlarge-dy-g4dnxlarge-1']
2021-01-29 14:29:22,995 - [slurm_plugin.resume:main] - INFO - ResumeProgram finished.
2021-01-29 14:39:21,905 - [slurm_plugin.resume:main] - INFO - ResumeProgram startup.
2021-01-29 14:39:21,906 - [slurm_plugin.resume:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf
2021-01-29 14:39:21,907 - [slurm_plugin.resume:_get_config] - INFO - SlurmResumeConfig(region='ca-central-1', cluster_name='test-cluster12', dynamodb_table='parallelcluster-test-cluster12', hosted_zone='Z079224910MUW6KTZJ62V', dns_domain='test-cluster12.pcluster', use_private_hostname=False, head_node_private_ip='MASKED', head_node_hostname='ip-MASKED.ca-central-1.compute.internal', max_batch_size=500, update_node_address=True, all_or_nothing_batch=False, instance_name_type_mapping={'c42xlarge': 'c4.2xlarge', 'c44xlarge': 'c4.4xlarge', 'c4large': 'c4.large', 'g4dn2xlarge': 'g4dn.2xlarge', 'g4dn8xlarge': 'g4dn.8xlarge', 'g4dnxlarge': 'g4dn.xlarge', 'm410xlarge': 'm4.10xlarge', 'm44xlarge': 'm4.4xlarge', 'm4xlarge': 'm4.xlarge', 'p32xlarge': 'p3.2xlarge', 'p316xlarge': 'p3.16xlarge', 'p38xlarge': 'p3.8xlarge'}, clustermgtd_timeout=300, clustermgtd_heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, boto3_config=<botocore.config.Config object at 0x14a3e0804c50>, logging_config='/opt/parallelcluster/pyenv/versions/3.6.9/envs/node_virtualenv/lib/python3.6/site-packages/slurm_plugin/logging/parallelcluster_resume_logging.conf')
2021-01-29 14:39:21,909 - [slurm_plugin.resume:main] - INFO - ResumeProgram config: SlurmResumeConfig(region='ca-central-1', cluster_name='test-cluster12', dynamodb_table='parallelcluster-test-cluster12', hosted_zone='Z079224910MUW6KTZJ62V', dns_domain='test-cluster12.pcluster', use_private_hostname=False, head_node_private_ip='MASKED', head_node_hostname='MASKED.ca-central-1.compute.internal', max_batch_size=500, update_node_address=True, all_or_nothing_batch=False, instance_name_type_mapping={'c42xlarge': 'c4.2xlarge', 'c44xlarge': 'c4.4xlarge', 'c4large': 'c4.large', 'g4dn2xlarge': 'g4dn.2xlarge', 'g4dn8xlarge': 'g4dn.8xlarge', 'g4dnxlarge': 'g4dn.xlarge', 'm410xlarge': 'm4.10xlarge', 'm44xlarge': 'm4.4xlarge', 'm4xlarge': 'm4.xlarge', 'p32xlarge': 'p3.2xlarge', 'p316xlarge': 'p3.16xlarge', 'p38xlarge': 'p3.8xlarge'}, clustermgtd_timeout=300, clustermgtd_heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, boto3_config=<botocore.config.Config object at 0x14a3e0804c50>, logging_config='/opt/parallelcluster/pyenv/versions/3.6.9/envs/node_virtualenv/lib/python3.6/site-packages/slurm_plugin/logging/parallelcluster_resume_logging.conf')
2021-01-29 14:39:21,911 - [slurm_plugin.common:is_clustermgtd_heartbeat_valid] - INFO - Latest heartbeat from clustermgtd: 2021-01-29 14:38:30.272105+00:00
2021-01-29 14:39:21,912 - [slurm_plugin.resume:_resume] - INFO - Launching EC2 instances for the following Slurm nodes: ondemand-g4dn-p3-m4-dy-g4dnxlarge-1
2021-01-29 14:39:21,996 - [slurm_plugin.common:add_instances_for_nodes] - INFO - Launching instances for slurm nodes (x1) ['ondemand-g4dn-p3-m4-dy-g4dnxlarge-1']
2021-01-29 14:39:23,570 - [slurm_plugin.common:_update_slurm_node_addrs] - INFO - Nodes are now configured with instances: (x1) ["('ondemand-g4dn-p3-m4-dy-g4dnxlarge-1', EC2Instance(id='i-MASKED', private_ip='MASKED', hostname='MASKED', launch_time=datetime.datetime(2021, 1, 29, 14, 39, 23, tzinfo=tzlocal())))"]
2021-01-29 14:39:23,570 - [slurm_plugin.common:_store_assigned_hostnames] - INFO - Saving assigned hostnames in DynamoDB
2021-01-29 14:39:23,604 - [slurm_plugin.common:_store_assigned_hostnames] - INFO - Database update: COMPLETED
2021-01-29 14:39:23,604 - [slurm_plugin.common:_update_dns_hostnames] - INFO - Updating DNS records for Z079224910MUW6KTZJ62V - test-cluster12.pcluster
2021-01-29 14:39:23,954 - [slurm_plugin.common:_update_dns_hostnames] - INFO - DNS records update: COMPLETED
2021-01-29 14:39:23,956 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x1) ['ondemand-g4dn-p3-m4-dy-g4dnxlarge-1']
2021-01-29 14:39:23,957 - [slurm_plugin.resume:main] - INFO - ResumeProgram finished.
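The resume log shows every instance, including the g4dn node, launching successfully, so the failure must happen later, during node bootstrap. One way to catch the bootstrap error before the node is torn down is to SSH to the compute node's private IP (printed in the log above) while it is still up and watch its provisioning logs; the log paths below are the usual ones on an Alinux2 compute node, so treat this as a sketch:
```
$ ssh <compute-private-ip> 'sudo tail -f /var/log/cloud-init-output.log /var/log/cfn-init.log'
```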
- From the head node, the relevant logs are /var/log/parallelcluster/clustermgtd.log, /var/log/parallelcluster/slurm_resume.log, /var/log/parallelcluster/slurm_suspend.log, and /var/log/slurmctld.log.
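For reference, the excerpts in this report were pulled with standard tools on the head node; tail prints a ==> file <== header per file:
```
$ sudo tail -n 20 /var/log/parallelcluster/{clustermgtd,slurm_resume,slurm_suspend}.log
$ sudo tail -n 20 /var/log/slurmctld.log
```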
The config:
[aws]
aws_region_name = ca-central-1
[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}
[global]
cluster_template = default
update_check = true
sanity_check = true
[cluster default]
base_os = alinux2
key_name = parallel-cluster-MASKED
vpc_settings = default
efs_settings = awselasticfs
fsx_settings = lustrefs
s3_read_resource = *
s3_read_write_resource = *
scheduler = slurm
master_instance_type = m4.xlarge
encrypted_ephemeral = true
master_root_volume_size = 25 # GB
compute_root_volume_size = 25 # GB
queue_settings = m4-nxlarge, c4-nxlarge, g4dn-nxlarge, p3-nxlarge, ondemand-g4dn-p3-m4
[vpc default]
vpc_id = vpc-MASKED
master_subnet_id = subnet-MASKED
compute_subnet_id = subnet-MASKED
##########################
# Instance and Queue Setup
##########################
# Useful docs:
# instance types: https://aws.amazon.com/ec2/instance-types/
# spot instance pricing: https://aws.amazon.com/ec2/spot/pricing/
[scaling custom]
scaledown_idletime = 5
[queue ondemand-g4dn-p3-m4]
compute_resource_settings = g4dn-xlarge-ondemand, p3-2xlarge-ondemand, m4-4xlarge-ondemand
compute_type = ondemand
[queue m4-nxlarge]
compute_resource_settings = m4-xlarge, m4-4xlarge, m4-10xlarge
compute_type = spot
[queue c4-nxlarge]
compute_resource_settings = c4-large, c4-2xlarge, c4-4xlarge
compute_type = spot
[queue g4dn-nxlarge]
compute_resource_settings = g4dn-xlarge, g4dn-2xlarge, g4dn-8xlarge
compute_type = spot
[queue p3-nxlarge]
compute_resource_settings = p3-2xlarge, p3-8xlarge, p3-16xlarge
compute_type = spot
[queue m4-xlarge-spot]
compute_resource_settings = m4-xlarge-initzero-spot
compute_type = spot
# Compute Resources
[compute_resource m4-xlarge]
instance_type = m4.xlarge # 4 cpu 16 GB
initial_count = 0
max_count = 20
[compute_resource m4-4xlarge-ondemand]
instance_type = m4.4xlarge # 16 cpu 64 GB
initial_count = 0
max_count = 1 # use only for debugging
[compute_resource m4-4xlarge]
instance_type = m4.4xlarge # 16 cpu 64 GB
initial_count = 0
max_count = 10
[compute_resource m4-10xlarge]
instance_type = m4.10xlarge
initial_count = 0
max_count = 4
[compute_resource c4-large]
instance_type = c4.large
initial_count = 0
max_count = 20
[compute_resource c4-2xlarge]
instance_type = c4.2xlarge
initial_count = 0
max_count = 10
[compute_resource c4-4xlarge]
instance_type = c4.4xlarge
initial_count = 0
max_count = 5
[compute_resource g4dn-xlarge-ondemand]
instance_type = g4dn.xlarge
initial_count = 0
max_count = 1
[compute_resource g4dn-xlarge]
instance_type = g4dn.xlarge
initial_count = 0
max_count = 20
[compute_resource g4dn-2xlarge]
instance_type = g4dn.2xlarge
initial_count = 0
max_count = 10
[compute_resource g4dn-4xlarge]
instance_type = g4dn.4xlarge
initial_count = 0
max_count = 5
[compute_resource g4dn-8xlarge]
instance_type = g4dn.8xlarge
initial_count = 0
max_count = 5
[compute_resource p3-2xlarge-ondemand]
instance_type = p3.2xlarge
initial_count = 0
max_count = 1
[compute_resource p3-2xlarge]
instance_type = p3.2xlarge
initial_count = 0
max_count = 5
[compute_resource p3-8xlarge]
instance_type = p3.8xlarge
initial_count = 0
max_count = 5
[compute_resource p3-16xlarge]
instance_type = p3.16xlarge
initial_count = 0
max_count = 5
#############
# FileSystems
#############
[efs awselasticfs]
shared_dir = /workspace
encrypted = true
efs_fs_id = fs-MASKED
performance_mode = generalPurpose
[fsx lustrefs]
shared_dir = /fsx
fsx_fs_id = fs-MASKED
Some relevant lines from /var/log/parallelcluster/clustermgtd.log:
2021-01-29 14:38:24,745 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2021-01-29 14:38:29,824 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster
2021-01-29 14:38:29,923 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2021-01-29 14:38:30,128 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2021-01-29 14:38:30,128 - [slurm_plugin.clustermgtd:_handle_powering_down_nodes] - INFO - Resetting powering down nodes: (x1) ['m4-nxlarge-dy-m4xlarge-1(10.98.3.36)']
2021-01-29 14:38:30,139 - [slurm_plugin.clustermgtd:_handle_powering_down_nodes] - INFO - Terminating instances that are backing powering down nodes
2021-01-29 14:38:30,149 - [slurm_plugin.common:delete_instances] - INFO - Terminating instances (x1) ['i-068f928953d0b84ba']
2021-01-29 14:38:30,270 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2021-01-29 14:38:30,271 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance
2021-01-29 14:39:24,792 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2021-01-29 14:39:24,795 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2021-01-29 14:39:24,800 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING
2021-01-29 14:39:24,800 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2021-01-29 14:39:29,885 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster
2021-01-29 14:39:30,070 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2021-01-29 14:39:30,289 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2021-01-29 14:39:30,289 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2021-01-29 14:39:30,290 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance
2021-01-29 14:40:24,847 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2021-01-29 14:40:24,850 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2021-01-29 14:40:24,856 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING
2021-01-29 14:40:24,856 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2021-01-29 14:40:29,938 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster
2021-01-29 14:40:30,065 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2021-01-29 14:40:30,307 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2021-01-29 14:40:30,307 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2021-01-29 14:40:30,308 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance
2021-01-29 14:41:24,902 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2021-01-29 14:41:24,904 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2021-01-29 14:41:24,910 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING
2021-01-29 14:41:24,910 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2021-01-29 14:41:29,993 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster
2021-01-29 14:41:30,068 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2021-01-29 14:41:30,248 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2021-01-29 14:41:30,248 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2021-01-29 14:41:30,249 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance
2021-01-29 14:42:24,953 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2021-01-29 14:42:24,955 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2021-01-29 14:42:24,960 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING
2021-01-29 14:42:24,960 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2021-01-29 14:42:30,041 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster
2021-01-29 14:42:30,136 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2021-01-29 14:42:30,318 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2021-01-29 14:42:30,319 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2021-01-29 14:42:30,319 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance
2021-01-29 14:43:24,978 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2021-01-29 14:43:24,980 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2021-01-29 14:43:24,985 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING
2021-01-29 14:43:24,985 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2021-01-29 14:43:30,065 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster
2021-01-29 14:43:30,188 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2021-01-29 14:43:30,448 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2021-01-29 14:43:30,448 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2021-01-29 14:43:30,449 - [slurm_plugin.clustermgtd:_is_backing_instance_valid] - WARNING - Node state check: no corresponding instance in EC2 for node ondemand-g4dn-p3-m4-dy-g4dnxlarge-1(10.98.3.162)
2021-01-29 14:43:30,449 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Found the following unhealthy dynamic nodes: (x1) ['ondemand-g4dn-p3-m4-dy-g4dnxlarge-1(10.98.3.162)']
2021-01-29 14:43:30,449 - [slurm_plugin.clustermgtd:_handle_unhealthy_dynamic_nodes] - INFO - Setting unhealthy dynamic nodes to down and power_down.
2021-01-29 14:43:30,469 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance
2021-01-29 14:44:25,021 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2021-01-29 14:44:25,024 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2021-01-29 14:44:25,029 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING
2021-01-29 14:44:25,029 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2021-01-29 14:44:30,109 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster
2021-01-29 14:44:30,196 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2021-01-29 14:44:30,418 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2021-01-29 14:44:30,418 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2021-01-29 14:44:30,419 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance
2021-01-29 14:45:25,043 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2021-01-29 14:45:25,046 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2021-01-29 14:45:25,053 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING
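The failure signature in this log is the health-check warning at 14:43:30, where clustermgtd no longer finds an EC2 instance backing the node it just launched and marks it unhealthy. A quick way to locate that warning (and any recurrences) on the head node:
```
$ sudo grep -n "no corresponding instance in EC2" /var/log/parallelcluster/clustermgtd.log
```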
About this issue
- State: closed
- Created 3 years ago
- Comments: 16 (8 by maintainers)
Commits related to this issue
- Fix creation of ram device on Alinux2 Before creating the ram device, be sure that the brd (block ram disk) module is loaded, otherwise you'll get the error: ``` # mkfs -q /dev/ram1 1024 Could not st... — committed to lukeseawalker/aws-parallelcluster-cookbook by lukeseawalker 3 years ago
- Fix creation of ram device on Alinux2 Before creating the ram device, be sure that the brd (block ram disk) module is loaded, otherwise you'll get the error: ``` # mkfs -q /dev/ram1 1024 Could not st... — committed to aws/aws-parallelcluster-cookbook by lukeseawalker 3 years ago
- Fix creation of ram device on Alinux2 Before creating the ram device, be sure that the brd (block ram disk) module is loaded, otherwise you'll get the error: ``` Could not stat /dev/ram1 --- No such ... — committed to demartinofra/aws-parallelcluster-cookbook by lukeseawalker 3 years ago
- Fix creation of ram device on Alinux2 Before creating the ram device, be sure that the brd (block ram disk) module is loaded, otherwise you'll get the error: ``` Could not stat /dev/ram1 --- No such ... — committed to aws/aws-parallelcluster-cookbook by lukeseawalker 3 years ago
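Per the commit messages above, the fix loads the brd (block ram disk) kernel module before creating the ram device; on Alinux2 the module is not loaded by default, so /dev/ram1 does not exist and mkfs fails with the "Could not stat /dev/ram1" error quoted in the commits. A minimal sketch of the repaired sequence, using the device name and size from the (truncated) commit message:
```
# Load the block ram disk module so the /dev/ramN device nodes exist.
$ sudo modprobe brd
# Now the ram device can be formatted; without brd loaded, this step
# fails with "Could not stat /dev/ram1".
$ sudo mkfs -q /dev/ram1 1024
```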
There is a problem with `g4dn` and `encrypted_ephemeral` set to `true`. While I continue looking into this, if encrypted ephemeral storage isn't a strict requirement for you, you can set it to `false`.

I'm marking this as a bug. I confirm that `encrypted_ephemeral = false` is the best option to move forward at the moment. Please also note that `g4dn` instances have hardware-encrypted ephemeral storage by default, so you should be OK.
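For reference, the workaround amounts to flipping a single line in the `[cluster default]` section of the config shown above:
```
[cluster default]
# Temporary workaround: disable software encryption of the ephemeral drives.
# g4dn instance store volumes are hardware-encrypted by default, so data on
# them remains encrypted at rest even with this setting off.
encrypted_ephemeral = false
```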