aws-parallelcluster: Node failures on Slurm with version 2.6
Environment:
- ParallelCluster 2.6.0
[aws]
aws_region_name = us-east-1
[global]
cluster_template = pcprod
update_check = true
sanity_check = true
[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}
[cluster pcprod]
key_name = s4-sinaia-valinor-nphi
base_os = alinux2
cluster_type = spot
scheduler = slurm
master_instance_type = m4.4xlarge
compute_instance_type = c5.12xlarge
compute_root_volume_size = 500
master_root_volume_size = 800
initial_queue_size = 1
max_queue_size = 800
maintain_initial_size = true
s3_read_resource = arn:aws:s3:::s4-sinaia-valinor-nphi/*
s3_read_write_resource = arn:aws:s3:::s4-sinaia-valinor-nphi/*
tags = {"RunSchedule": "24x7", "PC": "PCPROD"}
extra_json = { "cluster" : { "ganglia_enabled" : "yes" } }
fsx_settings = performancefs
efs_settings = generalfs
scaling_settings = sc
cw_log_settings = pcprodlog
post_install = https://gitlab.com/iidsgt/parallel-cluster/-/raw/c3d78aef623eeeb642f938bad378901a4df293ca/ProdPostInstall.py
post_install_args = "{cromwell_db_user} {cromwell_db_password} {S3_Key} {S3_Secret}"
vpc_settings = pcluster
[cw_log pcprodlog]
enable = true
retention_days = 14
[efs generalfs]
shared_dir = /efs
performance_mode = generalPurpose
[fsx performancefs]
shared_dir = /fsx
storage_capacity = 9600
#imported_file_chunk_size = 1024
deployment_type = SCRATCH_2
#export_path = s3://parallelcluster/fsx
#import_path = s3://parallelcluster
weekly_maintenance_start_time = 5:00:00
[scaling sc]
scaledown_idletime = 30
[vpc pcluster]
### Excluded values
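For reference, a sketch of the ParallelCluster 2.x CLI calls used with the config above (pcluster reads ~/.parallelcluster/config by default; the .pem path is a hypothetical example):

# Create the cluster from the [cluster pcprod] section:
pcluster create pcprod
# Push later config edits to the running cluster:
pcluster update pcprod
# SSH to the master node via the [aliases] ssh template:
pcluster ssh pcprod -i ~/.ssh/s4-sinaia-valinor-nphi.pem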
Bug description and how to reproduce: With both on-demand and spot instances, nodes are consistently put into a failed state, or are marked as down and the job marked as complete even though it has not finished.
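A minimal reproduction sketch, run from the master node (the job script below is hypothetical; any submission that forces new compute nodes to launch should exercise the same scale-up path):

# Submit a trivial multi-node job to force a scale-up:
cat > repro.sh <<'EOF'
#!/bin/bash
#SBATCH --nodes=4
srun hostname
EOF
sbatch repro.sh

# Watch the queue and per-node state while the new instances boot:
squeue
sinfo -N -l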
Additional context:
CloudWatch nodewatcher log lines for a failed node:
2020-03-23 15:25:04,651 INFO [slurm:is_node_down] Node is in state: ''
2020-03-23 15:25:04,652 WARNING [nodewatcher:_poll_wait_for_node_ready] Node reported as down
2020-03-23 15:25:14,682 INFO [slurm:is_node_down] Node is in state: ''
2020-03-23 15:25:14,682 WARNING [nodewatcher:_poll_wait_for_node_ready] Node reported as down
2020-03-23 15:25:24,710 INFO [slurm:is_node_down] Node is in state: ''
2020-03-23 15:25:24,710 WARNING [nodewatcher:_poll_wait_for_node_ready] Node reported as down
2020-03-23 15:25:34,740 INFO [slurm:is_node_down] Node is in state: ''
2020-03-23 15:25:34,740 WARNING [nodewatcher:_poll_wait_for_node_ready] Node reported as down
2020-03-23 15:25:34,740 ERROR [nodewatcher:_terminate_if_down] Node is marked as down by scheduler or not attached correctly. Terminating...
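The empty state string in those lines is what the nodewatcher parses out of the scheduler before giving up on the node. A sketch of how to repeat the same check by hand from the master node, assuming Slurm is installed under /opt/slurm as ParallelCluster does (the node name is an example placeholder):

# Query the scheduler's view of a single compute node:
/opt/slurm/bin/scontrol -o show node ip-10-0-1-23
# A healthy, registered node reports a State= field such as IDLE,
# MIXED, or ALLOCATED. An empty state, as logged above, suggests the
# node never registered cleanly with slurmctld, and after repeated
# polls the nodewatcher terminates the instance.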
Node logs (attachment): i-0ffa05db478debd93.tar.gz
About this issue
- State: closed
- Created 4 years ago
- Comments: 39 (8 by maintainers)
I'm going to resolve this since there hasn't been any progress lately. Starting from ParallelCluster 2.9 we have heavily re-architected our integration with the Slurm scheduler. It might be worth giving it a try, and feel free to open a new issue in case you are still blocked by this issue.