aws-parallelcluster: Node failures on Slurm with version 2.6

Environment:

  • ParallelCluster version: 2.6.0

Cluster configuration:
[aws]
aws_region_name = us-east-1

[global]
cluster_template = pcprod
update_check = true
sanity_check = true

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

[cluster pcprod]

key_name = s4-sinaia-valinor-nphi
base_os = alinux2
cluster_type = spot
scheduler = slurm

master_instance_type = m4.4xlarge
compute_instance_type = c5.12xlarge
compute_root_volume_size = 500
master_root_volume_size = 800

initial_queue_size = 1
max_queue_size = 800
maintain_initial_size = true

s3_read_resource = arn:aws:s3:::s4-sinaia-valinor-nphi/*
s3_read_write_resource = arn:aws:s3:::s4-sinaia-valinor-nphi/*

tags = {"RunSchedule": "24x7", "PC": "PCPROD"}

extra_json = { "cluster" : { "ganglia_enabled" : "yes" } }

fsx_settings = performancefs
efs_settings = generalfs
scaling_settings = sc
cw_log_settings = pcprodlog

post_install = https://gitlab.com/iidsgt/parallel-cluster/-/raw/c3d78aef623eeeb642f938bad378901a4df293ca/ProdPostInstall.py

post_install_args = "{cromwell_db_user} {cromwell_db_password} {S3_Key} {S3_Secret}"

vpc_settings = pcluster

[cw_log pcprodlog]
enable = true
retention_days = 14

[efs generalfs]
shared_dir = /efs
performance_mode = generalPurpose

[fsx performancefs]
shared_dir = /fsx
storage_capacity = 9600
#imported_file_chunk_size = 1024
deployment_type = SCRATCH_2
#export_path = s3://parallelcluster/fsx
#import_path = s3://parallelcluster
weekly_maintenance_start_time = 5:00:00


[scaling sc]
scaledown_idletime = 30

[vpc pcluster]
### Excluded values
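
Each *_settings option in [cluster pcprod] above refers by label to one of the named sections that follow it. As a quick, informal cross-reference check for a config laid out this way, something like the following sketch can be run against the saved file (the path, the cluster section name, and the option/prefix pairs are assumptions taken from this particular config; this is not the validation that pcluster itself performs):

import configparser
import os

# Informal check that every *_settings option in [cluster pcprod] points at
# a section that actually exists in the file. Path and option/prefix pairs
# are assumptions based on the config shown above.
CONFIG_PATH = os.path.expanduser("~/.parallelcluster/config")

REFERENCES = [
    ("fsx_settings", "fsx"),
    ("efs_settings", "efs"),
    ("scaling_settings", "scaling"),
    ("cw_log_settings", "cw_log"),
    ("vpc_settings", "vpc"),
]

parser = configparser.ConfigParser()
parser.read(CONFIG_PATH)

cluster = parser["cluster pcprod"]
for option, prefix in REFERENCES:
    label = cluster.get(option)
    if label and f"{prefix} {label}" not in parser:
        print(f"{option} = {label} has no matching [{prefix} {label}] section")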

Bug description and how to reproduce: With either on-demand or spot instances, compute nodes are consistently put into a failed state, or are marked as down while the job they were running is marked as complete even though it has not finished.

Additional context:

CloudWatch nodewatcher log lines for a failed node:

2020-03-23 15:25:04,651 INFO [slurm:is_node_down] Node is in state: ''
2020-03-23 15:25:04,652 WARNING [nodewatcher:_poll_wait_for_node_ready] Node reported as down
2020-03-23 15:25:14,682 INFO [slurm:is_node_down] Node is in state: ''
2020-03-23 15:25:14,682 WARNING [nodewatcher:_poll_wait_for_node_ready] Node reported as down
2020-03-23 15:25:24,710 INFO [slurm:is_node_down] Node is in state: ''
2020-03-23 15:25:24,710 WARNING [nodewatcher:_poll_wait_for_node_ready] Node reported as down
2020-03-23 15:25:34,740 INFO [slurm:is_node_down] Node is in state: ''
2020-03-23 15:25:34,740 WARNING [nodewatcher:_poll_wait_for_node_ready] Node reported as down
2020-03-23 15:25:34,740 ERROR [nodewatcher:_terminate_if_down] Node is marked as down by scheduler or not attached correctly. Terminating...
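
The pattern in these lines suggests the nodewatcher repeatedly asks Slurm for the node's state, treats the empty state string as the node being down, and terminates the instance once the node has stayed down for the whole polling window. A minimal sketch of that kind of logic (the function names, thresholds, and state strings here are illustrative, not the actual nodewatcher code):

import subprocess
import time

def get_node_state(hostname):
    # Ask Slurm for the node's compact state string; an empty result
    # (as in the log above) means the scheduler does not report the
    # node as usable.
    result = subprocess.run(
        ["sinfo", "-h", "-N", "-n", hostname, "-o", "%t"],
        capture_output=True, text=True,
    )
    return result.stdout.strip()

def watch_node(hostname, poll_interval=10, max_down_polls=4):
    # Mirror the log: repeated "down" polls, then a terminate decision
    # once the node has stayed down for the whole window.
    down_polls = 0
    while True:
        state = get_node_state(hostname)
        print(f"Node is in state: '{state}'")
        if state not in ("", "down", "down*", "drain", "drng"):
            return  # node looks healthy, stop watching
        print("Node reported as down")
        down_polls += 1
        if down_polls >= max_down_polls:
            print("Node is marked as down by scheduler or not attached "
                  "correctly. Terminating...")
            return
        time.sleep(poll_interval)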

Node logs: i-0ffa05db478debd93.tar.gz

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 39 (8 by maintainers)

Most upvoted comments

I’m going to resolve this since there hasn’t been any progress lately. Starting from ParallelCluster 2.9 we have heavily rearchitected our integration with the Slurm scheduler. It might be worth giving it a try, and feel free to open a new issue if you are still blocked by this problem.