aws-parallelcluster: /etc/init.d/slurm restart fails on compute nodes
Launching many nodes (e.g. 15 c4.8xlarge instances) results in nodes taking a very long time to register with Slurm, and ultimately in downscaling. According to `sinfo`, roughly one node registers with Slurm every 5-10 minutes. Further, the scale-down logic checks whether nodes ran jobs in the past hour (outlined here: http://cfncluster.readthedocs.io/en/latest/processes.html#sqswatcher). Jobs that need many nodes cannot start because not enough nodes are in a ready state, so the cluster scales down even though jobs are pending.
There may be a problem with how nodes are registered with Slurm, since registration takes far too long. The autoscaling logic also needs to be revised, because the scale-up and scale-down policies can interact to produce cyclic scaling: nodes may not register with the scheduler (not necessarily Slurm) in time, triggering a scale-down event, and since the pending jobs still cannot run, scale-up events follow.
One possible fix for the downscaling behavior would be to allow downscaling only when no jobs are pending. Downscaling should also be allowed when jobs are pending but can never run because they require more nodes than autoscaling permits (e.g. the only job in the queue needs 10 nodes, but the max node count is 5).
`!jobs_pending || (max_allowed_node_count < min_required_nodes(jobs))`
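A minimal sketch of that guard, assuming hypothetical helpers and names (`Job.nodes_required`, `MAX_ALLOWED_NODE_COUNT`, `downscaling_allowed`) that are not part of cfncluster's actual code:

```python
# Hypothetical sketch of the proposed downscale guard; the names used here
# are illustrative assumptions, not cfncluster's real API.

MAX_ALLOWED_NODE_COUNT = 5  # cluster's autoscaling maximum


class Job:
    def __init__(self, nodes_required):
        self.nodes_required = nodes_required


def min_required_nodes(jobs):
    """Smallest node count any pending job needs in order to start."""
    return min(job.nodes_required for job in jobs)


def downscaling_allowed(pending_jobs):
    # No pending jobs: safe to scale down idle nodes.
    if not pending_jobs:
        return True
    # Jobs are pending, but even the least demanding one needs more nodes
    # than autoscaling will ever provide, so it can never run anyway.
    return MAX_ALLOWED_NODE_COUNT < min_required_nodes(pending_jobs)


# Example: the only queued job needs 10 nodes but the max is 5,
# so downscaling is still allowed; a 3-node job blocks it.
print(downscaling_allowed([Job(nodes_required=10)]))  # True
print(downscaling_allowed([Job(nodes_required=3)]))   # False
```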
@adavanisanti I tested this on CentOS 7; specifically, I built a cluster on the following custom AMI:
ami-9cb9aef8
This is the CentOS 7 AMI for eu-west-2; the full list is here:
https://github.com/awslabs/cfncluster/blob/master/amis.txt