aws-parallelcluster: Failing to create cluster when using GPU instance with ubuntu 18 and Slurm
Environment:
- AWS ParallelCluster / CfnCluster version: 2.5.0
- OS: Ubuntu 18.04
- Scheduler: Slurm
- Master instance type: g3.4xlarge
- Compute instance type: g3.4xlarge
Bug description and how to reproduce:
The master instance initializes properly, but the compute instances get stuck in the Initializing state in the EC2 console, so the overall cluster setup fails. I let ParallelCluster create the VPC, security groups, etc. on its own, so they should all be default/valid (I did some manual verification, e.g. to make sure the VPC settings matched what the docs say they should be).
$ pcluster create -c /home/foo/.parallelcluster/config cluster1
Beginning cluster creation for cluster: cluster1
Creating stack named: parallelcluster-cluster1
Status: parallelcluster-cluster1 - ROLLBACK_IN_PROGRESS
Cluster creation failed. Failed events:
- AWS::AutoScaling::AutoScalingGroup ComputeFleet Received 2 FAILURE signal(s) out of 2. Unable to satisfy 100% MinSuccessfulInstancesPercent requirement
Additional context:
Config file:
[aws]
aws_region_name = us-west-2
[global]
cluster_template = default
update_check = true
sanity_check = true
[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}
[cluster default]
key_name = gpu_keypair
base_os = ubuntu1804
scheduler = slurm
master_instance_type = g3.4xlarge
compute_instance_type = g3.4xlarge
initial_queue_size = 2
max_queue_size = 2
maintain_initial_size = true
vpc_settings = default
[vpc default]
vpc_id = vpc-02de6aff5174ff11e
master_subnet_id = subnet-0cc14c110a85fdd4e
Logs from the master node (attached):
- /var/log/cfn-init.log
- /var/log/cloud-init.log
- /var/log/cloud-init-output.log
- /var/log/jobwatcher
- /var/log/sqswatcher
Cannot connect to compute nodes, so no compute logs are attached.
Thanks in advance for your help!!
About this issue
- State: closed
- Created 5 years ago
- Comments: 22 (11 by maintainers)
Commits related to this issue
- Add libglvnd-dev package to Ubuntu 18. This is required to install Nvidia drivers. Fixes https://github.com/aws/aws-parallelcluster/issues/1479. Signed-off-by: Francesco De Martino <fdm@amazon.com> (committed to demartinofra/aws-parallelcluster-cookbook by demartinofra 5 years ago)
- Add libglvnd-dev package to Ubuntu 18. This is required to install Nvidia drivers. Fixes https://github.com/aws/aws-parallelcluster/issues/1479. Signed-off-by: Francesco De Martino <fdm@amazon.com> (committed to aws/aws-parallelcluster-cookbook by demartinofra 5 years ago)
We put together and tested a script to patch the issue you are facing and unblock your cluster creation. The script is the following:
It takes around 50 seconds to execute the script and install the missing drivers. There are 2 alternative ways you can apply this fix:
- Use it as a pre_install script, as described in the ParallelCluster documentation. This means that every node of the cluster is going to take 1 additional minute to be bootstrapped.
In the meantime we are going to work with priority on a permanent fix and provide a new official patch release as soon as possible.
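As a rough illustration of the pre_install route, the cluster section of the config posted above could point at the patch script once it is hosted somewhere the nodes can download it from. This is only a sketch: the URL is a placeholder rather than a real location, and the script file name is hypothetical.

```
[cluster default]
# Placeholder URL (assumption): replace with wherever you host the patch script.
pre_install = https://<bucket-name>.s3.amazonaws.com/<patch-script>.sh
```

Because pre_install runs while each node bootstraps, this is where the extra minute per node comes from.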
Please let us know if you need additional guidance to apply the fix and again apologies for the inconvenience.
For now the issue seems to be confined to Ubuntu 18.04. At least, I was able to create a cluster with CentOS 7.
[UPDATE] Because of a bug on our side the NVIDIA drivers are not installed on Ubuntu18. So far I see the following workarounds: