aws-parallelcluster: Cluster creation fails when using GPU instances with Ubuntu 18.04 and Slurm

Environment:

  • AWS ParallelCluster / CfnCluster version: 2.5.0
  • OS: Ubuntu 18.04
  • Scheduler: Slurm
  • Master instance type: g3.4xlarge
  • Compute instance type: g3.4xlarge

Bug description and how to reproduce:

The master instance initializes properly, but the compute instances get stuck in the Initializing state in the EC2 console, so the overall cluster setup fails. I let ParallelCluster create the VPC, security groups, etc. on its own, so they should all be default/valid (I did some manual verification, e.g. to make sure the VPC settings matched what the docs said they should be).

$ pcluster create -c /home/foo/.parallelcluster/config cluster1
Beginning cluster creation for cluster: cluster1
Creating stack named: parallelcluster-cluster1
Status: parallelcluster-cluster1 - ROLLBACK_IN_PROGRESS
Cluster creation failed.  Failed events:
  - AWS::AutoScaling::AutoScalingGroup ComputeFleet Received 2 FAILURE signal(s) out of 2.  Unable to satisfy 100% MinSuccessfulInstancesPercent requirement
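
(Note for anyone hitting the same failure: the rollback terminates the failed compute instances before they can be inspected. A hedged sketch of one way to keep them around, re-using the same config and cluster name as above and assuming your pcluster 2.x CLI supports the --norollback option:)

$ # Keep the failed instances and the CloudFormation stack for inspection
$ pcluster create --norollback -c /home/foo/.parallelcluster/config cluster1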

Additional context:

Config file:

[aws]
aws_region_name = us-west-2

[global]
cluster_template = default
update_check = true
sanity_check = true

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

[cluster default]
key_name = gpu_keypair
base_os = ubuntu1804
scheduler = slurm
master_instance_type = g3.4xlarge
compute_instance_type = g3.4xlarge
initial_queue_size = 2
max_queue_size = 2
maintain_initial_size = true
vpc_settings = default

[vpc default]
vpc_id = vpc-02de6aff5174ff11e
master_subnet_id = subnet-0cc14c110a85fdd4e

Logs from the master node are attached; I cannot connect to the compute nodes, so no compute-node logs are attached:

  • master /var/log/cfn-init.log
  • master /var/log/cloud-init.log
  • master /var/log/cloud-init-output.log
  • master /var/log/jobwatcher
  • master /var/log/sqswatcher

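Since the compute nodes never become reachable, another way to get boot-time information from them is the EC2 console output. A rough sketch, assuming the stuck instance's ID has been looked up in the EC2 console (the ID below is a placeholder):

$ # Dump the serial console output of a stuck compute node
$ aws ec2 get-console-output --region us-west-2 \
    --instance-id i-0123456789abcdef0 --output text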

Thanks in advance for your help!!

Most upvoted comments

We put together and tested a script that patches the issue you are facing and unblocks your cluster creation:

#!/bin/bash
# Install the NVIDIA Tesla driver that is missing from the Ubuntu 18.04
# ParallelCluster AMI. Intended to run as root.

set -e

# Download the NVIDIA 418.87.01 driver installer
wget https://us.download.nvidia.com/tesla/418.87/NVIDIA-Linux-x86_64-418.87.01.run -O /tmp/nvidia.run
chmod +x /tmp/nvidia.run
# Silent install; register the kernel module with DKMS so it survives kernel
# updates, and install the libglvnd libraries
/tmp/nvidia.run --silent --dkms --install-libglvnd
rm -f /tmp/nvidia.run

It takes around 50 seconds to execute the script and install the missing drivers. There are two alternative ways to apply this fix:

  1. Upload the script to an S3 bucket and use it as the cluster's pre_install script, as documented here. This means every node of the cluster will take roughly one additional minute to bootstrap (a hedged config sketch follows this list).
  2. Alternatively, if you don’t want to spend any extra time at node start-up, you can create a custom AMI as documented here and run the script to install the drivers as part of the step that customizes your instance. Then use the custom_ami option with your cluster.
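
As an illustration of option 1 (not taken from the maintainers' comment), a minimal sketch assuming the script is saved as install-nvidia.sh and uploaded to a bucket you own; the bucket name and key are placeholders, and pre_install / s3_read_resource are the relevant ParallelCluster 2.x settings:

$ # Upload the driver-install script (bucket name is a placeholder)
$ aws s3 cp install-nvidia.sh s3://my-bucket/install-nvidia.sh

# Then add to the [cluster default] section of the config:
pre_install = s3://my-bucket/install-nvidia.sh
s3_read_resource = arn:aws:s3:::my-bucket/install-nvidia.sh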

In the meantime we are going to work on a permanent fix with priority and provide a new official patch release as soon as possible.

Please let us know if you need additional guidance to apply the fix, and again, apologies for the inconvenience.

For now the issue seems to be confined to Ubuntu 18.04; at least I was able to create a cluster with CentOS 7.

[UPDATE] Because of a bug on our side, the NVIDIA drivers are not installed on Ubuntu 18.04. So far I see the following workarounds:

  • Switch to a different OS
  • Install the NVIDIA drivers at runtime (we can provide instructions)
  • Create a custom AMI that installs these drivers (we can provide instructions; a rough sketch of this route follows below)
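
As an illustration of the custom-AMI workaround (not taken from the maintainers' comment), a rough sketch: launch an instance from the ParallelCluster Ubuntu 18.04 base AMI for your region, run the driver-install script above on it, create an AMI from that instance, and reference the result in the cluster config. The instance ID, AMI name, and resulting AMI ID below are placeholders:

$ # Snapshot the patched instance into an AMI
$ aws ec2 create-image --region us-west-2 \
    --instance-id i-0123456789abcdef0 --name parallelcluster-ubuntu1804-nvidia

# Then point the [cluster default] section at the resulting AMI:
custom_ami = ami-0123456789abcdef0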