aws-parallelcluster: ParallelCluster 2.1.1 with RAID 0 config on CentOS 7 fails during cluster creation

Environment:

  • AWS ParallelCluster 2.1.1
  • OS: CentOS 7
  • Scheduler: SGE
  • Master instance type: m5.large
  • Compute instance type: m5.xlarge

Bug description and how to reproduce: Deploying a ParallelCluster 2.1.1 cluster with a RAID 0 configuration fails with the following error:

Beginning cluster creation for cluster: cluster1
Creating stack named: parallelcluster-cluster1
Status: parallelcluster-cluster1 - ROLLBACK_IN_PROGRESS
Cluster creation failed.  Failed events:
  - AWS::EC2::Instance MasterServer Received FAILURE signal with UniqueId i-0ecca142dxxxxx

I thought the failure might be caused by using encrypted EBS volumes with a custom KMS key, but after commenting out both the encrypted and ebs_kms_key_id settings I still get the same failure.

Additional context: configuration file, with credentials and personal data removed:

[global]
update_check = true
sanity_check = true
cluster_template = default

[aws]
aws_region_name = us-west-2

[cluster default]
vpc_settings = vpc-0094xxxxx
key_name = cdns-cluster
base_os = centos7
compute_instance_type = m5.2xlarge
master_instance_type = m5.large
#compute_root_volume_size = 20
#master_root_volume_size = 20
initial_queue_size = 0
tags = {"BU" : "IT", "Sub_BU" : "IT"}
raid_settings = rs
#extra_json = { "cluster" : { "ganglia_enabled" : "yes" } }

[vpc vpc-0094xxxxx]
vpc_id = vpc-0094xxxxx
master_subnet_id = subnet-06cxxxxxx
use_public_ips = false
ssh_from = 172.16.0.0/12

[raid rs]
shared_dir = raid
raid_type = 0
num_of_raid_volumes = 2
volume_size = 100
encrypted = true
ebs_kms_key_id = xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

When I created the cluster with the --norollback option, I could see that the master has a 20 GB disk mounted and exported under /shared. I also noticed that the two disks for the RAID 0 configuration are not attached to the master.
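
The "not attached" observation can be double-checked from the API side rather than from the instance. The following is only a minimal sketch: it assumes a response dictionary shaped like the one returned by boto3's EC2 describe_volumes call (a "Volumes" list whose entries carry "VolumeId" and "Attachments"), and the helper name unattached_volume_ids and all IDs in the sample are hypothetical placeholders, not values from this cluster.

```python
def unattached_volume_ids(describe_volumes_response, instance_id):
    """Return IDs of volumes that have no attachment to the given instance."""
    missing = []
    for volume in describe_volumes_response.get("Volumes", []):
        # A volume counts as present if any attachment targets this instance
        # and is either in progress or complete.
        attached_here = any(
            a.get("InstanceId") == instance_id
            and a.get("State") in ("attaching", "attached")
            for a in volume.get("Attachments", [])
        )
        if not attached_here:
            missing.append(volume["VolumeId"])
    return missing


# Hypothetical sample response; the IDs are placeholders.
sample = {
    "Volumes": [
        {"VolumeId": "vol-root",
         "Attachments": [{"InstanceId": "i-master", "State": "attached"}]},
        {"VolumeId": "vol-raid-1", "Attachments": []},
        {"VolumeId": "vol-raid-2", "Attachments": []},
    ]
}
print(unattached_volume_ids(sample, "i-master"))
```

In a real check the dictionary would come from a describe_volumes call filtered to the volumes created by the stack; the sample above only imitates the situation described here, with the root volume attached and the two RAID volumes left unattached.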

Attachments: cfn-init.log cloud-init.log

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 18 (10 by maintainers)

Most upvoted comments

Yes, same error. Please use m4/c4 instances until the next release of ParallelCluster.

I hit exactly the same issue when attaching two EBS volumes. I think aws/aws-parallelcluster-cookbook#253 will fix the problem; I am posting the details here for the record.

pcluster version: 2.1.1
Full log: cfn-init.log

Major error message:

  * execute[attach_volume_1] action run
    
    ================================================================================
    Error executing action `run` on resource 'execute[attach_volume_1]'
    ================================================================================
    
    Mixlib::ShellOut::ShellCommandFailed
    ------------------------------------
    Expected process to exit with [0], but received '1'
    ---- Begin output of /usr/local/sbin/attachVolume.py vol-0fdf6b613b8d8704b ----
    STDOUT: 
    STDERR: Traceback (most recent call last):
      File "/usr/local/sbin/attachVolume.py", line 90, in <module>
        main()
      File "/usr/local/sbin/attachVolume.py", line 68, in main
        response = ec2.attach_volume(VolumeId=volumeId, InstanceId=instanceId, Device=dev)
      File "/usr/local/lib/python2.7/dist-packages/botocore/client.py", line 357, in _api_call
        return self._make_api_call(operation_name, kwargs)
      File "/usr/local/lib/python2.7/dist-packages/botocore/client.py", line 661, in _make_api_call
        raise error_class(parsed_response, operation_name)
    botocore.exceptions.ClientError: An error occurred (InvalidParameterValue) when calling the AttachVolume operation: Invalid value '/dev/sdb' for unixDevice. Attachment point /dev/sdb is already in use
    ---- End output of /usr/local/sbin/attachVolume.py vol-0fdf6b613b8d8704b ----
    Ran /usr/local/sbin/attachVolume.py vol-0fdf6b613b8d8704b returned 1
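
The traceback shows the script asking for /dev/sdb when that attachment point was already taken. One way to avoid such a collision, sketched here under assumptions, is to choose the device name from the instance's block device mappings as EC2 reports them instead of from a fixed or locally guessed name. This is not the code in attachVolume.py and not necessarily what the cookbook fix does; the helper name next_free_device and the /dev/sd[b-z] candidate range are assumptions for illustration. The input is shaped like the BlockDeviceMappings list in a boto3 describe_instances response.

```python
import string


def next_free_device(block_device_mappings, prefix="/dev/sd"):
    """Return the first /dev/sd[b-z] name not present in the given mappings."""
    in_use = {m["DeviceName"] for m in block_device_mappings}
    # Start at 'b'; 'a' is left for the root volume (an assumption of this sketch).
    for letter in string.ascii_lowercase[1:]:
        candidate = prefix + letter
        if candidate not in in_use:
            return candidate
    raise RuntimeError("no free device name in /dev/sd[b-z]")


# Example: root volume plus one shared volume already attached at /dev/sdb.
mappings = [{"DeviceName": "/dev/sda1"}, {"DeviceName": "/dev/sdb"}]
print(next_free_device(mappings))  # prints /dev/sdc
```

The chosen name would then be passed as the Device argument of ec2.attach_volume, the call that fails in the traceback above.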

Configuration file:

[cluster ebstest]
vpc_settings = public
key_name = ...
base_os = ubuntu1604
master_instance_type = m5.large
compute_instance_type = c5.large
ebs_settings = input,output

[ebs input]
shared_dir = input
volume_type = gp2
volume_size = 150

[ebs output]
shared_dir = output
volume_type = gp2
volume_size = 150

No error occurs with only one EBS volume. No error occurs either when using m4.large instead of m5.large as the master node, as pointed out in https://github.com/aws/aws-parallelcluster/issues/823#issuecomment-452850726.