aws-parallelcluster: SSL Verification Error during cluster creation

If you are reporting an issue with AWS Parallelcluster / CfnCluster please make sure to add the following data in order to facilitate the root cause detection:

Required Info:

  • AWS ParallelCluster version [e.g. 2.9.0]: 2.10.4
  • Full cluster configuration without any credentials or personal data:

    [global]
    cluster_template = default
    update_check = true
    sanity_check = true

    [aws]
    aws_region_name = us-gov-east-1

    [aliases]
    ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

    [cluster default]
    key_name = xxxxxxxxxxxxxx
    scheduler = slurm
    master_instance_type = c5n.large
    base_os = alinux2
    vpc_settings = default
    queue_settings = compute
    custom_ami = ami-xxxxxxxxxxxxxxxxx
    ec2_iam_role = Proj_EC2SysAdmin
    additional_iam_policies = arn:aws:iam::aws:policy/AdministratorAccess
    cluster_resource_bucket = gs-hpc-pcluster
    s3_read_write_resource = arn:aws:s3:::xxxxxxxxxxxxx/*
    ebs_settings = volume1
    pre_install = s3://xxxxxxxxxxxxx/pclusterPreinstall.sh

    [compute_resource default]
    instance_type = c5n.18xlarge
    min_count = 1
    max_count = 9

    [ebs volume1]
    ebs_kms_key_id = xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    ebs_volume_id = vol-xxxxxxxxxxxxxxxxx
    encrypted = true
    volume_iops = 100
    volume_type = gp2

    [queue compute]
    compute_resource_settings = default
    enable_efa = true
    enable_efa_gdr = false

    [scaling default]
    scaledown_idletime = 10

    [vpc default]
    vpc_id = vpc-xxxxxxxxxxxxxxxxx
    master_subnet_id = subnet-xxxxxxxxxxxxxxxxx
    use_public_ips = false
    vpc_security_group_id = sg-xxxxxxxxxxxxxxxxx

  • Cluster name: pcluster-hpc
  • [Optional] Arn of the cluster CloudFormation main stack:

Bug description and how to reproduce: I’m new to AWS ParallelCluster. Cluster creation fails while the master instance is being created. I logged into the master VM and looked at /var/log/cloud-init-output.log. It appears to download and install cinc-15.11.8-1.el7.x86_64.rpm, and then it fails to install berkshelf because our corporate Root CA is not trusted:

    /opt/cinc/embedded/bin/gem install --no-document berkshelf:7.0.10
    ERROR: SSL verification error

Because I used the -nr option, the failed master instance was kept, so I manually installed cinc and berkshelf on it and then created a custom AMI from it. The same failure keeps happening.

I’ve tried the following to get gem to trust our corporate CA so that it can install berkshelf:

  1. I added our corporate CA file to /opt/cinc/embedded/ssl/certs/cacert.pem. However, cloud-init always installs cinc even when it is already installed, and the installation appears to reset the whole /opt/cinc/… folder path.
  2. I tried setting the environment variable SSL_CERT_FILE to the RHEL location of our corporate CA pem file, but the variable is not used during cluster creation. When I log in as ec2-user or root, the variable is set and I can gem install berkshelf. I suspect cloud-init and python have something to do with this. I’ve tried placing the variable in .bashrc files, /etc/bashrc, /etc/environment, and /etc/profile.d/<bash and csh scripts>.
  3. I also tried a pre-install script to set environment variables, but I don’t think my pre-install script is executed at this point of cluster creation: I don’t see any of its log statements, and a file I had it touch for confirmation was never created.
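A possible explanation for item 2, stated here as an assumption rather than something confirmed in this thread, is that the bootstrap commands run in a non-login, non-interactive shell, so files such as /etc/profile.d/* and .bashrc are never sourced. One way to rule that out when reproducing the failure by hand is to pass SSL_CERT_FILE explicitly in the child process environment. The sketch below does that for the same gem command; the helper names `gem_env` and `install_berkshelf` are made up for illustration and are not part of ParallelCluster.

```python
import os
import subprocess

# Path to the gem binary as it appears in the report above.
GEM = "/opt/cinc/embedded/bin/gem"


def gem_env(ca_bundle, base_env=None):
    """Return a copy of the environment with SSL_CERT_FILE set to ca_bundle.

    base_env defaults to the current process environment; it is copied,
    never modified in place.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["SSL_CERT_FILE"] = ca_bundle
    return env


def install_berkshelf(ca_bundle, version="7.0.10"):
    """Run the bootstrap's gem command with the CA bundle set explicitly."""
    cmd = [GEM, "install", "--no-document", "berkshelf:" + version]
    # check=True raises CalledProcessError if gem exits with a non-zero status.
    return subprocess.run(cmd, env=gem_env(ca_bundle), check=True)
```

If the install succeeds this way on the failed instance but not during cluster creation, that points at the environment of the bootstrap process rather than at the certificate file itself.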

I don’t know what else to try…

[UPDATE] I figured more details on the problem would help.

When the Master instance is getting started for the first time, it

  1. downloads and installs: https://us-gov-east-1-aws-parallelcluster.s3.us-gov-east-1.amazonaws.com/archives/cinc/el/7/cinc-15.11.8-1.el7.x86_64.rpm
  2. Then, executes /opt/cinc/embedded/bin/gem install --no-document berkshelf:7.0.10. The gem install fails with an SSL verification error. I was puzzled that it could not download from https://rubygems.org because our corporate Root CA isn’t trusted… I’m guessing rubygems.org is being spoofed by a GD server with a GD certificate. Since the /opt/cinc/embedded/ssl/certs/cacert.pem file contains a GlobalSign CA cert and not our corporate Root CA cert, the corporate “rubygems.org” server isn’t trusted. Since cinc is installed every time and resets the cacert.pem file every time, I can’t get the gem install to trust our corporate Root CA.
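Because the cinc RPM is reinstalled on every boot and overwrites cacert.pem, any change to that bundle would have to be reapplied after the RPM install and before the gem install. Whether the bootstrap offers a hook between those two steps is not established in this report. As a minimal sketch of the append step itself, assuming the corporate Root CA is available as a PEM file, the hypothetical helper below adds it to a bundle only if it is not already present, so it is safe to run repeatedly.

```python
from pathlib import Path


def append_ca(bundle_path, ca_path):
    """Append the PEM text in ca_path to bundle_path if it is not there yet.

    Returns True if the bundle was modified, False if it already
    contained the certificate text.
    """
    bundle = Path(bundle_path)
    ca_pem = Path(ca_path).read_text().strip()
    existing = bundle.read_text() if bundle.exists() else ""
    if ca_pem in existing:
        return False
    # Make sure the new certificate starts on its own line.
    separator = "" if existing == "" or existing.endswith("\n") else "\n"
    bundle.write_text(existing + separator + ca_pem + "\n")
    return True


# Example call; the second path is a hypothetical location for the corporate CA:
# append_ca("/opt/cinc/embedded/ssl/certs/cacert.pem", "/etc/pki/tls/certs/corp-root-ca.pem")
```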

If you are reporting issues about scaling or job failure: We cannot work on issues without proper logs. We STRONGLY recommend following this guide and attach the complete cluster log archive with the ticket.

For issues with AWS ParallelCluster >= v2.9.0 and scheduler == slurm, please attach the following logs:

  • From Head node: /var/log/parallelcluster/clustermgtd.log, /var/log/parallelcluster/slurm_resume.log, /var/log/parallelcluster/slurm_suspend.log, and /var/log/slurmctld.log
  • From Compute node: /var/log/parallelcluster/computemgtd.log, and /var/log/slurmd.log

Otherwise, please attach the following logs:

  • From Head node: /var/log/jobwatcher, /var/log/sqswatcher, and /var/log/slurmctld.log if scheduler == slurm.
  • From Compute node:/var/log/nodewatcher, and /var/log/slurmd.log if scheduler == slurm

If you are reporting issues about cluster creation failure or node failure:

If the cluster fails creation, please re-execute create action using --norollback option.

We cannot work on issues without proper logs. We STRONGLY recommend following this guide and attach the complete cluster log archive with the ticket.

  • From Head node: /var/log/cloud-init.log, /var/log/cfn-init.log, and /var/log/chef-client.log
  • From Compute node: /var/log/cloud-init-output.log

Additional context: Any other context about the problem. E.g.:

  • pre/post-install scripts, if any
  • screenshots, if useful

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 19 (11 by maintainers)

Most upvoted comments

Cool, but I am having problems with createami. I tried it on my Windows box, on a t3.small VM, and on the failed master instance, and I keep getting the same error.

    (pcluster) [ssm-user@ip-10-149-11-115 pcluster]$ pcluster createami -i c5n.large -ai ami-xxxxxxxxxxxxxx -os alinux2
    Building AWS ParallelCluster AMI. This could take a while…
    /home/ssm-user/pcluster/lib/python2.7/site-packages/boto3/compat.py:86: PythonDeprecationWarning: Boto3 will no longer support Python 2.7 starting July 15, 2021. To continue receiving service updates, bug fixes, and security updates please upgrade to Python 3.6 or later. More information can be found here: https://aws.amazon.com/blogs/developer/announcing-end-of-support-for-python-2-7-in-aws-sdk-for-python-and-aws-cli-v1/
      warnings.warn(warning, PythonDeprecationWarning)
    Base AMI ID: ami-xxxxxxxx
    Base AMI OS: alinux2
    Instance Type: c5n.large
    Region: us-gov-east-1
    VPC ID: vpc-xxxxxxxxx
    Subnet ID: subnet-xxxxxxxxx
    Template: https://us-gov-east-1-aws-parallelcluster.s3.us-gov-east-1.amazonaws.com/templates/aws-parallelcluster-2.10.4.cfn.json
    Cookbook: https://us-gov-east-1-aws-parallelcluster.s3.us-gov-east-1.amazonaws.com/cookbooks/aws-parallelcluster-cookbook-2.10.4.tgz
    Post install script dir not specified.
    Packer log: /tmp/packer.log.20210618-200041.ndW0q1
    Failed to run /tmp/tmp6bekm7/aws-parallelcluster-cookbook-2.10.4/amis/build_ami.sh --os alinux2 --partition region --region us-gov-east-1 --custom --arch x86_64
    Command not found

No custom AMI created
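The thread does not establish why build_ami.sh reports "Command not found". Two things that could plausibly produce that message, both assumptions rather than confirmed causes, are a required tool missing from PATH (the output mentions a Packer log, so `packer` is used as the example) or the build script itself being missing or not executable. A rough sketch of a pre-flight check, with a hypothetical function name:

```python
import os
import shutil


def check_createami_prereqs(script_path, tools=("packer",)):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for tool in tools:
        # shutil.which returns None when the tool is not on PATH.
        if shutil.which(tool) is None:
            problems.append("'%s' not found on PATH" % tool)
    if not os.path.isfile(script_path):
        problems.append("%s does not exist" % script_path)
    elif not os.access(script_path, os.X_OK):
        problems.append("%s is not executable" % script_path)
    return problems
```

Note that the script path in the output above lives in a temporary directory, so it may no longer exist once the command has exited. The PATH check can still be run on its own by passing any existing script path.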