aws-parallelcluster: Compute nodes suddenly failing to start

Environment:

  • aws-parallelcluster-2.4.1
  • OS: alinux
  • Scheduler: Slurm
  • Master instance type: t2.medium
  • Compute instance type: t2.2xlarge

Bug description and how to reproduce: Compute nodes suddenly fail to launch. Initial cluster creation worked fine, but new compute nodes keep terminating because their initialization fails.

Additional context:

Sep 27 14:27:23 cloud-init[3245]: helpers.py[DEBUG]: config-scripts-per-once already ran (freq=once)
Sep 27 14:27:23 cloud-init[3245]: stages.py[DEBUG]: Running module scripts-per-boot (<module 'cloudinit.config.cc_scripts_per_boot' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_per_boot.pyc'>) with frequency always
Sep 27 14:27:23 cloud-init[3245]: helpers.py[DEBUG]: Running config-scripts-per-boot using lock (<cloudinit.helpers.DummyLock object at 0x7f020280d810>)
Sep 27 14:27:23 cloud-init[3245]: stages.py[DEBUG]: Running module scripts-per-instance (<module 'cloudinit.config.cc_scripts_per_instance' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_per_instance.pyc'>) with frequency once-per-instance
Sep 27 14:27:23 cloud-init[3245]: util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_scripts_per_instance - wb: [644] 20 bytes
Sep 27 14:27:23 cloud-init[3245]: helpers.py[DEBUG]: Running config-scripts-per-instance using lock (<FileLock using file '/var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_scripts_per_instance'>)
Sep 27 14:27:23 cloud-init[3245]: stages.py[DEBUG]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_user.pyc'>) with frequency once-per-instance
Sep 27 14:27:23 cloud-init[3245]: util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_scripts_user - wb: [644] 20 bytes
Sep 27 14:27:23 cloud-init[3245]: helpers.py[DEBUG]: Running config-scripts-user using lock (<FileLock using file '/var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_scripts_user'>)
Sep 27 14:27:23 cloud-init[3245]: util.py[DEBUG]: Running command ['/var/lib/cloud/instance/scripts/part-002'] with allowed return codes [0] (shell=True, capture=False)
Sep 27 14:34:38 cloud-init[3245]: util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-002 [1]
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Failed running /var/lib/cloud/instance/scripts/part-002 [1]
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/cloudinit/util.py", line 645, in runparts
    subp(prefix + [exe_path], capture=False, shell=True)
  File "/usr/lib/python2.7/dist-packages/cloudinit/util.py", line 1626, in subp
    cmd=args)
ProcessExecutionError: Unexpected error while running command.
Command: ['/var/lib/cloud/instance/scripts/part-002']
Exit code: 1
Reason: -
Stdout: ''
Stderr: ''
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Running command ['/var/lib/cloud/instance/scripts/runcmd'] with allowed return codes [0] (shell=True, capture=False)
Sep 27 14:34:38 cloud-init[3245]: cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
Sep 27 14:34:38 cloud-init[3245]: util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/cloudinit/stages.py", line 660, in _run_modules
    cc.run(run_name, mod.handle, func_args, freq=freq)
  File "/usr/lib/python2.7/dist-packages/cloudinit/cloud.py", line 63, in run
    return self._runners.run(name, functor, args, freq, clear_on_fail)
  File "/usr/lib/python2.7/dist-packages/cloudinit/helpers.py", line 197, in run
    results = functor(*args)
  File "/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_user.py", line 38, in handle
    util.runparts(runparts_path)
  File "/usr/lib/python2.7/dist-packages/cloudinit/util.py", line 652, in runparts
    % (len(failed), len(attempted)))
RuntimeError: Runparts: 1 failures in 2 attempted commands
Sep 27 14:34:38 cloud-init[3245]: stages.py[DEBUG]: Running module ssh-authkey-fingerprints (<module 'cloudinit.config.cc_ssh_authkey_fingerprints' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_ssh_authkey_fingerprints.pyc'>) with frequency once-per-instance
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_ssh_authkey_fingerprints - wb: [644] 20 bytes
Sep 27 14:34:38 cloud-init[3245]: helpers.py[DEBUG]: Running config-ssh-authkey-fingerprints using lock (<FileLock using file '/var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_ssh_authkey_fingerprints'>)
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Reading from /etc/ssh/sshd_config (quiet=False)
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Read 512 bytes from /etc/ssh/sshd_config
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Reading from /home/ec2-user/.ssh/authorized_keys (quiet=False)
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Read 404 bytes from /home/ec2-user/.ssh/authorized_keys
Sep 27 14:34:38 cloud-init[3245]: stages.py[DEBUG]: Running module keys-to-console (<module 'cloudinit.config.cc_keys_to_console' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_keys_to_console.pyc'>) with frequency once-per-instance
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_keys_to_console - wb: [644] 20 bytes
Sep 27 14:34:38 cloud-init[3245]: helpers.py[DEBUG]: Running config-keys-to-console using lock (<FileLock using file '/var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_keys_to_console'>)
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Running command ['/usr/libexec/cloud-init/write-ssh-key-fingerprints', '', 'ssh-dss'] with allowed return codes [0] (shell=False, capture=True)
Sep 27 14:34:38 cloud-init[3245]: stages.py[DEBUG]: Running module phone-home (<module 'cloudinit.config.cc_phone_home' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_phone_home.pyc'>) with frequency once-per-instance
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_phone_home - wb: [644] 20 bytes
Sep 27 14:34:38 cloud-init[3245]: helpers.py[DEBUG]: Running config-phone-home using lock (<FileLock using file '/var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_phone_home'>)
Sep 27 14:34:38 cloud-init[3245]: cc_phone_home.py[DEBUG]: Skipping module named phone-home, no 'phone_home' configuration found
Sep 27 14:34:38 cloud-init[3245]: stages.py[DEBUG]: Running module final-message (<module 'cloudinit.config.cc_final_message' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_final_message.pyc'>) with frequency always
Sep 27 14:34:38 cloud-init[3245]: helpers.py[DEBUG]: Running config-final-message using lock (<cloudinit.helpers.DummyLock object at 0x7f02024694d0>)
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Reading from /proc/uptime (quiet=False)
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Read 15 bytes from /proc/uptime
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Cloud-init v. 0.7.6 finished at Fri, 27 Sep 2019 14:34:38 +0000. Datasource DataSourceEc2.  Up 460.71 seconds
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Writing to /var/lib/cloud/instance/boot-finished - wb: [644] 52 bytes
Sep 27 14:34:38 cloud-init[3245]: stages.py[DEBUG]: Running module power-state-change (<module 'cloudinit.config.cc_power_state_change' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_power_state_change.pyc'>) with frequency once-per-instance
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_power_state_change - wb: [644] 20 bytes
Sep 27 14:34:38 cloud-init[3245]: helpers.py[DEBUG]: Running config-power-state-change using lock (<FileLock using file '/var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_power_state_change'>)
Sep 27 14:34:38 cloud-init[3245]: cc_power_state_change.py[DEBUG]: no power_state provided. doing nothing
Sep 27 14:34:38 cloud-init[3245]: cloud-init[DEBUG]: Ran 9 modules with 1 failures
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Creating symbolic link from '/run/cloud-init/result.json' => '../../var/lib/cloud/data/result.json'
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Reading from /proc/uptime (quiet=False)
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Read 15 bytes from /proc/uptime
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: cloud-init mode 'modules' took 435.260 seconds (435.05)
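
The key lines in a log like this are easy to miss among the DEBUG output. Below is a minimal Python sketch for pulling out which user-data script failed, its exit code, and how long it ran before failing. The function name `find_failed_scripts` and the regular expressions are illustrative assumptions based only on the line format in the excerpt above; other cloud-init versions may log differently. The log has no year, so a placeholder year is used purely to compute durations.

```python
import re
from datetime import datetime

# Assumed line format, taken from the excerpt above, e.g.
# "Sep 27 14:27:23 cloud-init[3245]: util.py[DEBUG]: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) cloud-init\[\d+\]: "
    r"(?P<src>[\w.-]+)\[(?P<level>\w+)\]: (?P<msg>.*)$"
)
START_RE = re.compile(r"Running command \['(?P<path>[^']+)'")
FAIL_RE = re.compile(r"Failed running (?P<path>\S+) \[(?P<code>-?\d+)\]")


def parse_ts(text):
    # Placeholder year: only differences between timestamps are used.
    return datetime.strptime("2019 " + text, "%Y %b %d %H:%M:%S")


def find_failed_scripts(lines):
    """Return (path, exit_code, seconds_running) for each failed script."""
    started = {}      # script path -> time cloud-init launched it
    seen = set()      # failures are logged twice (WARNING and DEBUG)
    failures = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # traceback and continuation lines carry no prefix
        ts = parse_ts(m.group("ts"))
        msg = m.group("msg")
        start = START_RE.search(msg)
        if start:
            started[start.group("path")] = ts
            continue
        fail = FAIL_RE.search(msg)
        if fail and fail.group("path") not in seen:
            path = fail.group("path")
            seen.add(path)
            began = started.get(path)
            elapsed = (ts - began).total_seconds() if began else None
            failures.append((path, int(fail.group("code")), elapsed))
    return failures


# Three lines copied from the log above.
SAMPLE = [
    "Sep 27 14:27:23 cloud-init[3245]: util.py[DEBUG]: Running command ['/var/lib/cloud/instance/scripts/part-002'] with allowed return codes [0] (shell=True, capture=False)",
    "Sep 27 14:34:38 cloud-init[3245]: util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-002 [1]",
    "Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Failed running /var/lib/cloud/instance/scripts/part-002 [1]",
]

for path, code, seconds in find_failed_scripts(SAMPLE):
    took = "unknown" if seconds is None else f"{seconds:.0f}s"
    print(f"{path} exited with code {code} after {took}")
```

For the excerpt above this reports that part-002 exited with code 1 after 435 seconds, which accounts for nearly all of the 435.260 seconds cloud-init spent in 'modules' mode. Because the script was run with capture=False, Stdout and Stderr are empty in the traceback; the script's own output is most likely in /var/log/cloud-init-output.log.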

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 21 (9 by maintainers)

Most upvoted comments

Yes, I tested without a custom AMI, using Amazon Linux. I spun up a machine, added the necessary tools, and then generated an image. The image isn’t public at the moment. It also seems to be an issue with the Intel MPI repo: the imported key wasn’t being verified properly.

On Mon, Sep 30, 2019 at 4:12 AM Enrico Usai notifications@github.com wrote:

Hi @medcelerate, thank you for your analysis and your logs.

I see you are using a custom_ami. It could be a problem related to your custom ami.

  1. Did you test without the custom_ami parameter?
  2. Which process (https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_02_ami_customization.html) did you follow to build your custom AMI?
  3. What is the source AMI ID of your AMI (if it is public)?

Thank you

@medcelerate @jflournoy There’s an issue with the GPG key required for Intel MPI; we’re currently working on a fix.

Hi @medcelerate, to understand the cause of your issue, we need more information:

  • The configuration file, with any credentials or personal data removed.
  • The /var/log/cfn-init.log, /var/log/cloud-init.log and /var/log/cloud-init-output.log files from the Master node.
  • If a compute node was terminated due to a failure, there will be a directory /home/logs/compute. Attach one of the instance-id.tar.gz archives from that directory.
  • /var/log/nodewatcher from the Compute node, and /var/log/jobwatcher and /var/log/sqswatcher from the Master node.
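
Gathering these files by hand is error-prone, so here is a minimal Python sketch that bundles whichever of the requested logs exist into one archive. The file paths come from the list above; the helper name `bundle_logs` and the two list names are hypothetical, and this is not part of ParallelCluster itself.

```python
import tarfile
from pathlib import Path

# Paths named in the request above, grouped by the node they live on.
MASTER_LOGS = [
    "/var/log/cfn-init.log",
    "/var/log/cloud-init.log",
    "/var/log/cloud-init-output.log",
    "/var/log/jobwatcher",
    "/var/log/sqswatcher",
]
COMPUTE_LOGS = ["/var/log/nodewatcher"]


def bundle_logs(paths, archive_path):
    """Add every existing file in `paths` to a gzip tarball.

    Returns (added, missing) so the caller can see what was skipped.
    """
    added, missing = [], []
    with tarfile.open(archive_path, "w:gz") as tar:
        for raw in paths:
            path = Path(raw)
            if path.is_file():
                # Store under the bare file name to keep the archive flat.
                tar.add(str(path), arcname=path.name)
                added.append(raw)
            else:
                missing.append(raw)
    return added, missing
```

On the Master node, a call such as `bundle_logs(MASTER_LOGS, "master-logs.tar.gz")` would produce a single file to attach; reading files under /var/log may require root. The configuration file is deliberately left out of the lists so that it can be reviewed and stripped of credentials by hand before sharing.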