sos: SoS submitter hang on cluster

I’ve seen the job submitter hang in several different ways. “Hang” here means that the job queue is empty but SoS refuses to move on. It looks stuck on the current job submission, yet ps -A | grep sos shows nothing. I can interrupt it from the keyboard with Ctrl-C.

There are now two types of hang I can reliably reproduce. Hopefully describing them will let you build an MWE for your cluster:

  1. When my job exceeds the walltime
  2. When the directory I specify for the err and out files does not exist, e.g.:
      #SBATCH --output={cur_dir}/non_existing_dir/{job_name}.out
      #SBATCH --error={cur_dir}/non_existing_dir/{job_name}.err

I hope this is enough to reproduce it.
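
For the second case, a rough pre-submission check can at least confirm whether a missing log directory is involved. The sketch below is not part of SoS; missing_log_dirs is a hypothetical helper that assumes the job template has already been rendered (so {cur_dir} and {job_name} are filled in), that only the long --output/--error forms are used, and that the paths contain no spaces.

```python
import os
import re
import tempfile
from typing import List

# Matches "#SBATCH --output=<path>" and "#SBATCH --error=<path>" lines.
# Assumption: long option names only, and paths without whitespace.
SBATCH_LOG_RE = re.compile(r"^\s*#SBATCH\s+--(?:output|error)=(\S+)", re.MULTILINE)


def missing_log_dirs(job_script: str) -> List[str]:
    """Return the parent directories of --output/--error paths that do not exist.

    Hypothetical helper for illustration; job_script is the rendered job script text.
    """
    missing = []
    for path in SBATCH_LOG_RE.findall(job_script):
        parent = os.path.dirname(path) or "."
        if not os.path.isdir(parent) and parent not in missing:
            missing.append(parent)
    return missing


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as cur_dir:
        log_dir = os.path.join(cur_dir, "non_existing_dir")
        script = (
            "#!/bin/bash\n"
            f"#SBATCH --output={log_dir}/job.out\n"
            f"#SBATCH --error={log_dir}/job.err\n"
        )
        print(missing_log_dirs(script))  # the directory is reported once
        os.makedirs(log_dir)
        print(missing_log_dirs(script))  # nothing is reported after creation
```

Running a check like this on the submitting side before the job is handed to the scheduler would show whether the hang coincides with a missing directory; it does not explain why the submitter fails to notice that the job is gone.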

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 19 (19 by maintainers)

Most upvoted comments

Yes, that is my suggestion. Perhaps $HOME, which leaves the expansion of HOME to the shell, is better, because SoS would expand {home_dir} to a host-specific full directory that includes the user name, and that is arguably better replaced with a generic $HOME.