dask-jobqueue: SLURM cluster fails with unrecognized option '--parsable'

I’m trying out a Pangeo deployment on our local HPC system at UAlbany, which uses SLURM. Basically following these instructions from the Pangeo documentation.

Running this code from within a Jupyter notebook:

from dask_jobqueue import SLURMCluster
cluster = SLURMCluster(processes=4,
                       cores=4,
                       memory="16GB",
                       walltime="01:00:00",
                       queue="snow-1")
cluster.scale(4)

fails with the following:

tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x2b8d1bfb7bf8>, 4)
Traceback (most recent call last):
  File "/network/rit/home/br546577/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/ioloop.py", line 758, in _run_callback
    ret = callback()
  File "/network/rit/home/br546577/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "/network/rit/home/br546577/miniconda3/envs/pangeo/lib/python3.6/site-packages/dask_jobqueue/core.py", line 416, in scale_up
    self.start_workers(n - self._count_active_and_pending_workers())
  File "/network/rit/home/br546577/miniconda3/envs/pangeo/lib/python3.6/site-packages/dask_jobqueue/core.py", line 330, in start_workers
    out = self._submit_job(fn)
  File "/network/rit/home/br546577/miniconda3/envs/pangeo/lib/python3.6/site-packages/dask_jobqueue/core.py", line 322, in _submit_job
    return self._call(shlex.split(self.submit_command) + [script_filename])
  File "/network/rit/home/br546577/miniconda3/envs/pangeo/lib/python3.6/site-packages/dask_jobqueue/core.py", line 383, in _call
    cmd_str, out, err))
RuntimeError: Command exited with non-zero exit code.
Exit code: 1
Command:
sbatch --parsable /tmp/tmpo8zdikq3.sh
stdout:

stderr:
sbatch: unrecognized option '--parsable'
Try "sbatch --help" for more information

I generated the same error message by running

sbatch --parsable

directly on the command line.

It’s possible that this is because we are running a very old version of SLURM:

[br546577@snow-23 ~]$ sbatch --version
slurm 2.5.1
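
For reference, a quick way to check from Python whether the installed sbatch knows about --parsable is to grep its own help output; this is a hypothetical snippet (not part of dask-jobqueue), assuming versions that support the flag list it under sbatch --help:

import subprocess

def sbatch_supports_parsable():
    # Grep sbatch's help text for the flag; assumes supported options are listed there.
    help_text = subprocess.run(["sbatch", "--help"],
                               stdout=subprocess.PIPE,
                               stderr=subprocess.STDOUT).stdout
    return b"--parsable" in help_text

print(sbatch_supports_parsable())  # presumably False on slurm 2.5.1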

Workarounds?

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 19 (12 by maintainers)

Most upvoted comments

@brian-rose - that actually looks right.

Why the workers aren’t starting appears to be unrelated to your original issue. From here, I suggest working through https://dask-jobqueue.readthedocs.io/en/latest/debug.html. In particular, the cluster.job_script() method seems to be very useful for understanding what jobqueue is doing and how it is interfacing with your scheduler.
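
For example, something like this (a minimal sketch reusing the arguments from the original post) shows exactly what dask-jobqueue would hand to SLURM:

from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(processes=4, cores=4, memory="16GB",
                       walltime="01:00:00", queue="snow-1")
# Print the generated job script before scaling, to see the #SBATCH header
# and the dask-worker command that will be submitted.
print(cluster.job_script())
# The submit command used under the hood (this is where --parsable comes from).
print(cluster.submit_command)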

It’s possible that this is because we are running a very old version of SLURM:

I googled a bit and found this commit, which seems to have landed in the 14-03-0-1 release, released on Mar 26, 2014.

The first thing I would suggest is asking your sysadmin whether there is any chance of updating the SLURM install. It may be unlikely, but I guess it's worth a try.

A workaround in dask-jobqueue would be to drop --parsable and extract the job id from the stdout produced by sbatch the_temporary_script.sh with a regex. A PR doing that would be more than welcome!
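
For the curious, here is a minimal sketch of that workaround (the helper name is made up; it assumes that sbatch without --parsable prints something like "Submitted batch job 12345"):

import re
import subprocess

def submit_and_parse_job_id(script_filename):
    # Submit without --parsable and recover the job id from the human-readable output.
    out = subprocess.check_output(["sbatch", script_filename]).decode()
    match = re.search(r"Submitted batch job (\d+)", out)
    if match is None:
        raise RuntimeError("Could not parse job id from: %r" % out)
    return match.group(1)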

You may want to look at https://github.com/dask/dask-jobqueue/pull/45, which added --parsable to dask-jobqueue, and at the reasons that motivated the change. IIRC the main reason was that it is cleaner to avoid post-processing of the scheduler output in dask-jobqueue as much as possible; of course, we did not imagine that --parsable would be a problem on very old SLURM installs …

Another option is to draw inspiration from IPython.parallel and how they get the job id from the submit command output in https://github.com/ipython/ipyparallel/blob/6.1.1/ipyparallel/apps/launcher.py.
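
Roughly, the pattern there is a per-scheduler job id regular expression applied to the submit command's output; a paraphrased sketch with illustrative names, not ipyparallel's actual code:

import re

class BatchSubmitter:
    # Subclasses for other schedulers could override this pattern.
    job_id_regexp = r"\d+"

    def parse_job_id(self, output):
        # Pull the first thing that looks like a job id out of the submit output.
        match = re.search(self.job_id_regexp, output)
        if match is None:
            raise ValueError("No job id found in output: %r" % output)
        return match.group()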