dask-jobqueue: SLURM cluster fails with unrecognized option '--parsable'
I’m trying out a Pangeo deployment on our local HPC system at UAlbany, which uses SLURM, basically following these instructions from the Pangeo documentation.
Running this code from within a Jupyter notebook:
from dask_jobqueue import SLURMCluster
cluster = SLURMCluster(processes=4,
                       cores=4,
                       memory="16GB",
                       walltime="01:00:00",
                       queue="snow-1")
cluster.scale(4)
fails with the following:
tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x2b8d1bfb7bf8>, 4)
Traceback (most recent call last):
  File "/network/rit/home/br546577/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/ioloop.py", line 758, in _run_callback
    ret = callback()
  File "/network/rit/home/br546577/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "/network/rit/home/br546577/miniconda3/envs/pangeo/lib/python3.6/site-packages/dask_jobqueue/core.py", line 416, in scale_up
    self.start_workers(n - self._count_active_and_pending_workers())
  File "/network/rit/home/br546577/miniconda3/envs/pangeo/lib/python3.6/site-packages/dask_jobqueue/core.py", line 330, in start_workers
    out = self._submit_job(fn)
  File "/network/rit/home/br546577/miniconda3/envs/pangeo/lib/python3.6/site-packages/dask_jobqueue/core.py", line 322, in _submit_job
    return self._call(shlex.split(self.submit_command) + [script_filename])
  File "/network/rit/home/br546577/miniconda3/envs/pangeo/lib/python3.6/site-packages/dask_jobqueue/core.py", line 383, in _call
    cmd_str, out, err))
RuntimeError: Command exited with non-zero exit code.
Exit code: 1
Command:
sbatch --parsable /tmp/tmpo8zdikq3.sh
stdout:
stderr:
sbatch: unrecognized option '--parsable'
Try "sbatch --help" for more information
I generated the same error message by running
sbatch --parsable
directly on the command line.
It’s possible that this is because we are running a very old version of SLURM:
[br546577@snow-23 ~]$ sbatch --version
slurm 2.5.1
Workarounds?
@brian-rose - that actually looks right.
Why the workers aren’t starting appears to be unrelated to your original issue. From here, I suggest working through https://dask-jobqueue.readthedocs.io/en/latest/debug.html. In particular, the cluster.job_script() method seems to be very useful for understanding what jobqueue is doing and how it is interfacing with your scheduler.
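For example, a quick sketch reusing the parameters from the question above (the exact script printed will vary with your dask-jobqueue version):

from dask_jobqueue import SLURMCluster

# Creating the cluster object does not submit any jobs yet; job_script()
# just returns the batch script that scale() would hand to sbatch.
cluster = SLURMCluster(processes=4, cores=4, memory="16GB",
                       walltime="01:00:00", queue="snow-1")
print(cluster.job_script())  # shows the #SBATCH directives and the dask-worker command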
I googled a bit and found this commit, which seems to have first reached the 14.03.0-1 release, released on Mar 26, 2014.
The first thing I would suggest is asking your sys-admin whether there is a slight chance to update the SLURM install. Maybe unlikely, but I guess it’s worth a try.
A work-around in dask-jobqueue would be to not use --parsable and instead get the job id from the stdout produced by sbatch the_temporary_script.sh plus a regex (sketched below). A PR doing that would be more than welcome! You may want to look at https://github.com/dask/dask-jobqueue/pull/45, which added --parsable in dask-jobqueue, and the reasons that motivated the change. IIRC the main reason was that it is cleaner to avoid post-processing in dask-jobqueue as much as possible, but of course we did not imagine that --parsable would be a problem in very old SLURM installs …
Another thing to draw inspiration from is IPython.parallel and how they get the jobid from the submit command output: https://github.com/ipython/ipyparallel/blob/6.1.1/ipyparallel/apps/launcher.py.
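For illustration, here is a minimal sketch of that regex-based workaround (not dask-jobqueue's actual implementation; the helper name submit_and_get_job_id is hypothetical, and it assumes sbatch prints its usual "Submitted batch job <id>" line):

import re
import subprocess

def submit_and_get_job_id(script_filename):
    # Submit without --parsable and parse the job id out of sbatch's stdout,
    # which sbatch typically prints as "Submitted batch job 12345".
    out = subprocess.check_output(["sbatch", script_filename]).decode()
    match = re.search(r"Submitted batch job (\d+)", out)
    if match is None:
        raise RuntimeError("Could not parse job id from sbatch output: %r" % out)
    return match.group(1)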