dask-jobqueue: SLURM cluster fails with unrecognized option '--parsable'
I’m trying out a Pangeo deployment on our local HPC system at UAlbany, which uses SLURM, basically following these instructions from the Pangeo documentation.
Running this code from within a Jupyter notebook:
from dask_jobqueue import SLURMCluster
cluster = SLURMCluster(processes=4,
                       cores=4,
                       memory="16GB",
                       walltime="01:00:00",
                       queue="snow-1")
cluster.scale(4)
fails with the following:
tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x2b8d1bfb7bf8>, 4)
Traceback (most recent call last):
  File "/network/rit/home/br546577/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/ioloop.py", line 758, in _run_callback
    ret = callback()
  File "/network/rit/home/br546577/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "/network/rit/home/br546577/miniconda3/envs/pangeo/lib/python3.6/site-packages/dask_jobqueue/core.py", line 416, in scale_up
    self.start_workers(n - self._count_active_and_pending_workers())
  File "/network/rit/home/br546577/miniconda3/envs/pangeo/lib/python3.6/site-packages/dask_jobqueue/core.py", line 330, in start_workers
    out = self._submit_job(fn)
  File "/network/rit/home/br546577/miniconda3/envs/pangeo/lib/python3.6/site-packages/dask_jobqueue/core.py", line 322, in _submit_job
    return self._call(shlex.split(self.submit_command) + [script_filename])
  File "/network/rit/home/br546577/miniconda3/envs/pangeo/lib/python3.6/site-packages/dask_jobqueue/core.py", line 383, in _call
    cmd_str, out, err))
RuntimeError: Command exited with non-zero exit code.
Exit code: 1
Command:
sbatch --parsable /tmp/tmpo8zdikq3.sh
stdout:
stderr:
sbatch: unrecognized option '--parsable'
Try "sbatch --help" for more information
I generated the same error message by running
sbatch --parsable
directly on the command line.
It’s possible that this is because we are running a very old version of SLURM:
[br546577@snow-23 ~]$ sbatch --version
slurm 2.5.1
Workarounds?
@brian-rose - that actually looks right.
Why the workers aren’t starting appears to be unrelated to your original issue. From here, I suggest working through https://dask-jobqueue.readthedocs.io/en/latest/debug.html. In particular, the cluster.job_script() method seems to be very useful for understanding what jobqueue is doing and how it is interfacing with your scheduler.
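For example, a quick sketch reusing the parameters from the question above (the exact script printed will vary with your dask-jobqueue version):

from dask_jobqueue import SLURMCluster

# Creating the cluster object does not submit any jobs yet; job_script()
# just returns the batch script that scale() would hand to sbatch.
cluster = SLURMCluster(processes=4, cores=4, memory="16GB",
                       walltime="01:00:00", queue="snow-1")
print(cluster.job_script())  # shows the #SBATCH directives and the dask-worker command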
I googled a bit and found this commit, which seems to have first reached the 14.03.0-1 release, released on Mar 26, 2014.
The first thing I would suggest is asking your sys-admin whether there is a slight chance to update the SLURM install. Maybe unlikely, but I guess it’s worth a try.
A work-around in dask-jobqueue would be to not use --parsable and instead get the job id from the stdout produced by sbatch the_temporary_script.sh plus a regex (sketched below). A PR doing that would be more than welcome! You may want to look at https://github.com/dask/dask-jobqueue/pull/45, which added --parsable in dask-jobqueue, and the reasons that motivated the change. IIRC the main reason was that it is cleaner to avoid post-processing in dask-jobqueue as much as possible, but of course we did not imagine that --parsable would be a problem in very old SLURM installs …
Another thing to draw inspiration from is IPython.parallel and how they get the jobid from the submit command output: https://github.com/ipython/ipyparallel/blob/6.1.1/ipyparallel/apps/launcher.py.
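For illustration, here is a minimal sketch of that regex-based workaround (not dask-jobqueue's actual implementation; the helper name submit_and_get_job_id is hypothetical, and it assumes sbatch prints its usual "Submitted batch job <id>" line):

import re
import subprocess

def submit_and_get_job_id(script_filename):
    # Submit without --parsable and parse the job id out of sbatch's stdout,
    # which sbatch typically prints as "Submitted batch job 12345".
    out = subprocess.check_output(["sbatch", script_filename]).decode()
    match = re.search(r"Submitted batch job (\d+)", out)
    if match is None:
        raise RuntimeError("Could not parse job id from sbatch output: %r" % out)
    return match.group(1)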