dask-jobqueue: AssertionError (assert count > 0) in SLURMCluster._adapt

When I run a SLURMCluster with adapt(), I sometimes get the following crash. It is not deterministic, and I have not found a way to trigger it more reliably.

tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7f59359aed90>, <Future finished exception=AssertionError()>)
Traceback (most recent call last):
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/tornado/ioloop.py", line 758, in _run_callback
    ret = callback()
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/tornado/stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/tornado/ioloop.py", line 779, in _discard_future_result
    future.result()
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/distributed/deploy/adaptive.py", line 334, in _adapt
    workers = yield self._retire_workers(workers=recommendations['workers'])
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/distributed/deploy/adaptive.py", line 242, in _retire_workers
    close_workers=True)
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/distributed/scheduler.py", line 2800, in retire_workers
    n=1, delete=False)
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 1147, in run
    yielded = self.gen.send(value)
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/distributed/scheduler.py", line 2613, in replicate
    assert count > 0
AssertionError
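
The traceback shows the adaptive scale-down path: _adapt calls _retire_workers, which calls retire_workers, which calls replicate with n=1, where `assert count > 0` fails. The following toy model sketches one way such a check could trip. It is an assumption about the failure mode, not a confirmed diagnosis and not the distributed source code: the function `copies_to_schedule`, its parameters, and the `branching_factor` default are all hypothetical. The idea is that if the number of copies to request is capped by the number of workers currently holding a key, then a key whose only holder has already disappeared gives a count of zero.

```python
# Toy model only. This is NOT the distributed scheduler source; all names
# here are hypothetical and exist purely to illustrate the arithmetic.

def copies_to_schedule(n, holders, candidates, branching_factor=2):
    """Return how many new copies of one key to request in a single round.

    n                -- desired number of replicas among `candidates`
    holders          -- set of workers that currently hold the key
    candidates       -- set of workers allowed to keep the key
    branching_factor -- assumed cap on copies sourced from each holder
    """
    n_missing = n - len(holders & candidates)
    if n_missing <= 0:
        # Enough replicas already live on the candidate workers.
        return 0
    # Each existing holder can only serve a limited number of copies.
    count = min(n_missing, branching_factor * len(holders))
    # With no holders left, count is 0 and this assertion fails.
    assert count > 0
    return count


if __name__ == "__main__":
    # Key held by a worker being retired, one surviving candidate.
    print(copies_to_schedule(1, {"w1"}, {"w2"}))
    # Key whose holder has already vanished (for example, the job was killed).
    try:
        copies_to_schedule(1, set(), {"w2"})
    except AssertionError:
        print("AssertionError: no worker holds the key any more")
```

Under this assumption, the crash would depend on a worker disappearing between the moment it is recommended for retirement and the moment its data is replicated, which would be consistent with the non-deterministic behaviour described above.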

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 33 (18 by maintainers)

Most upvoted comments

@ogrisel do you still encounter this bug?