hyperqueue: Multiple workers-per-allocation not working

From what I understood, --workers-per-alloc <worker_count> is used to run multiple workers across multiple nodes, but it doesn't seem to behave that way. For example, if I want to run 5 calculations on one node, I use an automatic allocation with 1 worker per allocation, which then launches 5 jobs (1 task per job) on 1 node. That works as expected. Now if I create an allocation queue with 2 workers per allocation, I expected my 10 calculations to run on 2 nodes, with each node having its own worker. Instead, 2 workers occupy 2 nodes, but none of the 10 calculations launches, and each reports the following error:

srun: Job 748536 step creation temporarily disabled, retrying (Requested nodes are busy)

Please note that the nodes are not in fact busy: the only Slurm job running on these 2 nodes is the single ‘big’ one, and there is no process running on either node when I check with top or htop.

I may be completely misunderstanding the purpose of the --workers-per-alloc <worker_count> option. In that case, would it be possible to achieve what I am trying to do in some other way? Or is this a use case that is not planned to be supported?
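
For reference, my setup looks roughly like this (the partition name, time limit, and script name are just placeholders for what I actually use):

# automatic allocation queue that asks Slurm for 2 workers per allocation
hq alloc add slurm --time-limit 1h --workers-per-alloc 2 -- --partition=standard

# each of the 10 calculations is submitted as its own job (1 task per job)
hq submit ./calc.sh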

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 19

Most upvoted comments

I was talking about multi-node MPI specifically. Nothing stops you from launching MPI yourself, but currently HyperQueue runs each task on a single node only, so it cannot guarantee that multiple nodes will be available to a task at the same time. There is support for multi-node tasks, but it is currently heavily experimental and WIP.

Using MPI on a single node should be completely fine; you can just specify how many cores your task needs.
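
As a rough sketch (the script name and core count are just examples), a single-node MPI task could be submitted like this:

# reserve 8 cores on a single worker (i.e. a single node) for this task
hq submit --cpus=8 ./run_mpi.sh

Inside run_mpi.sh you would then launch mpirun with a matching number of ranks.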

Regarding the issue that tasks may not be killed properly when a task is cancelled or a worker is killed, we are aware of it (https://github.com/It4innovations/hyperqueue/issues/431).

In general, please report any issues that you find (unless they are a strict duplicate of some existing issue) 😃