legion: Unable to run legion implementation

We implemented a 3D finite-volume simulation using Legion (v22.06.0) in C++. The solution produces correct results for arbitrary problem and partition sizes. We now want to run our implementation with Legion on a cluster. However, we are experiencing multiple problems and would appreciate help, as we found almost no documentation on multi-node execution.

Our cluster consists of 8 nodes. Each node has:

  • 2 Intel quad-core L5420 CPUs (8 cores per node)
  • 4 GB DDR2 RAM per core (32 GB per node)

Job submission is done via Slurm (Version 22.05.2).

Below are our questions:

  1. Is there any up-to-date documentation on multi-node execution? We found the following:

    However, neither link really addresses our problem.

  2. According to the profiling machine configuration page, -ll:cpu is used to specify the number of processors to allocate on each node. In our configuration with 8 nodes and 8 cores per node, we chose -ll:cpu 7 and -ll:util 1. This produces the reservation error below, i.e. Realm behaves as if the node were oversubscribed, even though it should not be (see also the flag sketch after this list).

    • Error: [0 - 7f232437d780] {4}{threads}: reservation ('CPU proc 1d00000000000008') cannot be satisfied
    • According to our tests, the error vanishes if the sum of both parameters is smaller than the number of cores per socket (in our configuration, e.g. -ll:cpu 3 and -ll:util 1). However, according to the documentation, typically only one Legion instance runs per node (see Debugging).
    • To rule out a faulty Slurm configuration, we also executed the program on the VSC-5 cluster with adjusted parameters, but obtained the same result.
  3. Using -ll:show_rsrv, we obtain output like the following: CPU proc 1d00000000000002: allocated <>. Note the missing CPU ID after "allocated". Which IDs are listed here (i.e. are these the processor IDs from /proc/cpuinfo)? Any suggestions why they are missing? And how does Legion derive the internal ID 1d00000000000002?

    • The same occurs on the VSC-5 cluster with adjusted parameters.
  4. -lg:prof <n> and -lg:prof_logfile ./prof_%.gz write to a single output file, regardless of the specified number of nodes (8 in our case).

    • The resulting file cannot be parsed with the legion_prof.py script (in the multi-node configuration), probably because multiple processes write to the same file in parallel. See the sketch below the jobscript for what we expected instead.
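
To make question 2 concrete, this is the per-node core budget we are assuming (our interpretation, not something we found stated explicitly): every processor requested via -ll:cpu and every utility thread requested via -ll:util reserves a dedicated core, so their sum (plus any I/O processors from -ll:io) has to fit into the cores Slurm grants to the task. Below is a flag combination we would expect to fit on our 8-core nodes, plus the /proc/cpuinfo check we use for comparison; this is only a sketch under that assumption.

# Assumed core budget per node (our untested interpretation, not from the docs):
#   -ll:cpu  <n>  -> n dedicated cores for application (LOC_PROC) processors
#   -ll:util <n>  -> n dedicated cores for utility processors
#   -ll:io   <n>  -> n dedicated cores for I/O processors (0 if unused)
# Example we would expect to fit into 8 cores per node if the above holds:
srun -N8 -n8 --cpus-per-task=8 ./cronos-amr ./configuration/shock-tube-integration.toml -ll:cpu 6 -ll:util 1 -ll:show_rsrv

# OS view of the cores, for comparison with the -ll:show_rsrv output:
grep -E "^processor|^physical id|^core id" /proc/cpuinfo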

Jobscript on our cluster (submitted via sbatch):

#!/bin/bash

#SBATCH -J cronos-amr-parallel
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --output=output_%N.out
#SBATCH --error=error_%N.out
#SBATCH --nodes=8
#SBATCH --exclusive

srun -N8 -n8 --cpus-per-task 8 --output output_%N.log ./cronos-amr ./configuration/shock-tube-integration.toml -ll:cpu 7 -ll:util 1 -ll:show_rsrv -lg:prof 8 -lg:prof_logfile ./prof_%.gz
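
Regarding question 4: our expectation (based on our reading of the profiling page, and possibly wrong) is that the % in -lg:prof_logfile is expanded per node, so that every node writes its own log, and the per-node files are then passed to legion_prof.py together. Roughly:

# Expected per-node profiler logs if % is expanded to the node id:
#   prof_0.gz prof_1.gz ... prof_7.gz
# Combine them afterwards into one profile:
python3 legion_prof.py prof_*.gz
# (we assume legion_prof.py then writes its report into a legion_prof/ directory)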

Logs and prof output can be found here: logs.zip

(We know that for real performance numbers we should implement an environment-specific mapper.)

It would be great if you could help us; we would like to compare our Legion implementation with our OpenMP and MPI implementations.

Thanks 🙂

About this issue

  • State: open
  • Created 2 years ago
  • Comments: 25 (16 by maintainers)

Most upvoted comments

Sorry, alu=<1> means that the core (core 0, in this case) shares an alu with core 1. On a P9 system, you’ll see things like:

core 0 { ids=<0> alu=<1,2,3> … }

because it has 4 “hyperthreads” (can’t remember what IBM actually calls them) per CPU core.
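
A quick, Legion-independent way to cross-check this on Linux is to ask the kernel which hardware threads share a physical core; on a CPU without SMT (such as the L5420 above), each core should list only itself. A small sketch:

# Hardware threads sharing a core with CPU 0 (prints just "0" on a non-SMT CPU):
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
# Full topology overview (requires hwloc to be installed):
lstopo --of console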