legion: Rank-per-node slower than rank-per-socket with OpenMP workload

Hi @streichler, I’ve been working with @rohany on running some experiments with DISTAL on Sapling. We’ve been using the TBLIS tensor contraction library, which internally uses OpenMP, but have been seeing some strange performance behavior.

Compare:

$ mpirun -H c0001:2,c0002:2,c0003:2,c0004:2 --bind-to socket \
    /scratch2/tigu/taco/distal/build/bin/chemTest-05-20 -n 99 -tblis -gx 4 -gy 2 \
    -ll:ocpu 1 -ll:othr 9 -ll:util 1 -ll:nsize 10G -ll:ncsize 0 \
    -lg:prof 8 -lg:prof_logfile prof99-socket-%.log.gz

  # (Side note: the above command prints a few "reservation cannot be satisfied" warnings;
  # that is the subject of a different issue that will be filed separately.)
  # Update: filed as https://github.com/StanfordLegion/legion/issues/1267

$ mpirun -H c0001,c0002,c0003,c0004 --bind-to none \
    /scratch2/tigu/taco/distal/build/bin/chemTest-05-20 -n 99 -tblis -gx 4 -gy 2 \
    -ll:ocpu 2 -ll:othr 9 -ll:util 1 -ll:nsize 10G -ll:ncsize 0 \
    -lg:prof 4 -lg:prof_logfile prof99-node-%.log.gz

The profiles are available at

It seems clear that the per-socket version is using its processors much more efficiently, but it’s not clear to us why. On a single node, running TBLIS/OpenMP directly is much faster (about 2×) than using mpirun to bind computation to each socket (using n=70 for a comparable problem size):

$ mpirun -H c0001:2 --bind-to socket \
    /scratch2/tigu/taco/distal/build/bin/chemTest-05-20 -n 70 -tblis -gx 2 -gy 1 \
    -ll:ocpu 1 -ll:othr 9 -ll:util 1 -ll:nsize 10G -ll:ncsize 0

$ /scratch2/tigu/taco/distal/build/bin/chemTest-05-20 -n 70 -tblis -gx 2 -gy 1 \
    -ll:ocpu 2 -ll:othr 9 -ll:util 1 -ll:nsize 10G -ll:ncsize 0
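As an aside (not part of the original report), one way to confirm where OpenMP threads actually land under each binding mode is OpenMP 5.0's `OMP_DISPLAY_AFFINITY`. A minimal sketch, assuming gcc with libgomp; the filename is invented for illustration:

```shell
# Hypothetical check: compile a trivial OpenMP program and ask the
# runtime to report each thread's affinity mask, to confirm whether
# --bind-to socket / --bind-to none behave as expected.
cat > affinity_check.c <<'EOF'
#include <omp.h>
int main(void) {
    #pragma omp parallel
    { /* empty region: spawning the team makes each thread report */ }
    return 0;
}
EOF
gcc -fopenmp affinity_check.c -o affinity_check

# Prints one line per thread, including the CPU set it may run on.
OMP_DISPLAY_AFFINITY=TRUE OMP_NUM_THREADS=4 ./affinity_check
```

Under `--bind-to socket` each thread's reported CPU set should be confined to one socket; under `--bind-to none` it will typically span the whole machine.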

Thanks in advance for your help!

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 18 (14 by maintainers)

Most upvoted comments

TBLIS developer here. If I am understanding correctly, (a) the comparison is between one MPI rank + OpenMP runtime per socket, with threads bound to that socket, versus two sort of “virtual” ranks per node with separate OpenMP runtimes and no thread binding, and (b) the former approach shows better performance?

TBLIS uses a pool of dynamically allocated memory blocks for packing tensor data during computation. New blocks are allocated when not enough are available (typically on the first contraction call) and then reused in later computations. Blocks are checked out of a linked-list structure guarded by a global lock (either an OpenMP lock or a spin-lock). In the unbound case above, if threads are free to migrate across sockets, a thread running on socket 0 can check out a memory block created by a thread on socket 1, or vice versa. This causes excessive memory traffic between sockets in a non-deterministic fashion.

Perhaps this is what you are seeing?