legion: Realm: slow data transposes on CPUs

I’ve recently implemented a version of HTR that uses various layouts of the same data to optimize the loop performance. There are currently two approaches that have been implemented in the solver:

Approach 1: All the instances managed by the runtime are ordered with the dimensions X, Y, Z, and each task requiring transposed data creates a DeferredBuffer and copies the data into the buffer using an OpenMP loop.

Approach 2: Set the layout constraints for each task so that Realm performs the copies from instances with the following dimension orders: (X, Y, Z), (Y, X, Z), (Z, X, Y).

Considering that at least two tasks reuse the same data in a single layout, Approach 1 makes many more copies with transposes and is expected to be slower. Approach 2 reuses the same “transposed” instance for multiple tasks.
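To make the cost being compared concrete, here is a minimal, self-contained sketch of the kind of transposing copy that Approach 1 performs in every task. It is not the HTR or Realm code: the helper names `offset` and `transpose_copy` are hypothetical, a plain `std::vector` stands in for the DeferredBuffer, and I am assuming that a dimension order lists the fastest-varying dimension first.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: linear offset of point p = (x, y, z) in a dense buffer
// whose dimensions are stored in `order`, fastest-varying first
// (0 = X, 1 = Y, 2 = Z). This ordering convention is an assumption.
inline std::size_t offset(const std::array<int, 3>& order,
                          const std::array<std::size_t, 3>& extent,
                          const std::array<std::size_t, 3>& p) {
  std::size_t off = 0, stride = 1;
  for (int d : order) {
    off += p[d] * stride;
    stride *= extent[d];
  }
  return off;
}

// Hypothetical stand-in for the per-task copy of Approach 1: copies `src`
// (stored in src_order) into a new buffer stored in dst_order.
std::vector<double> transpose_copy(const std::vector<double>& src,
                                   const std::array<std::size_t, 3>& extent,
                                   const std::array<int, 3>& src_order,
                                   const std::array<int, 3>& dst_order) {
  assert(src.size() == extent[0] * extent[1] * extent[2]);
  std::vector<double> dst(src.size());
  const std::ptrdiff_t nx = static_cast<std::ptrdiff_t>(extent[0]);
  const std::ptrdiff_t ny = static_cast<std::ptrdiff_t>(extent[1]);
  const std::ptrdiff_t nz = static_cast<std::ptrdiff_t>(extent[2]);
  // With -fopenmp the two outer loops are split across threads; without it
  // the pragma is ignored and the copy runs serially.
  #pragma omp parallel for collapse(2)
  for (std::ptrdiff_t z = 0; z < nz; ++z) {
    for (std::ptrdiff_t y = 0; y < ny; ++y) {
      for (std::ptrdiff_t x = 0; x < nx; ++x) {
        const std::array<std::size_t, 3> p = {static_cast<std::size_t>(x),
                                              static_cast<std::size_t>(y),
                                              static_cast<std::size_t>(z)};
        dst[offset(dst_order, extent, p)] = src[offset(src_order, extent, p)];
      }
    }
  }
  return dst;
}
```

In Approach 1 a copy like this runs once per task that needs the transposed layout, whereas in Approach 2 the equivalent copy is issued by Realm and its result can be shared by every task that requests the same dimension order.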

I’ve tested the two implementations by solving a triply periodic flow on a 128^3 grid and measured the following wall times:

Approach 1: 37.935 s
Approach 1 (deactivating the OpenMP optimization for the loops that copy data to the deferred buffer): 77.906 s
Approach 2: 136.324 s

PS: @elliottslaughter could you please add this issue to #1032 with low priority?

About this issue

State: open
Created a year ago
Comments: 28 (18 by maintainers)

Most upvoted comments

Apologies, I have re-posted the same comment under:

https://github.com/StanfordLegion/legion/issues/1494#issuecomment-1614990868

But yeah, the numbers are there: for the problem size in your initial post here, i.e. the 128^3 grid, we now get about 6.2 seconds on sapling2. I have similar numbers for ChannelFlow. I will be offline and will likely submit the patch early next week.

apryakhin on Jun 30, 2023