legion: Realm: slow data transposes on CPUs
I’ve recently implemented a version of HTR that uses various layouts of the same data to optimize the loop performance. There are currently two approaches that have been implemented in the solver:
Approach 1
All the instances managed by the runtime are ordered with the dimensions X, Y, Z, and each task requiring transposed data creates a DeferredBuffer and copies the data into the buffer using an OpenMP loop.
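The kind of copy loop Approach 1 relies on can be sketched roughly as below. This is only an illustration under stated assumptions, not the HTR code: plain `std::vector` stands in for both the runtime-managed instance and the DeferredBuffer, the function name `transpose_xyz_to_zxy` is hypothetical, and the target order (Z fastest, then X, then Y) is picked as one example of a transposed layout.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copy a field stored with X as the fastest-varying
// dimension (order X, Y, Z) into a scratch buffer stored with Z as the
// fastest-varying dimension (order Z, X, Y).
// std::vector is a stand-in for the instance and for the deferred buffer.
std::vector<double> transpose_xyz_to_zxy(const std::vector<double>& src,
                                         std::ptrdiff_t nx,
                                         std::ptrdiff_t ny,
                                         std::ptrdiff_t nz) {
    assert(static_cast<std::ptrdiff_t>(src.size()) == nx * ny * nz);
    std::vector<double> dst(src.size());

    // The two outer loops are shared among the OpenMP threads; the inner
    // loop over k writes contiguous elements of dst. Without OpenMP enabled
    // the pragma is ignored and the loop runs serially.
    #pragma omp parallel for collapse(2)
    for (std::ptrdiff_t j = 0; j < ny; ++j) {
        for (std::ptrdiff_t i = 0; i < nx; ++i) {
            for (std::ptrdiff_t k = 0; k < nz; ++k) {
                const std::ptrdiff_t src_idx = i + nx * (j + ny * k);  // X, Y, Z
                const std::ptrdiff_t dst_idx = k + nz * (i + nx * j);  // Z, X, Y
                dst[dst_idx] = src[src_idx];
            }
        }
    }
    return dst;
}
```

Every task that needs the transposed layout would repeat a copy like this, which is why the number of copies grows with the number of tasks rather than with the number of layouts.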
Approach 2
Set the layout constraints for each task so that Realm performs the copies from instances with the following dimension orders: (X, Y, Z), (Y, X, Z), and (Z, X, Y).
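To make concrete what a dimension order means for the memory layout, here is a small sketch of how an order maps to strides and linear offsets. It assumes that the first dimension listed in an order is the fastest-varying one; the helper names `strides_for_order` and `linear_offset` are hypothetical and are not part of Legion or Realm.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper: compute the stride of each dimension for a given
// dimension order. Dimensions are numbered 0 = X, 1 = Y, 2 = Z, and
// order[0] is assumed to be the fastest-varying dimension.
std::array<std::size_t, 3> strides_for_order(
        const std::array<int, 3>& order,
        const std::array<std::size_t, 3>& extent) {
    std::array<std::size_t, 3> stride{};
    std::size_t running = 1;
    for (int d : order) {
        stride[d] = running;   // distance between neighbours along dimension d
        running *= extent[d];  // the next dimension steps over this whole extent
    }
    return stride;
}

// Linear offset of the point (idx[0], idx[1], idx[2]) = (x, y, z).
std::size_t linear_offset(const std::array<std::size_t, 3>& idx,
                          const std::array<std::size_t, 3>& stride) {
    return idx[0] * stride[0] + idx[1] * stride[1] + idx[2] * stride[2];
}
```

For example, with extents 2 x 3 x 4 the order (X, Y, Z) would be passed as `{0, 1, 2}`, (Y, X, Z) as `{1, 0, 2}`, and (Z, X, Y) as `{2, 0, 1}`. A loop that sweeps along Y or Z then walks unit-stride memory in the matching instance.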
Considering that at least two tasks reuse the same data in a single layout, Approach 1 makes many more copies with transposes and is expected to be slower. Approach 2 reuses the same “transposed” instance for multiple tasks.
I’ve tested the two implementations solving a three-periodic flow on a 128^3 grid with the following wall times:
- Approach 1: 37.935 s
- Approach 1 (deactivating the OpenMP optimization for the loops that copy data to the deferred buffer): 77.906 s
- Approach 2: 136.324 s
PS: @elliottslaughter could you please add this issue to #1032 with low priority?
About this issue
- State: open
- Created a year ago
- Comments: 28 (18 by maintainers)
I apologize as I have re-posted the same comment under:
But yeah, the numbers are there, and for the problem size in your very first post here, i.e. the 128^3 grid, we now get about 6.2 seconds on sapling2. I have similar numbers for ChannelFlow. I will be offline and will likely be submitting the patch early next week.