iree: Dispatch workgroups along X only to avoid exceeding the maximum number of blocks on GPU

What happened?

Both CUDA and Vulkan may have a low maximum (65535) on the number of workgroups along the Y and Z dimensions. Even though it is not a complete solution, dispatching workgroups along X only should fix the problem for a while. This is what XLA currently does, and it appears to have been working for some time.
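
For illustration, a minimal C++ sketch of the X-only dispatch idea (the helper name and signature are hypothetical, not IREE APIs): the 3D workgroup count is flattened into X so the total number of workgroups is unchanged while Y and Z stay at 1.

```cpp
#include <array>
#include <cstdint>

// Illustrative only: flatten a 3D workgroup count into an X-only dispatch so
// the Y and Z limits (commonly 65535) are never exceeded. The helper name is
// hypothetical, not an IREE API.
std::array<uint64_t, 3> flattenToX(std::array<uint64_t, 3> counts) {
  // Total number of workgroups stays the same; only the shape changes.
  uint64_t total = counts[0] * counts[1] * counts[2];
  return {total, 1, 1};
}
```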

To do that, we can change the TileAndDistributeToWorkgroup pass to decide how many dimensions to distribute along. Right now this is hardcoded to 3 here; we should turn kNumMaxParallelDims into a pass option and set it to 1 for both the LLVMGPU and Vulkan backends.
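
As a rough sketch of the proposed change, MLIR's standard pass-option mechanism could replace the hardcoded constant; the pass and option names below are hypothetical, not the actual IREE code:

```cpp
#include "mlir/Pass/Pass.h"

namespace {
// Hypothetical sketch, not the actual IREE pass: shows how the hardcoded
// kNumMaxParallelDims could become a standard MLIR pass option.
struct TileAndDistributeDemoPass
    : public mlir::PassWrapper<TileAndDistributeDemoPass,
                               mlir::OperationPass<>> {
  MLIR_DEFINE_EXPLICIT_INTERNAL_INLINE_TYPE_ID(TileAndDistributeDemoPass)

  Option<unsigned> maxParallelDims{
      *this, "max-parallel-dims",
      llvm::cl::desc("Max number of dimensions to distribute workgroups along"),
      llvm::cl::init(3)};

  llvm::StringRef getArgument() const override {
    return "tile-and-distribute-demo";
  }

  void runOnOperation() override {
    // GPU backends (LLVMGPU, Vulkan/SPIR-V) would run with
    // max-parallel-dims=1 so all workgroups land on the X dimension.
  }
};
} // namespace
```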

Steps to reproduce your issue

No response

What component(s) does this issue relate to?

No response

Version information

No response

Additional context

No response

About this issue

  • State: closed
  • Created a year ago
  • Comments: 15 (9 by maintainers)

Most upvoted comments

Mid- to long-term we really want to use the workgroup IDs semantically to denote locality - linearizing and dividing out is usually a good starting point, and if it helps this issue that'd be awesome. We need that on CPU, and it can help GPU too (I'm not sure we're too bad there today, but I also don't think anyone has verified, for example, that when x increases we're actually traversing memory in increasing x).
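
A minimal sketch of the "linearize and divide out" idea, assuming the original 3D workgroup count is known when generating the kernel; on the GPU side `linearId` would be workgroup_id_x, and the names here are illustrative:

```cpp
#include <array>
#include <cstdint>

// "Linearize and divide out": recover the original 3D workgroup coordinates
// from a flat X-only workgroup id, assuming an x-fastest linearization
// linearId = x + counts[0] * (y + counts[1] * z). Illustrative sketch only.
std::array<uint64_t, 3> delinearize(uint64_t linearId,
                                    std::array<uint64_t, 3> counts) {
  uint64_t x = linearId % counts[0];
  uint64_t y = (linearId / counts[0]) % counts[1];
  uint64_t z = linearId / (counts[0] * counts[1]);
  return {x, y, z};
}
```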

IDs are meant to represent the order in which data is processed. This is already what we do right now: it happens when we tile and distribute, and hopefully we keep it separate from the IDs.

Another thing we're going to need to handle is collapsing dimensions and blocking - there's no GPU with 20 million threads, so scheduling 20 million threads is silly - we should oversubscribe the device, but not by several orders of magnitude 😃 We should be able to generate workgroups (thread blocks, etc.) that do more than one thing and use that to handle these overflows.
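
A common shape for "workgroups that do more than one thing" is a grid-stride-style loop, sketched below in plain C++ for clarity; on a GPU the id and count would come from the workgroup builtins, and all names here are hypothetical:

```cpp
#include <cstdint>

// Grid-stride-style loop: a fixed number of workgroups covers an arbitrarily
// large iteration space by having each workgroup process several tiles.
// Conceptual sketch only; on a GPU, `groupId` and `groupCount` would be the
// workgroup id/count builtins, and processTile would handle one tile.
void processAllTiles(uint64_t groupId, uint64_t groupCount,
                     uint64_t numTiles, void (*processTile)(uint64_t)) {
  for (uint64_t tile = groupId; tile < numTiles; tile += groupCount)
    processTile(tile);
}
```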

Right, the difference is that it will have a performance cost, so we have to decide when we want to use it.

Should we split the Vulkan issue into a separate issue?

Ideally we come up with one solution that solves both. Vulkan might be more restricted, in which case we should solve for Vulkan and that would also solve CUDA.