vello: Slow fine rasterization (k4) on Adreno 640
I’m getting piet-gpu running on Android (see #82 for a snapshot), and I’m finding fine rasterization considerably slower than expected. The other stages in the pipeline seem fine. I’ve done some investigation, but the fundamental cause remains a mystery.
Info so far: the Adreno 640 (Pixel 4) has a default subgroup size of 64 (though it can also be set to 128 using VK_EXT_subgroup_size_control). That should be fine, as it means the memory reads from the per-tile command list are amortized over a significant number of pixels even when CHUNK (the number of pixels written per invocation of kernel4.comp) is 1 or 2. If CHUNK is 4, the workgroup and subgroup sizes are the same; any larger value results in a partially filled subgroup. There’s more detail about the Adreno 6xx on the freedreno wiki.
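To make the amortization argument concrete, here is a back-of-envelope sketch (not code from the repo; the 16x16 tile size and the constants are assumptions for illustration) of how many pixels share one dynamically uniform ptcl read at various CHUNK values:

```rust
// Back-of-envelope sketch: amortization of per-tile command list reads.
// Assumes one 16x16 pixel tile per workgroup and a subgroup size of 64
// (the Adreno 640 default); these constants are illustrative.
fn main() {
    const TILE_PIXELS: u32 = 16 * 16;
    const SUBGROUP: u32 = 64;
    for chunk in [1u32, 2, 4, 8] {
        // One invocation writes `chunk` pixels, so the workgroup size is:
        let workgroup = TILE_PIXELS / chunk;
        // A dynamically uniform ptcl read is issued once per subgroup, so it
        // is amortized over min(subgroup, workgroup) invocations' pixels.
        let pixels_per_read = SUBGROUP.min(workgroup) * chunk;
        println!("CHUNK={chunk}: workgroup={workgroup}, pixels/read={pixels_per_read}");
    }
}
```

Under these assumptions, CHUNK=4 gives a workgroup of exactly 64 invocations (one full subgroup), while CHUNK=8 drops the workgroup to 32, leaving the subgroup half empty, which matches the note above about partially filled subgroups.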
There are some experiments that move the needle. Perhaps the most interesting is that commenting out the bodies of `Cmd_BeginClip` and `Cmd_EndClip` at least doubles the speed. This is true even when clip commands are completely absent from the workload. My working hypothesis is that register pressure from accommodating the clip stack and other state is reducing occupancy.
Another interesting set of experiments involves adding a per-pixel ALU load. The time taken increases significantly with CHUNK, and I’m not sure how to interpret that yet. Tweaking synthetic workloads like this may well be the best way to move forward, though I’d love to be able to see the asm from the shader compilation. I’m looking into the possibility of running this workload on the same (or similar) hardware but with a free driver such as freedreno, so that I might gain more insight that way.
I’ve been using Android GPU Inspector (make sure to use at least 1.1), but so far it only gives me fairly crude metrics: things like % ALU capacity and write bandwidth scale with how much work gets done, while other metrics like % shaders busy and GPU % utilization are high in all cases.
There are other things I’ve been able to rule out: a failure of loop unrolling by the shader compiler, and a failure to account ptcl reads as dynamically uniform (I manually had one invocation read and broadcast the result, which yielded no change).
I do have some ideas how to make things faster (including moving as much of the math as possible to f16), but the first order of business is understanding why it’s slow, especially when we don’t seem to be seeing similar problems on desktop GPUs.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 33 (14 by maintainers)
Commits related to this issue
- use mediump precision for kernel4 colors and areas Improves kernel4 performance for a Gio scene from ~22ms to ~15ms. Updates #83 Signed-off-by: Elias Naur <mail@eliasnaur.com> — committed to eliasnaur/piet-gpu by eliasnaur 3 years ago
- use mediump precision for kernel4 colors and areas Improves kernel4 performance for a Gio scene from ~22ms to ~15ms. Updates #83 Signed-off-by: Elias Naur <mail@eliasnaur.com> — committed to linebender/vello by eliasnaur 3 years ago
- internal/gl: implement glGetProgramBinary Useful for debugging shader compiler issues, such as those that may cause https://github.com/linebender/piet-gpu/issues/83 Signed-off-by: Elias Naur <mail@... — committed to gioui/gio by eliasnaur 3 years ago
- Implement robust dynamic memory This is the core logic for robust dynamic memory. There are changes to both shaders and the driver logic. On the shader side, failure information is more useful and f... — committed to linebender/vello by raphlinus 2 years ago
- Implement robust dynamic memory This is the core logic for robust dynamic memory. There are changes to both shaders and the driver logic. On the shader side, failure information is more useful and f... — committed to raphlinus/piet-gpu by raphlinus 2 years ago
fwiw, IBO is actually “Image Buffer Object” (I kinda made that name up… it is what is used for both Images and SSBOs)
`isam` is texelFetch, so it goes thru the TPL1 texture cache… whereas `ldib` does not (image writes are not coherent with reads that go thru the texture cache)… I believe the closed driver will try to promote image loads to `isam` when it realizes there is no potential coherency issue with writes to the same image (which could include the same image attached at a different binding point)

Ok, I got envytools to run and have substantially more insight. Attached is disassembler output from kernel 4 in #82, and from a similar version with the bodies of `Cmd_BeginClip` and `Cmd_EndClip` commented out (noclip). These were obtained by running `pgmdump2` from freedreno/envytools, patched slightly to set gpu_id to 640 and MAX_REG to 512. These were all run using the kitchen APK as above, i.e. using glGetProgramBinary. Incidentally, sometimes this outputs a compressed format that looks very similar to the k4_pipeline attached above (the first two dwords are 000051c0 5bed9c78, while the dword at offset 0x30 in k4_pipeline is 59ed9c78, different by only one bit). If I interact with the graphical app before using the service, it seems more likely to produce uncompressed output.

In any case, looking at the asm, there is indeed a difference. In the k4_noclip version (happy path), all reads from the input buffers use the `isam` instruction, including the tag read. But in k4_clip (sad path), while some of the reads are still `isam`, the tag and TileSeg reads use the `ldib` instruction instead.

Doing a little reading in the freedreno code, `ldib` is a cat6 instruction new to the Adreno 6xx series. There’s a comment on emit_intrinsic_load_ssbo with some info, and there’s also a doc string in the xml for the instruction that reads, cryptically, “LoaD IBo”. (IBO = “index buffer object” in GL-speak.)

I didn’t study the `isam` instruction in quite as much depth. There’s another comment, discussing changes in ssbo handling, that refers to it. It’s a category 5 instruction.

So, what we know with reasonably high confidence: the compiler is using some heuristic to select between `isam` and `ldib` for reads from these buffers, and for some reason `ldib` seems quite a bit slower than `isam`. Why, I’m not completely sure; it might have something to do with caching.

I’ve skimmed over the rest of the shader code and don’t see anything unexpected or very different between the two versions. Register usage seems comparable. There are of course lots of opportunities for experimental error here, but I’m reasonably confident I’m looking at the actual source of the slowdown.
k4_clip.gz k4_noclip.gz
Rebasing to master (which brings in 22507de among other things) yields continued slow performance (seemingly worse than the branch I was working on). Similarly, commenting out the begin and end clip commands gives a dramatic improvement in performance (though still not as good as the snapshot). This is interesting and somewhat unexpected, as I would have expected the `clip_stack[]` itself to be significant. It’s not entirely unexpected, though, as one of the earlier experiments was to increase the clip stack size, with the intent of forcing the compiler to spill it rather than consume registers, and that did not significantly change performance. At least we have more variables to play with.
The “% Shader ALU Capacity Utilized” metric reported by AGI is around 6% for CHUNK=4 and with the clip stack in place, using #82 as the measurement baseline. It’s about double that with CHUNK=1 and no clip stack.