tinygrad: GPU EfficientNet is weirdly slow
did inference in 0.28 s
Mul : 163 29.18 ms
Add : 140 25.53 ms
Pow : 98 18.43 ms
Pad2D : 17 16.97 ms
Conv2D : 81 14.49 ms
Sigmoid : 65 10.23 ms
Reshape : 230 9.94 ms
Sub : 49 9.75 ms
AvgPool2D : 17 5.93 ms
Dot : 1 1.06 ms
Run with DEBUG=1 for profiling. Conv2D isn't even close to the top of the list of time users.
As far as I know, GEMM- and FFT-based convolutions are generally faster than naive nested loops. FFT has some restrictions on boundary conditions (it assumes wrapping boundaries, IIRC). GEMM solvers on GPUs have been heavily optimized in recent years, such as the one in cuBLAS. But the implementation we have is flexible and likely the smallest.
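For reference, here is a minimal numpy sketch of the im2col + GEMM idea (illustration only, not tinygrad's code; shapes, stride-1/no-padding assumptions, and the function name are mine):

```python
import numpy as np

def conv2d_im2col(x, w):
    """Convolution as im2col + one big matmul. x: (N, Cin, H, W), w: (Cout, Cin, KH, KW).
    Assumes stride 1 and no padding, purely to keep the sketch short."""
    N, Cin, H, W = x.shape
    Cout, _, KH, KW = w.shape
    OH, OW = H - KH + 1, W - KW + 1

    # Gather every KHxKW patch into columns: (N, Cin*KH*KW, OH*OW)
    cols = np.empty((N, Cin * KH * KW, OH * OW), dtype=x.dtype)
    for i in range(KH):
        for j in range(KW):
            patch = x[:, :, i:i + OH, j:j + OW].reshape(N, Cin, OH * OW)
            cols[:, i * KW + j::KH * KW, :] = patch   # interleave so layout matches w.reshape

    w_mat = w.reshape(Cout, Cin * KH * KW)            # (Cout, Cin*KH*KW)
    out = w_mat @ cols                                # broadcasts over N -> (N, Cout, OH*OW)
    return out.reshape(N, Cout, OH, OW)

x = np.random.randn(1, 3, 8, 8).astype(np.float32)
w = np.random.randn(4, 3, 3, 3).astype(np.float32)
y = conv2d_im2col(x, w)                               # (1, 4, 6, 6)
```

The point is just that all the inner-loop work collapses into a single large matmul, which is exactly what cuBLAS-style GEMM kernels are tuned for.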
I think the binary_ops suffer from fragmented global GPU memory accesses. The most obvious sign is that the indices are not purely based on global thread ids (they contain variable divisors). That likely turns into 32 separate memory transactions per warp instead of the single transaction a coalesced access would need. Global memory access has a huge latency (500-1000 cycles, at least on NVIDIA hardware), and non-coalesced access makes that cost much worse.
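A plain-Python illustration of the index math being described (not the actual tinygrad kernel; the shapes and function name are hypothetical): once an operand is broadcast, the input address for a flat output index `gid` involves divisions and modulos rather than just `gid`.

```python
def broadcast_index(gid, out_shape, in_shape):
    """Map a flat output index to the flat index of a (possibly broadcast) input.
    Assumes both shapes already have the same rank; size-1 dims mark broadcasting."""
    idx, out_stride, in_stride = 0, 1, 1
    for out_dim, in_dim in zip(reversed(out_shape), reversed(in_shape)):
        coord = (gid // out_stride) % out_dim      # <- the "variable divisor"
        if in_dim != 1:                            # broadcast dims don't advance the input
            idx += coord * in_stride
            in_stride *= in_dim
        out_stride *= out_dim
    return idx

# No broadcasting: the input index is just gid, the pattern a coalesced read wants.
print([broadcast_index(g, (64, 128), (64, 128)) for g in range(4)])  # [0, 1, 2, 3]
# A hypothetical (64, 1) bias broadcast against a (64, 128) activation:
# the input index is effectively gid // 128, not gid itself.
print([broadcast_index(g, (64, 128), (64, 1)) for g in range(4)])    # [0, 0, 0, 0]
```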
A naive solution to both broadcasting and memory-access speed would be to fill out a full-sized tensor on the CPU (before executing the kernel), with the data duplicated along the axes where broadcasting occurs. This obviously incurs a storage cost, but it would remove the need for variable divisors in the global memory indices and would enable coalesced memory reads. That would optimize all binary ops and give us a short road to broadcasting. There are more elegant options, but we have 11 lines to spare at the moment…
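A sketch of that "materialize the broadcast on the CPU" idea, using numpy purely for illustration (the helper name and shapes are mine, not tinygrad API):

```python
import numpy as np

def expand_for_kernel(a, out_shape):
    # broadcast_to gives a zero-copy view; ascontiguousarray actually duplicates the
    # data, so the buffer uploaded to the GPU is dense and reads can be coalesced.
    return np.ascontiguousarray(np.broadcast_to(a, out_shape))

bias = np.random.randn(64, 1).astype(np.float32)   # small operand
full = expand_for_kernel(bias, (64, 128))           # (64, 128) dense copy
# The binary-op kernel can then index both buffers with the plain global thread id:
#   out[gid] = x[gid] + full[gid]
```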
Hmm, this slots thing is cool to learn about!
But I don't think that's the source of the slowness; I suspect it's the GPU driver. Want to profile it?
I found those mults don't make much of a difference for speed; the compiler gets them. Up to you which is more readable.
Our goal isn't optgpufast; I think there are some obvious reasons why it's slow that we can fix with very few lines.
A quick note from the sidelines: if you are going to speed things up, maybe consider opening a new file, optgpufast.py. Keep the current implementation for readability and proof of concept, and the fast implementation for super fast computations. I think this would make testing, debugging, and comparison of results way easier. One major benefit is that a CPU and a slow GPU implementation would remain if the fast GPU implementation is temporarily broken for some users.