whisper.cpp: Diminishing returns with increasing number of threads
It seems like 7 threads is the sweet spot, after which performance starts decreasing.
Is this expected?
Latest build from the GitHub Workflows
Windows 21H2
AMD 3700X
About this issue
- State: open
- Created 2 years ago
- Reactions: 2
- Comments: 16 (3 by maintainers)
@jonvaldes
Thanks for this analysis! I guess I will have to make the threads wait on a condition variable instead of joining them when ggml_graph_compute finishes.
Regarding the atomic_load: once the threads are started, I found that a busy loop on an atomic counter is much more efficient than waiting on and notifying a condition variable. Of course, it probably wastes more energy, but since I am more interested in performance, it was the better choice. I think I can add a "low-power" mode where, instead of busy loops, we use the standard mechanism with a condition variable. That would make the CPU go less crazy.
@ggerganov I profiled it with FlameGraph on my Linux host. With 8 threads, you can see that ggml_compute_forward_mul_mat uses only about 24.71% of CPU time, while 72.53% (97.24 - 24.71) of CPU time is wasted. I suspect this is why Metal doesn't work as expected: it's not the bottleneck.
I'm not familiar with C++, but from the code I guess that decreasing the number of threads can help reduce the busy-waiting time. Here is the FlameGraph with 4 threads: now ggml_compute_forward_mul_mat spends 63.21% of CPU time doing actual work, and only 32.19% (95.4 - 63.21) is busy waiting.
@savchenko
Yes, I observe the same behaviour on M1 Pro: 7 threads is the sweet spot. Thanks for pointing it out. I actually thought that 8 threads was best.
My explanation is that the computation becomes memory-bound at some point, so you stop gaining performance with more CPU power. It’s the memory that limits us.
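To make the memory-bound argument concrete, here is a rough sketch of a roofline-style model. It is not taken from whisper.cpp: the function names and both input parameters (single-thread compute time and the time needed to stream the data at full memory bandwidth) are assumptions made up for illustration. The model only explains why the speed-up flattens out; it does not account for performance actually getting worse with more threads.

```cpp
#include <algorithm>

// Hypothetical model, not part of whisper.cpp or ggml.
// compute_ms: time the arithmetic would take on a single thread.
// mem_ms:     time needed to move the data at full memory bandwidth
//             (assumed not to shrink when more threads are added).
double estimated_time_ms(double compute_ms, double mem_ms, int n_threads) {
    // Compute work is split across threads; memory traffic is the floor.
    double compute = compute_ms / n_threads;
    return std::max(compute, mem_ms);
}

// Smallest thread count at which the memory floor is reached in this model.
// Beyond it, extra threads bring no further gain.
int saturation_threads(double compute_ms, double mem_ms, int max_threads) {
    for (int n = 1; n <= max_threads; ++n) {
        if (compute_ms / n <= mem_ms) {
            return n;
        }
    }
    return max_threads;
}
```

With made-up inputs of 7000 ms of single-thread compute and a 1000 ms memory floor, saturation_threads(7000.0, 1000.0, 16) returns 7, and estimated_time_ms stays at 1000 ms for any higher thread count.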
@RYucel, as you can see from the graph above, there is still a benefit of ~250 ms from increasing the number of threads from 4 to 6. Anything higher is indeed pointless.
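For readers following the busy-loop versus condition-variable discussion earlier in the thread, here is a minimal sketch of the two waiting strategies side by side. It is not ggml's actual code: the Signal type, its member names, and the run_once helper are hypothetical and exist only to illustrate the trade-off.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical helper type, not ggml's implementation.
struct Signal {
    std::atomic<int> counter{0};
    std::mutex m;
    std::condition_variable cv;

    // Publish a new counter value and wake any blocked waiters.
    void publish(int value) {
        {
            // Storing under the mutex avoids a lost wake-up for wait_blocking.
            std::lock_guard<std::mutex> lock(m);
            counter.store(value, std::memory_order_release);
        }
        cv.notify_all();
    }

    // Busy wait: reacts quickly, but keeps a core fully occupied.
    void wait_spin(int target) {
        while (counter.load(std::memory_order_acquire) < target) {
            // spin
        }
    }

    // "Low-power" wait: the thread sleeps until it is notified.
    void wait_blocking(int target) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] {
            return counter.load(std::memory_order_acquire) >= target;
        });
    }
};

// Starts one worker that waits with the chosen strategy and
// returns the counter value the worker observed after waking up.
int run_once(bool low_power) {
    Signal s;
    int seen = -1;
    std::thread worker([&] {
        if (low_power) {
            s.wait_blocking(1);
        } else {
            s.wait_spin(1);
        }
        seen = s.counter.load(std::memory_order_acquire);
    });
    s.publish(1);
    worker.join();  // join makes the write to `seen` visible here
    return seen;
}
```

Calling run_once(false) exercises the spinning path and run_once(true) the blocking path. With more spinning threads than the workload can keep busy, the spin loop is the kind of code that would show up as wasted CPU time in a profile, which is presumably what a low-power mode would trade for higher wake-up latency.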