whisper.cpp: Diminishing returns with increasing number of threads

It seems like 7 threads is the sweet spot, after which performance starts decreasing:


Is this expected?

Latest build from the GitHub Workflows
Windows 21H2
AMD 3700X

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 2
  • Comments: 16 (3 by maintainers)

Most upvoted comments

@jonvaldes

Thanks for this analysis! I guess I will have to make the threads wait on a condition variable instead of joining them when ggml_graph_compute finishes.

Regarding the atomic_load: once the threads are started, I found that a busy loop on an atomic counter is much more efficient than waiting on and notifying a condition variable. Of course, it probably wastes more energy, but since I am more interested in performance, it was the better choice. I think I can add a “low-power” mode that uses the standard condition-variable mechanism instead of busy loops. That would make the CPU go less crazy.
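To make the trade-off above concrete, here is a minimal sketch of the two waiting strategies side by side. It is not ggml's actual code: `WaitMode`, `run_workers`, and the flag names are hypothetical, and the "work" is a trivial addition so that only the wake-up mechanism differs.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical names throughout; this only illustrates the two wait styles.
enum class WaitMode { Spin, LowPower };

// Starts n_threads workers that wait for a "go" signal, then each adds
// (index + 1) to a shared sum. Returns the sum so the result can be checked.
int run_workers(int n_threads, WaitMode mode) {
    assert(n_threads > 0);
    std::atomic<bool> go{false};  // polled in Spin mode
    std::mutex m;
    std::condition_variable cv;
    bool go_cv = false;           // guarded by m, used in LowPower mode
    std::atomic<int> sum{0};

    std::vector<std::thread> workers;
    for (int i = 0; i < n_threads; ++i) {
        workers.emplace_back([&, i] {
            if (mode == WaitMode::Spin) {
                // Busy loop: fast wake-up, but the core is fully occupied while idle.
                while (!go.load(std::memory_order_acquire)) { }
            } else {
                // Condition variable: the thread sleeps until it is notified.
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [&] { return go_cv; });
            }
            sum.fetch_add(i + 1, std::memory_order_relaxed);
        });
    }

    // Publish the signal with the mechanism that matches the wait mode.
    if (mode == WaitMode::Spin) {
        go.store(true, std::memory_order_release);
    } else {
        {
            std::lock_guard<std::mutex> lock(m);
            go_cv = true;
        }
        cv.notify_all();
    }

    for (auto &t : workers) t.join();
    return sum.load();
}
```

Calling `run_workers(4, WaitMode::Spin)` and `run_workers(4, WaitMode::LowPower)` produces the same sum; what differs is how much CPU the idle workers consume before the signal arrives, which is the cost a “low-power” mode would trade against wake-up latency.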

@ggerganov I profiled it with FlameGraph on my Linux host. With 8 threads, you can see that ggml_compute_forward_mul_mat uses only about 24.71% of CPU time, while 72.53% (97.24 - 24.71) of CPU time is wasted. I suspect this is why Metal doesn’t work as expected: it is not the bottleneck. (FlameGraph image: whisper02)

I’m not familiar with C++, but from the code I guess that decreasing the number of threads can help reduce the busy-waiting time. Here is the FlameGraph with 4 threads: ggml_compute_forward_mul_mat now spends 63.21% of CPU time doing actual work, and only 32.19% (95.4 - 63.21) is busy waiting. (FlameGraph image: thread4)

@savchenko

Yes, I observe the same behaviour on M1 Pro: 7 threads is the sweet spot. Thanks for pointing it out. I actually thought that 8 threads was best.

My explanation is that the computation becomes memory-bound at some point, so you stop gaining performance with more CPU power. It’s the memory that limits us.
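One way to probe the memory-bound explanation is to time a kernel that does almost no arithmetic per byte read and see where adding threads stops helping. The sketch below is an illustration only, not part of whisper.cpp: `parallel_sum` and `time_parallel_sum_ms` are hypothetical helpers, and the results depend heavily on buffer size, cache size, and the machine.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Sums data with n_threads threads, each reading one contiguous slice.
// The loop is dominated by memory reads, so it approximates a
// bandwidth-bound workload when data is much larger than the cache.
double parallel_sum(const std::vector<float> &data, int n_threads) {
    assert(n_threads > 0);
    const std::size_t n = data.size();
    const std::size_t threads = static_cast<std::size_t>(n_threads);
    const std::size_t chunk = (n + threads - 1) / threads;  // ceiling division

    std::vector<double> partial(threads, 0.0);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = std::min(n, t * chunk);
            const std::size_t end = std::min(n, begin + chunk);
            double acc = 0.0;
            for (std::size_t i = begin; i < end; ++i) acc += data[i];
            partial[t] = acc;  // written once per thread
        });
    }
    for (auto &w : workers) w.join();

    double total = 0.0;
    for (double p : partial) total += p;
    return total;
}

// Returns the wall-clock time of one parallel_sum call in milliseconds
// and stores the computed sum in *result.
double time_parallel_sum_ms(const std::vector<float> &data, int n_threads,
                            double *result) {
    const auto t0 = std::chrono::steady_clock::now();
    *result = parallel_sum(data, n_threads);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Running `time_parallel_sum_ms` on a buffer far larger than the last-level cache, for thread counts from 1 up to `std::thread::hardware_concurrency()`, should show whether the timing flattens before all cores are in use. Note that this measurement includes thread start-up cost, so it is only a rough indicator.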

@RYucel, as you can see from the graph above, there is still a benefit of ~250 ms from increasing the number of threads from 4 to 6. Anything higher is indeed pointless.