llama.cpp: Performance e-core bug(?) - only 50% CPU utilization when using all threads - (Win11, Intel 13900k)

I haven’t dug deep into this yet, but my overall CPU utilization is only at 50%. I compiled it with the current VS build tools, all defaults, release mode of course.

It might be related to the e-cores in modern Intel CPUs; they pack quite a punch but are weaker than the performance cores. In the graph it looks like 16 cores (the number of e-cores) are heavily utilized while 8 cores (the number of performance cores) are mostly idle, despite using 24 threads. Increasing the thread count worsens performance; decreasing it lowers token output.

I tested the small 7B model in 4-bit and 16-bit. The only way to get CPU utilization above 50% is to use more threads than physical cores (e.g. 32). In that case I see up to 99% CPU utilization, but token performance drops below what 2 cores deliver; some hyperthreading issue, I suppose. I tried various settings (small/large batch size, context size); none of them influence it much.

The CPU was otherwise idle (as seen in the screenshot), and memory is neither full nor swapping.

Here is the command line: .\Release\main.exe -m .\models\7B\ggml-model-f16.bin -p "Below I count from 1 to 100000000: 1 2 3 4 5 6" -c 1024 -t 24 -n 1024 -b 64

system_info: n_threads = 24 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 1024, n_batch = 64, n_predict = 1024, n_keep = 0

[image: Task Manager screenshot of CPU utilization]

About this issue

  • State: closed
  • Created a year ago
  • Comments: 22 (13 by maintainers)

Most upvoted comments

When the thread launches the OS scheduler has no clue what that thread is going to do.

Yeah, I thought about this too: how on earth can the OS scheduler know whether to assign a thread to a P-core or an E-core before actually doing the work? There is some information in How 13th Gen Intel® Core™ Processors Work on how the “behind-the-scenes magic that maximizes hybrid performance” works, namely the symbiotic relationship between the OS scheduler and the “Intel Thread Director”. I also remember “advanced AI” being thrown around in some other article, whatever that means.

I’m doubtful the entire approach works.

It should work, though: ‘cpuid’ returns information about the logical processor the instruction was executed on. So yes, it cannot be used to know beforehand whether a thread is going to be executed on a P or E core (that’s affinity, which you can force), but it can determine the current state, i.e. while the thread is already running, after the scheduler has dispatched the work to a logical processor. So the functions posted are useful for information’s sake, not for changing what is happening. They can show whether an affinity lock actually worked, and also how threads are being assigned when affinity is not locked.

btw here is the non ASM approach:

The reason to use the ASM approach was to support any and all compilers, as I found it frustrating that there are multiple implementations depending on the compiler, as if they couldn’t decide on a standard and each had to make up their own. Especially if you need sub-leaves other than 0, there are even more intrinsics to cover; they aren’t required in this particular case, though.

The performance benefit of doing it in the fewest instructions possible saves just a few instructions and a dozen bytes of memory at best, so that wasn’t really a consideration.

As in my single-header (incomplete) cross-compiler feature-flag library cpuid.h, I simply found writing the intrinsics from scratch easier than covering and testing every possible compiler (and version). Using simple inline assembly ensures the code is always the same and doesn’t allow for any compiler confusion.
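For reference, the inline-assembly style being described can be sketched roughly like this (GCC/Clang extended-asm syntax on x86-64; an illustration, not the actual cpuid.h code — MSVC on x64 has no inline asm and would still need its intrinsic):

```c
#include <stdint.h>

// Execute CPUID for the given leaf/sub-leaf and store EAX..EDX in out[0..3].
static inline void cpuid_asm(uint32_t leaf, uint32_t subleaf, uint32_t out[4])
{
    __asm__ volatile("cpuid"
                     : "=a"(out[0]), "=b"(out[1]),
                       "=c"(out[2]), "=d"(out[3])
                     : "a"(leaf), "c"(subleaf));
}
```

The same few lines cover every leaf and sub-leaf, which is the appeal over juggling `__cpuid`, `__cpuidex`, `__get_cpuid`, and `__get_cpuid_count` across compilers.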

The compiler intrinsics are translated to assembly at compile time anyway, so there isn’t really a difference. A non-ASM approach should be fine too; it just needs a few more #ifdefs and work to support. Other than that they’re fundamentally identical. Intrinsics do have the upside of better readability, and of not needing an explanatory comment describing what the ASM is doing.

When I implemented it in the thread worker it actually killed the worker; just executing the opcode sequence caused it to stall, even when ignoring the result. Debugging why that happens would be painful.

Could you drop the stalling compiled binary as an attachment, or put it on Google Drive/Dropbox/whatever? I’d be interested in taking a look. Maybe add a printf("HEY, OVER HERE") just before the call so I can find it quickly.

Here are the best options I found for Intel 13th gen: https://github.com/ggerganov/llama.cpp/discussions/229#discussioncomment-5454503:

I tinkered a bit, and here is what seemed best on my i5 13500:

  • Switch from OpenBLAS to Intel oneAPI MKL’s BLAS implementation
  • 6 threads for ggml (this CPU has 6 performance cores)
  • 8 threads for OPL (this CPU has 8 efficiency cores)

Using a single ggml thread with 5 BLAS threads on the 5 other performance cores proceeds quite well, but of course inference is then slow. It would be great to be able to set the ggml/BLAS thread counts separately depending on whether it is initial prompt ingestion or inference.

Using more than 6 ggml threads is very slow; I believe the efficiency cores are the bottleneck.

I opened a PR against OpenBLAS to address the issue I had with it on Intel 13th gen: https://github.com/xianyi/OpenBLAS/pull/3970 Make sure to compile the latest version of OpenBLAS with this PR if you are on an i5 13500.

You could try adding this piece of code https://github.com/ggerganov/llama.cpp/discussions/572#discussioncomment-5456823 to determine whether a given thread is running on a P or E core. (Note that the snippet shouldn’t be used on anything other than Intel 12th/13th gen, since I only included the P/E-core part, not the code to check the platform. If such a thing were added to master, it would be trivial to extend it with the platform check.)

You could also lock the process’s thread affinity to only the P-cores in Task Manager and see if that improves performance, as that discussion thread suggests. The point of the P/E architecture was that the Intel Thread Director, working in tandem with the OS thread scheduler, should know which core to use for a given task, but in reality it doesn’t always. It also doesn’t help that every Windows after 7 is garbage and wants to eat your resources in the background. But maybe locking apps to the performance cores would make the background crap run only on the E-cores and keep the P-cores free for actual work.

Also, if you have an AVX-512 capable CPU, you could try enabling it (which also disables the E-cores, afaik) and see whether AVX-512 increases performance even further compared to AVX2.

If there is a performance increase, it could be a good idea to add it behind a compile flag #ifdef ENABLE_INTEL_FORCE_P_CORE along with an option like --force-p-core, using pthread_setaffinity_np on unixes and SetThreadAffinityMask on Windows.
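A minimal sketch of what such an option could call per worker thread, using the two APIs named above. The helper name is made up, and which bits of the mask correspond to P-cores must be discovered separately (e.g. via CPUID leaf 0x1A from each thread, or GetLogicalProcessorInformationEx on Windows); that part is assumed here.

```c
#ifndef _WIN32
#define _GNU_SOURCE /* for pthread_setaffinity_np */
#endif

#if defined(_WIN32)
#include <windows.h>
#else
#include <pthread.h>
#include <sched.h>
#endif

// Pin the calling thread to the logical CPUs set in 'mask'
// (bit i = logical CPU i). Returns 1 on success, 0 on failure.
static int pin_current_thread(unsigned long long mask)
{
#if defined(_WIN32)
    // SetThreadAffinityMask returns the previous mask, or 0 on failure.
    return SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)mask) != 0;
#else
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < 64; ++cpu)
        if (mask & (1ULL << cpu))
            CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
#endif
}
```

On a 13900K, for example, a --force-p-core option would build a mask covering the 16 logical CPUs backed by the 8 P-cores and call this from each ggml worker at startup.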