llama.cpp: CTX Processing regression for Pascal - Commit 2b4ea35
Prerequisites
Please answer the following questions for yourself before submitting an issue.
- I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
There is a regression in context processing introduced in commit https://github.com/ggerganov/llama.cpp/commit/2b4ea35e56792064598e922e46d081e02bc96b94
This specifically affects Pascal (compute capability 6.1), which has 1/64th fp16 throughput. The problem gets worse with longer context, reaching up to 6x slower at 8k context.
Device 0: Tesla P40, compute capability 6.1
| model | size | params | backend | ngl | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------- | ---------------: |
| llama 13B mostly Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 1 | pp 512 | 485.03 ± 0.34 |
| llama 13B mostly Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 1 | tg 128 | 18.30 ± 0.00 |
build: daab3d7 (1421)
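For reference, these tables are llama-bench markdown output; a minimal sketch of the kind of invocation that produces them (the model path is a placeholder, not the exact file used here) would be:

```sh
# Illustrative llama-bench run: full offload (-ngl 99), one CPU thread,
# prompt-processing (pp 512) and token-generation (tg 128) tests.
# The model path is a placeholder.
./llama-bench -m models/llama-13b-q8_0.gguf -ngl 99 -t 1 -p 512 -n 128
```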
Current Behavior
ggml_init_cublas: found 1 CUDA devices:
Device 0: Tesla P40, compute capability 6.1
| model | size | params | backend | ngl | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------- | ---------------: |
| llama 13B mostly Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 1 | pp 512 | 207.34 ± 0.28 |
| llama 13B mostly Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 1 | tg 128 | 18.28 ± 0.01 |
build: 2b4ea35 (1422)
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
ggml_init_cublas: found 1 CUDA devices:
Device 0: Tesla P40, compute capability 6.1
warning: cannot set main_device=1 because there are only 1 devices. Using device 0 instead.
| model | size | params | backend | ngl | threads | main_gpu | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------: | ---------- | ---------------: |
| llama 13B mostly Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 1 | 1 | pp 512 | 208.54 ± 0.58 |
| llama 13B mostly Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 1 | 1 | tg 128 | 18.29 ± 0.00 |
build: 207b519 (1446)
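The last run above has MMQ forced on. A minimal sketch of how such a build might be configured with CMake/MSVC, assuming the LLAMA_CUBLAS and LLAMA_CUDA_FORCE_MMQ options available in llama.cpp at the time:

```sh
# Illustrative configure/build step (option names assumed for this era of llama.cpp):
# LLAMA_CUBLAS enables the CUDA backend, LLAMA_CUDA_FORCE_MMQ forces the
# quantized matrix-multiplication kernels instead of the cuBLAS fp16 GEMM path.
cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_CUDA_FORCE_MMQ=ON
cmake --build . --config Release
```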
Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except under certain specific conditions.
- CPU/RAM: 5800X + 64GB DDR 3733
- GPUs: 3060 Ti (8GB) + Tesla P40 (24GB)
- Operating System: Windows 11
- SDK version: MSVC 2022
- `$ python3 --version`: 3.10.11
- `$ cmake --version`: 3.27.4
About this issue
- State: closed
- Created 8 months ago
- Reactions: 5
- Comments: 18 (3 by maintainers)
Alright, thanks to Slaren I was able to fix the problem. The issue was that I was not compiling with AVX2 support; I had assumed it was enabled by default, which it no longer is.
Performance with AVX2 is great, as expected. Case closed!
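For anyone hitting the same thing, here is a minimal sketch of explicitly enabling AVX2 at configure time, assuming the LLAMA_AVX2 and LLAMA_CUBLAS CMake options used by llama.cpp in this period:

```sh
# Illustrative: explicitly enable AVX2 (and the CUDA backend) rather than
# relying on defaults. LLAMA_AVX2 / LLAMA_CUBLAS option names are assumptions
# based on the build options of this era.
cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_AVX2=ON
cmake --build . --config Release
```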
Sounds like the same issue as mine (https://github.com/ggerganov/llama.cpp/issues/3780)
Thank you, these are overall in line with my expectations.
I think you might be reading this wrong. From what I see, the new build is ~39 t/s, a bit faster than the old one even for TG when the model is fully offloaded. This is nice to see, although it deviates from my expectation of a slight regression in short TG 128 tests. In any case, I believe you will see even bigger gains with the new build when the context is large.
The TG regression (-8.8%) with partial offloading is expected due to offloading fewer layers, but at least PP got a non-negligible improvement (+19.0%). In the future, I think we will compensate for this as I explained in an earlier comment.
@LostRuins I think you mentioned earlier that for full offload, the new version on the RTX 2060 is faster compared to MMQ, and that you observe a regression for not-fully-offloaded models due to having 1-2 fewer GPU layers.
How big is the latter regression? Is it a regression for both short and long (>1024) contexts? If you could post some numbers for PP and TG, it would help to get a sense of the impact of the change.
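A sketch of how those numbers could be collected in one llama-bench run, assuming comma-separated test sizes are supported; the model path and -ngl value are placeholders:

```sh
# Illustrative: partial offload (-ngl 35 as a placeholder), a short and a long
# prompt-processing size plus a tg 128 test in a single run.
./llama-bench -m models/model-q8_0.gguf -ngl 35 -p 512,3584 -n 128
```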