llama.cpp: CTX Processing regression for Pascal - Commit 2b4ea35
Prerequisites
Please answer the following questions for yourself before submitting an issue.
- I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
There is a regression in context processing introduced in commit https://github.com/ggerganov/llama.cpp/commit/2b4ea35e56792064598e922e46d081e02bc96b94
This specifically affects Pascal (compute capability 6.1), which has 1/64th fp16 throughput. The problem gets worse with longer context, reaching up to 6x slower at 8k context.
Device 0: Tesla P40, compute capability 6.1
| model | size | params | backend | ngl | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------- | ---------------: |
| llama 13B mostly Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 1 | pp 512 | 485.03 ± 0.34 |
| llama 13B mostly Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 1 | tg 128 | 18.30 ± 0.00 |
build: daab3d7 (1421)
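For reference, these tables are llama-bench markdown output; a minimal sketch of the kind of invocation that produces them (the model path is a placeholder, not the exact file used here) would be:

```sh
# Illustrative llama-bench run: full offload (-ngl 99), one CPU thread,
# prompt-processing (pp 512) and token-generation (tg 128) tests.
# The model path is a placeholder.
./llama-bench -m models/llama-13b-q8_0.gguf -ngl 99 -t 1 -p 512 -n 128
```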
Current Behavior
ggml_init_cublas: found 1 CUDA devices:
Device 0: Tesla P40, compute capability 6.1
| model | size | params | backend | ngl | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------- | ---------------: |
| llama 13B mostly Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 1 | pp 512 | 207.34 ± 0.28 |
| llama 13B mostly Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 1 | tg 128 | 18.28 ± 0.01 |
build: 2b4ea35 (1422)
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
ggml_init_cublas: found 1 CUDA devices:
Device 0: Tesla P40, compute capability 6.1
warning: cannot set main_device=1 because there are only 1 devices. Using device 0 instead.
| model | size | params | backend | ngl | threads | main_gpu | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------: | ---------- | ---------------: |
| llama 13B mostly Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 1 | 1 | pp 512 | 208.54 ± 0.58 |
| llama 13B mostly Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 1 | 1 | tg 128 | 18.29 ± 0.00 |
build: 207b519 (1446)
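The last run above has MMQ forced on. A minimal sketch of how such a build might be configured with CMake/MSVC, assuming the LLAMA_CUBLAS and LLAMA_CUDA_FORCE_MMQ options available in llama.cpp at the time:

```sh
# Illustrative configure/build step (option names assumed for this era of llama.cpp):
# LLAMA_CUBLAS enables the CUDA backend, LLAMA_CUDA_FORCE_MMQ forces the
# quantized matrix-multiplication kernels instead of the cuBLAS fp16 GEMM path.
cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_CUDA_FORCE_MMQ=ON
cmake --build . --config Release
```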
Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except under certain specific conditions.
- CPU/RAM: 5800X + 64GB DDR 3733
- GPUs: 3060 Ti (8GB) + Tesla P40 (24GB)
- Operating System: Windows 11
- SDK version: MSVC 2022
- `$ python3 --version`: 3.10.11
- `$ cmake --version`: 3.27.4
About this issue
- State: closed
- Created 8 months ago
- Reactions: 5
- Comments: 18 (3 by maintainers)
Alright, thanks to Slaren I was able to fix the problem. The issue was that I was not compiling with AVX2 support; I had assumed it was enabled by default, which it no longer is.
Performance with AVX2 is great, as expected. Case closed!
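For anyone hitting the same thing, here is a minimal sketch of explicitly enabling AVX2 at configure time, assuming the LLAMA_AVX2 and LLAMA_CUBLAS CMake options used by llama.cpp in this period:

```sh
# Illustrative: explicitly enable AVX2 (and the CUDA backend) rather than
# relying on defaults. LLAMA_AVX2 / LLAMA_CUBLAS option names are assumptions
# based on the build options of this era.
cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_AVX2=ON
cmake --build . --config Release
```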
Sounds like the same issue as mine (https://github.com/ggerganov/llama.cpp/issues/3780)
Thank you, these are overall in line with my expectations.
I think you might be reading this wrong. From what I see, the new build is ~39 t/s, a bit faster than the old one even for TG when the model is fully offloaded. This is nice to see, although it deviates from my expectation of a slight regression in short TG 128 tests. In any case, I believe you will see even bigger gains with the new build when the context is large.
The TG regression (-8.8%) with partial offloading is expected due to offloading fewer layers, but at least PP got a non-negligible improvement (+19.0%). In the future, I think we will compensate for this as I explained in an earlier comment.
@LostRuins I think you mentioned earlier that for full offload, the new version on the RTX 2060 is faster compared to MMQ, and that you observe a regression for not-fully-offloaded models due to having 1-2 fewer GPU layers.
How big is the latter regression? Is it a regression for both short and long (>1024) contexts? If you could post some numbers for PP and TG, it would help to get a sense of the impact of the change.
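A sketch of how those numbers could be collected in one llama-bench run, assuming comma-separated test sizes are supported; the model path and -ngl value are placeholders:

```sh
# Illustrative: partial offload (-ngl 35 as a placeholder), a short and a long
# prompt-processing size plus a tg 128 test in a single run.
./llama-bench -m models/model-q8_0.gguf -ngl 35 -p 512,3584 -n 128
```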