llama.cpp: Performance issues with cuBLAS and a bug
Performance with cuBLAS isn’t there yet; in my tests it is more a burden than a speedup for llama eval. In a simple benchmark case it is absolutely amazing: multiplying 10 million elements in F32 goes from 1+ seconds down to 20 milliseconds, so the improvement there is a blast!
But in the llama case the overhead seems to be enormous: when enabling it generically, the average computation time shoots from 300ms up to 1500ms with cuBLAS. I feel like the memory should have been prepared beforehand, and I don’t think the thousands of CUDA cores are actually being used. Is that loop really the best way to do it?
for (int64_t i03 = 0; i03 < ne03; i03++) {
    for (int64_t i02 = 0; i02 < ne02; i02++) {
        const float * x = (float *) ((char *) src0->data + i02*nb02 + i03*nb03);
        const float * y = (float *) ((char *) src1->data + i02*nb12 + i03*nb13);
        float * d = (float *) ((char *) dst->data + i02*nb2 + i03*nb3);

        // copy one 2D slice of each input to the device per iteration
        CUDA_CHECK(cudaMemcpyAsync(d_X, x, sizeof(float) * x_ne, cudaMemcpyHostToDevice, g_cudaStream));
        CUDA_CHECK(cudaMemcpyAsync(d_Y, y, sizeof(float) * y_ne, cudaMemcpyHostToDevice, g_cudaStream));

        // compute: one SGEMM per slice on the single global stream
        CUBLAS_CHECK(
            cublasSgemm(g_cublasH, CUBLAS_OP_T, CUBLAS_OP_N,
                    ne01, ne11, ne10,
                    &alpha, d_X, ne00,
                            d_Y, ne10,
                    &beta,  d_D, ne01));

        // copy the result slice back to the host
        CUDA_CHECK(cudaMemcpyAsync(d, d_D, sizeof(float) * d_ne, cudaMemcpyDeviceToHost, g_cudaStream));
    }
}
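Part of the overhead is that every slice is staged through pageable host memory and the device buffers are managed around each call. A minimal sketch of what preparing the memory beforehand could look like, assuming `d_X`/`d_Y`/`d_D` are global device pointers sized for the largest mat mul and that we are allowed to page-lock the tensors’ host buffers (the function name and size parameters here are made up):

// One-time setup, e.g. at model load: pin the host buffers so cudaMemcpyAsync
// can actually run asynchronously, and allocate the device scratch buffers
// once instead of around every mat mul.
static void init_cuda_buffers(void * weights_data, size_t weights_size,
                              void * activations_data, size_t activations_size,
                              size_t max_x, size_t max_y, size_t max_d) {
    // page-lock the existing host allocations (no copy, just pins the pages)
    CUDA_CHECK(cudaHostRegister(weights_data,     weights_size,     cudaHostRegisterDefault));
    CUDA_CHECK(cudaHostRegister(activations_data, activations_size, cudaHostRegisterDefault));

    // persistent device scratch buffers, sized for the largest multiplication
    CUDA_CHECK(cudaMalloc((void **) &d_X, sizeof(float) * max_x));
    CUDA_CHECK(cudaMalloc((void **) &d_Y, sizeof(float) * max_y));
    CUDA_CHECK(cudaMalloc((void **) &d_D, sizeof(float) * max_d));
}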
ggml_compute_forward_mul_mat_use_blas() is evaluated too often; I think it would be better to make a generic “check” function that takes the node’s operation type as input. I’m experimenting with another approach for that case. It is not finished yet, but I think we need to calculate the FLOPs required. The function is also called too often (init, compute, finalize), so I’m modifying it to set a flag in the tensor instead, but that’s just experimental at the moment.
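As a rough illustration of the flag idea: a one-time pass over the graph that decides per node whether the cuBLAS path is worth it. The `use_cuda` field and the threshold constant are hypothetical and not part of ggml; the cost model is only a placeholder.

// Hypothetical one-time pass over the graph: decide per node whether the
// cuBLAS path is worth it, instead of re-checking in init/compute/finalize.
static void mark_cuda_nodes(struct ggml_cgraph * gf) {
    for (int i = 0; i < gf->n_nodes; i++) {
        struct ggml_tensor * node = gf->nodes[i];
        if (node->op != GGML_OP_MUL_MAT) {
            continue;
        }
        // very rough cost model: FLOPs of the mat mul vs. the bytes that have
        // to cross the PCIe bus; the threshold would come from a local benchmark
        const double flops = 2.0*node->src0->ne[0]*node->src0->ne[1]*node->src1->ne[1];
        const double bytes = (double) ggml_nbytes(node->src0) + (double) ggml_nbytes(node->src1);
        // use_cuda and CUDA_FLOPS_PER_BYTE_THRESHOLD are hypothetical names
        node->use_cuda = (flops/bytes > CUDA_FLOPS_PER_BYTE_THRESHOLD) ? 1 : 0;
    }
}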
There is a bug in ggml.c which causes the CUDA matrix multiplication to be executed by every thread when we have a 32-bit input and output.
else if (node->src0->type == GGML_TYPE_F32 && node->src1->type == GGML_TYPE_F32) {
    cur = 0;
}
I think it should rather be like this:
else if (node->src0->type == GGML_TYPE_F32 && node->src1->type == GGML_TYPE_F32) {
    cur = 0;
    if (ggml_compute_forward_mul_mat_use_blas(node->src0, node->src1, node)) {
        // let a single thread drive the BLAS/CUDA path and reserve a work buffer
        node->n_tasks = 1;
        cur = GGML_TYPE_SIZE[GGML_TYPE_F32]*(node->src0->ne[0]*node->src0->ne[1]);
    }
}
If n_tasks is not reduced to 1, all threads are still associated with the CUDA branch; nothing currently stops each of them from running the cuBLAS multiplication (and, surprisingly, nothing crashes).
Finally: a generic way to decide whether CUDA should be enabled would be to compute the required FLOPs and weigh them against the memory-transfer time, calibrated by a quick local benchmark at startup. But for our use case we know beforehand whether we need it: every layer is known and all sizes are known. So that flag in the tensor struct should be set manually (I currently use -1, 0 and 1: -1 = generic decision, 1 = force CUDA if possible, 0 = disable CUDA). There should also be a way to use all of the CUDA cores; I have no experience with that yet, but I’m sure we can get the multiplication through a lot faster. It looks like the 30+ attention heads are calculated in sequence, and CUDA can do that in one shot (see the sketch below).
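For the “one shot” part, cuBLAS offers strided batched GEMM, which could replace the per-slice loop above when all slices sit in device memory at a fixed stride. A hedged sketch, reusing the variable names from the loop and assuming `d_X`/`d_Y`/`d_D` hold all `ne02*ne03` slices contiguously with per-slice strides `x_ne`, `y_ne` and `d_ne`:

// One strided-batched SGEMM over all ne02*ne03 slices instead of issuing
// one cublasSgemm per 2D slice. Assumes the slices are contiguous on the
// device with a fixed stride between consecutive slices.
const int batch_count = ne02*ne03;
CUBLAS_CHECK(
    cublasSgemmStridedBatched(g_cublasH, CUBLAS_OP_T, CUBLAS_OP_N,
        ne01, ne11, ne10,
        &alpha,
        d_X, ne00, (long long) x_ne,   // stride between X slices
        d_Y, ne10, (long long) y_ne,   // stride between Y slices
        &beta,
        d_D, ne01, (long long) d_ne,   // stride between D slices
        batch_count));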
Also, when CUDA is enabled we need to make sure that the big matrix tensors are laid out properly in memory for CUDA, or they will be skipped. I didn’t get to that point yet, still fighting 😉
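For an F32 tensor, “proper in memory” essentially means contiguous. A sketch of that check, along the lines of what ggml does internally but not the exact function (quantized types have block strides and would need a different rule):

// A tensor is usable by the cuBLAS path when its elements are contiguous,
// i.e. the byte strides follow directly from the element size and dimensions.
// This sketch only covers F32 tensors.
static bool tensor_is_contiguous_f32(const struct ggml_tensor * t) {
    return t->type  == GGML_TYPE_F32       &&
           t->nb[0] == sizeof(float)       &&
           t->nb[1] == t->nb[0]*t->ne[0]   &&
           t->nb[2] == t->nb[1]*t->ne[1]   &&
           t->nb[3] == t->nb[2]*t->ne[2];
}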
About this issue
- State: closed
- Created a year ago
- Comments: 29 (8 by maintainers)
I tried using a different stream for each of the 32 mat muls, but didn’t notice any improvement in performance. That’s with pre-initialized streams, so it shouldn’t have the overhead of creating and destroying the streams in each mat mul. However, I did observe a minor improvement in performance by doing the mat muls in f32 instead of f16. My GPU is indeed an Ampere RTX 3080, but there may be other factors. I have not looked at the difference in NSight yet (it doesn’t work under WSL2), but it may simply be that converting f16 to f32 on the CPU is faster than converting f32 to f16. The f16 matrix is smaller too, so there is less data to convert that way. This conversion is necessary because the original data types for the mat mul are f16 x f32, and cuBLAS cannot do that directly.
Overall, after all the changes, I am seeing a speedup of 60-70% in the perplexity times. I am going to start opening PRs with each of the changes individually to make it easier to review. Some of the changes may cause some friction; in particular, implementing host memory pinning and the weight cache may require changes to ggml that are dangerously close to adding GPU-specific details to the ggml interface, which is not what we want (see this comment).
For high-level tracing of what’s going on in the system, NSight Systems is the way to go. It will give you a trace very similar to what `rocprof` provides, with CPU and GPU activity at microsecond-level granularity. With lots of kernels, looking at the trace can get overwhelming, so you might want to add some markers/ranges to pinpoint the problematic `MUL` operations.

If there is interest in that, I can implement it properly and open a PR. However, to do it in a safe way in ggml we would need #1093 to be able to tag the matrices that are constant and can be cached in the GPU safely. Currently, I am just assuming that when multiplying q x f32, the quantized matrix is the weights and therefore constant and safe to cache, but that may not always be the case for other uses of ggml.
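On the markers/ranges point above: a minimal sketch of how the mat mul could be tagged for the NSight Systems timeline with NVTX. The range label and placement are illustrative, and it needs linking against nvToolsExt.

// NVTX ranges make the mat muls easy to spot in the NSight Systems timeline.
#include <nvToolsExt.h>

// ... inside the mul_mat path, around the cuBLAS call:
nvtxRangePushA("ggml MUL_MAT (cuBLAS)");
CUBLAS_CHECK(
    cublasSgemm(g_cublasH, CUBLAS_OP_T, CUBLAS_OP_N,
                ne01, ne11, ne10,
                &alpha, d_X, ne00,
                        d_Y, ne10,
                &beta,  d_D, ne01));
nvtxRangePop();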
I have tried implementing a cache for the quantized matrices (i.e. the weights), and even in the best case where the entire model fits in VRAM, for me this is only ~30% faster in the perplexity test. Considering how much more VRAM this requires, I am not sure that it is worth it.
Anyway, if anybody wants to give it a try, it is at https://github.com/slaren/llama.cpp/commit/5a606d50e9af10fb9fbe4a602ff9101e453bc7ad.
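For reference, the idea behind that commit is roughly the following. This is a simplified sketch with made-up names, not the code from the branch: keep a small table mapping the host pointer of the (assumed constant) weights to a device copy, so each weight matrix is uploaded at most once.

// Hedged sketch of a weight cache: key device copies by the host data pointer
// (src0->data of the weights) so constant matrices are transferred only once.
#define CUDA_CACHE_MAX 1024

struct cuda_cache_entry {
    const void * host_ptr;  // host pointer of the (assumed constant) weights
    void       * dev_ptr;   // resident device copy
};

static struct cuda_cache_entry g_cuda_cache[CUDA_CACHE_MAX];
static int g_cuda_cache_n = 0;

static void * cuda_cache_get(const void * host_ptr, size_t size, cudaStream_t stream) {
    for (int i = 0; i < g_cuda_cache_n; i++) {
        if (g_cuda_cache[i].host_ptr == host_ptr) {
            return g_cuda_cache[i].dev_ptr;  // already resident, no copy needed
        }
    }
    void * dev_ptr = NULL;
    CUDA_CHECK(cudaMalloc(&dev_ptr, size));
    CUDA_CHECK(cudaMemcpyAsync(dev_ptr, host_ptr, size, cudaMemcpyHostToDevice, stream));
    if (g_cuda_cache_n < CUDA_CACHE_MAX) {
        g_cuda_cache[g_cuda_cache_n].host_ptr = host_ptr;
        g_cuda_cache[g_cuda_cache_n].dev_ptr  = dev_ptr;
        g_cuda_cache_n++;
    }
    return dev_ptr;
}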
What about converting f16→f32 on the GPU? It would be less data to copy.
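A minimal sketch of what that could look like (the kernel name and launch configuration are made up): copy the f16 data, which is half the size, and expand it to f32 on the device.

// Expand f16 to f32 on the device, so only the smaller f16 buffer crosses the bus.
#include <cuda_fp16.h>

__global__ void convert_f16_to_f32(const __half * x, float * y, int n) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = __half2float(x[i]);
    }
}

// usage sketch: after cudaMemcpyAsync of the f16 data into d_X_f16
// convert_f16_to_f32<<<(n + 255)/256, 256, 0, g_cudaStream>>>(d_X_f16, d_X, n);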
I found some inefficiencies in the current master; there was a copy that could be made in parallel with the dequantize.
In branch `cuda-cache`, using pinned memory, an additional stream to copy and dequantize at the same time, and caching the weights, it is currently like this:

Overall this brings perplexity time from 40m to 25m, so it’s a nice improvement.
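For context, the copy/dequantize overlap mentioned above can be expressed with a second stream and an event. This is a hedged sketch with illustrative names (`g_cudaStream2`, `copy_done`, and the commented-out dequantize helper), not the code from the branch:

// Upload the f32 activations on a second stream while the main stream
// dequantizes the weights, then make the SGEMM wait for both.
cudaEvent_t copy_done;
CUDA_CHECK(cudaEventCreateWithFlags(&copy_done, cudaEventDisableTiming));

// upload activations on the second stream
CUDA_CHECK(cudaMemcpyAsync(d_Y, y, sizeof(float) * y_ne, cudaMemcpyHostToDevice, g_cudaStream2));
CUDA_CHECK(cudaEventRecord(copy_done, g_cudaStream2));

// meanwhile, dequantize the weights into d_X on the main stream
// (the exact helper name/signature is an assumption here)
// dequantize_row_q_cuda(d_Q, d_X, x_ne, g_cudaStream);

// make the main stream wait for the activation upload, then do the GEMM
CUDA_CHECK(cudaStreamWaitEvent(g_cudaStream, copy_done, 0));
CUBLAS_CHECK(
    cublasSgemm(g_cublasH, CUBLAS_OP_T, CUBLAS_OP_N,
                ne01, ne11, ne10,
                &alpha, d_X, ne00,
                        d_Y, ne10,
                &beta,  d_D, ne01));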