whisper.cpp: Encoder is broken when CUBLAS is ON
This occurs when using the tiny, small, base, medium, and large models.
None of the models used are quantized. The following snippet dumps the encoder's conv embedding (wctx.state->embd_conv) to a JSON file so the CUDA and CPU results can be compared:
// dump the encoder's conv embedding to a JSON array (requires <vector> and <fstream>)
ggml_tensor * tensor = wctx.state->embd_conv;

std::vector<float> tensor_data(ggml_nelements(tensor));
ggml_backend_tensor_get(tensor, tensor_data.data(), 0, ggml_nbytes(tensor));

std::ofstream outFile("encoder_embedding_conv.json");
outFile << "[";
for (uint64_t i = 0; i < tensor_data.size() - 1; i++) {
    outFile << tensor_data[i] << ", ";
}
outFile << tensor_data[tensor_data.size() - 1] << "]";
outFile.close();

return 0; // stop early once the embedding has been written
CUDA: (dumped output not shown)
CPU: (dumped output not shown)
The CUDA backend is always used automatically with large matrix multiplications. At the moment, the only way to disable it completely is to build without CUDA.
https://github.com/ggerganov/whisper.cpp/blob/37a709f6558c6d9783199e2b8cbb136e1c41d346/ggml-cuda.cu#L8243-L8246
All of these need to be true to use FP16 matrix multiplication:
- compute_capability >= CC_VOLTA
- (src0->type == GGML_TYPE_F16 || ggml_is_quantized(src0->type))
- ggml_is_contiguous(src0)
- row_diff == src0->ne[1]
- dst->op_params[0] == GGML_PREC_DEFAULT

Note that the || is inside the parentheses.
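As a sketch, this is how those conditions combine (paraphrased from the linked lines, not a verbatim copy; the variable name use_fp16_cublas is just for illustration):

const bool use_fp16_cublas =
    compute_capability >= CC_VOLTA &&
    (src0->type == GGML_TYPE_F16 || ggml_is_quantized(src0->type)) &&
    ggml_is_contiguous(src0) &&
    row_diff == src0->ne[1] &&
    dst->op_params[0] == GGML_PREC_DEFAULT;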
We would need to find the op that is producing wrong results in CUDA. The easiest way to do this is by using ggml_backend_compare_graph_backend to run the graph both on the CPU and in CUDA at the same time and compare the results. test-backend-ops shows how to do this. If you already know or suspect which op may be the issue, you can add a test case in test-backend-ops to confirm it.
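For illustration, here is a minimal sketch of such a comparison callback, modeled on what test-backend-ops does; it assumes the ggml_backend_eval_callback signature from ggml-backend.h and F32 node outputs, and the function name and tolerance are placeholders:

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// called once per graph node with the same op evaluated on both backends
// (t1 comes from the first backend passed below, t2 from the second)
static bool compare_node(int node_index, ggml_tensor * t1, ggml_tensor * t2, void * user_data) {
    (void) user_data;

    // copy both outputs to host memory (assumes F32 tensors for simplicity)
    std::vector<float> a(ggml_nelements(t1));
    std::vector<float> b(ggml_nelements(t2));
    ggml_backend_tensor_get(t1, a.data(), 0, ggml_nbytes(t1));
    ggml_backend_tensor_get(t2, b.data(), 0, ggml_nbytes(t2));

    double max_err = 0.0;
    for (size_t i = 0; i < a.size(); i++) {
        max_err = std::max(max_err, (double) std::fabs(a[i] - b[i]));
    }

    if (max_err > 1e-3) { // tolerance chosen arbitrarily for this sketch
        fprintf(stderr, "node %d (%s): max abs error %g\n", node_index, ggml_op_name(t1->op), max_err);
        return false; // stop at the first op that diverges
    }
    return true; // results match, keep walking the graph
}

// usage (backend and graph setup omitted):
// ggml_backend_compare_graph_backend(backend_cpu, backend_cuda, gf, compare_node, nullptr);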