whisper.cpp: Encoder is broken when CUBLAS is ON
This occurs when using the tiny, small, base, medium, and large models.
None of the models used are quantized. The following snippet dumps the encoder's conv embedding (wctx.state->embd_conv) to a JSON file so the CUDA and CPU results can be compared:
// dump the encoder's conv embedding to a JSON array (requires <vector> and <fstream>)
ggml_tensor * tensor = wctx.state->embd_conv;

std::vector<float> tensor_data(ggml_nelements(tensor));
ggml_backend_tensor_get(tensor, tensor_data.data(), 0, ggml_nbytes(tensor));

std::ofstream outFile("encoder_embedding_conv.json");
outFile << "[";
for (uint64_t i = 0; i < tensor_data.size() - 1; i++) {
    outFile << tensor_data[i] << ", ";
}
outFile << tensor_data[tensor_data.size() - 1] << "]";
outFile.close();

return 0; // stop early once the embedding has been written
CUDA: (dumped output not shown)
CPU: (dumped output not shown)
The CUDA backend is always used automatically with large matrix multiplications. At the moment, the only way to disable it completely is to build without CUDA.
https://github.com/ggerganov/whisper.cpp/blob/37a709f6558c6d9783199e2b8cbb136e1c41d346/ggml-cuda.cu#L8243-L8246
All of these need to be true to use FP16 matrix multiplication:
- compute_capability >= CC_VOLTA
- (src0->type == GGML_TYPE_F16 || ggml_is_quantized(src0->type))
- ggml_is_contiguous(src0)
- row_diff == src0->ne[1]
- dst->op_params[0] == GGML_PREC_DEFAULT

Note that the || is inside the parentheses.
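As a sketch, this is how those conditions combine (paraphrased from the linked lines, not a verbatim copy; the variable name use_fp16_cublas is just for illustration):

const bool use_fp16_cublas =
    compute_capability >= CC_VOLTA &&
    (src0->type == GGML_TYPE_F16 || ggml_is_quantized(src0->type)) &&
    ggml_is_contiguous(src0) &&
    row_diff == src0->ne[1] &&
    dst->op_params[0] == GGML_PREC_DEFAULT;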
We would need to find the op that is producing wrong results in CUDA. The easiest way to do this is by using ggml_backend_compare_graph_backend to run the graph both on the CPU and in CUDA at the same time and compare the results. test-backend-ops shows how to do this. If you already know or suspect which op may be the issue, you can add a test case in test-backend-ops to confirm it.
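For illustration, here is a minimal sketch of such a comparison callback, modeled on what test-backend-ops does; it assumes the ggml_backend_eval_callback signature from ggml-backend.h and F32 node outputs, and the function name and tolerance are placeholders:

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// called once per graph node with the same op evaluated on both backends
// (t1 comes from the first backend passed below, t2 from the second)
static bool compare_node(int node_index, ggml_tensor * t1, ggml_tensor * t2, void * user_data) {
    (void) user_data;

    // copy both outputs to host memory (assumes F32 tensors for simplicity)
    std::vector<float> a(ggml_nelements(t1));
    std::vector<float> b(ggml_nelements(t2));
    ggml_backend_tensor_get(t1, a.data(), 0, ggml_nbytes(t1));
    ggml_backend_tensor_get(t2, b.data(), 0, ggml_nbytes(t2));

    double max_err = 0.0;
    for (size_t i = 0; i < a.size(); i++) {
        max_err = std::max(max_err, (double) std::fabs(a[i] - b[i]));
    }

    if (max_err > 1e-3) { // tolerance chosen arbitrarily for this sketch
        fprintf(stderr, "node %d (%s): max abs error %g\n", node_index, ggml_op_name(t1->op), max_err);
        return false; // stop at the first op that diverges
    }
    return true; // results match, keep walking the graph
}

// usage (backend and graph setup omitted):
// ggml_backend_compare_graph_backend(backend_cpu, backend_cuda, gf, compare_node, nullptr);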