stable-diffusion.cpp: CUDA cannot generate images

I encountered a strange problem. After enabling CUDA, the generated image comes out pure green, but the same setup works fine on another computer.

sd_cuda.exe  -m meinamix_meinaV11-f16.gguf -p "1girl" -v
Option:
    n_threads:       6
    mode:            txt2img
    model_path:      meinamix_meinaV11-f16.gguf
    output_path:     output.png
    init_img:
    prompt:          1girl
    negative_prompt:
    cfg_scale:       7.00
    width:           512
    height:          512
    sample_method:   euler_a
    schedule:        default
    sample_steps:    20
    strength:        0.75
    rng:             cuda
    seed:            42
    batch_count:     1
System Info:
    BLAS = 1
    SSE3 = 1
    AVX = 1
    AVX2 = 1
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 1
    NEON = 0
    ARM_FMA = 0
    F16C = 1
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[DEBUG] stable-diffusion.cpp:3701 - Using CUDA backend
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1070, compute capability 6.1
[INFO]  stable-diffusion.cpp:3715 - loading model from 'meinamix_meinaV11-f16.gguf'
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   0:                              sd.model.name str
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   1:                             sd.model.dtype i32
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   2:                           sd.model.version i8
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   3:                            sd.vocab.tokens arr
[INFO]  stable-diffusion.cpp:3743 - Stable Diffusion 1.x | meinamix_meinaV11.safetensors
[INFO]  stable-diffusion.cpp:3751 - model data type: f16
[DEBUG] stable-diffusion.cpp:3755 - loading vocab
[DEBUG] stable-diffusion.cpp:3771 - ggml tensor size = 416 bytes
[DEBUG] stable-diffusion.cpp:887  - clip params backend buffer size =  236.18 MB (449 tensors)
[DEBUG] stable-diffusion.cpp:2028 - unet params backend buffer size =  1641.16 MB (706 tensors)
[DEBUG] stable-diffusion.cpp:3118 - vae params backend buffer size =  95.47 MB (164 tensors)
[DEBUG] stable-diffusion.cpp:3780 - preparing memory for the weights
[DEBUG] stable-diffusion.cpp:3798 - loading weights
[DEBUG] stable-diffusion.cpp:3903 - model size = 1969.67MB
[INFO]  stable-diffusion.cpp:3913 - total memory buffer size = 1972.80MB (clip 236.18MB, unet 1641.16MB, vae 95.47MB)
[INFO]  stable-diffusion.cpp:3915 - loading model from 'meinamix_meinaV11-f16.gguf' completed, taking 0.92s
[INFO]  stable-diffusion.cpp:3939 - running in eps-prediction mode
[DEBUG] stable-diffusion.cpp:3966 - finished loaded file
[DEBUG] stable-diffusion.cpp:4647 - prompt after extract and remove lora: "1girl"
[INFO]  stable-diffusion.cpp:4652 - apply_loras completed, taking 0.00s
[DEBUG] stable-diffusion.cpp:1118 - parse '1girl' to [['1girl', 1], ]
[DEBUG] stable-diffusion.cpp:521  - split prompt "1girl" to tokens ["1</w>", "girl</w>", ]
[DEBUG] stable-diffusion.cpp:1051 - learned condition compute buffer size: 1.58 MB
[DEBUG] stable-diffusion.cpp:4061 - computing condition graph completed, taking 455 ms
[DEBUG] stable-diffusion.cpp:1118 - parse '' to [['', 1], ]
[DEBUG] stable-diffusion.cpp:521  - split prompt "" to tokens []
[DEBUG] stable-diffusion.cpp:1051 - learned condition compute buffer size: 1.58 MB
[DEBUG] stable-diffusion.cpp:4061 - computing condition graph completed, taking 415 ms
[INFO]  stable-diffusion.cpp:4681 - get_learned_condition completed, taking 876 ms
[INFO]  stable-diffusion.cpp:4691 - sampling using Euler A method
[INFO]  stable-diffusion.cpp:4694 - generating image: 1/1
[DEBUG] stable-diffusion.cpp:2384 - diffusion compute buffer size: 552.57 MB
  |==================================================| 20/20 - 7.42s/it
[INFO]  stable-diffusion.cpp:4706 - sampling completed, taking 157.10s
[INFO]  stable-diffusion.cpp:4714 - generating 1 latent images completed, taking 157.12s
[INFO]  stable-diffusion.cpp:4716 - decoding 1 latents
[DEBUG] stable-diffusion.cpp:3252 - vae compute buffer size: 1664.00 MB
[DEBUG] stable-diffusion.cpp:4605 - computing vae [mode: DECODE] graph completed, taking 6.65s
[INFO]  stable-diffusion.cpp:4724 - latent 1 decoded, taking 6.66s
[INFO]  stable-diffusion.cpp:4728 - decode_first_stage completed, taking 6.66s
[INFO]  stable-diffusion.cpp:4735 - txt2img completed in 164.66s
save result image to 'output.png'


[image: output.png (pure green)]

About this issue

  • State: closed
  • Created 7 months ago
  • Comments: 29 (17 by maintainers)

Most upvoted comments

Just to provide another data point and a potential fix.

I have a GTX 1070 and also got images with all green pixels. My CUDA version is 12.1.

As @wailovet showed above, the problem seems to come from the CUDA version of mul_mat. One observation: if you run ggml’s test-conv2d case, it will most likely fail if your GPU has compute capability <= 7.5.

I suspect the culprit is in https://github.com/FSSRepo/ggml/blob/70474c6890c015b53dc10a2300ae35246cc73589/src/ggml-cuda.cu#L6953-L6979, where src0 is converted to FP32 if it is not already, but src1 is never checked or converted. If you add a similar section of code to convert src1 to FP32, test-conv2d passes. Unfortunately, my fix crashed sd even though it made test-conv2d pass. I lack the skill to make a bullet-proof fix and will leave that to someone who can do it robustly.

I have got a fix that works. Here is the patch.

diff --git a/src/ggml-cuda.cu b/src/ggml-cuda.cu
index 0d8b8d1..13e443f 100644
--- a/src/ggml-cuda.cu
+++ b/src/ggml-cuda.cu
@@ -6952,7 +6952,9 @@ inline void ggml_cuda_op_mul_mat_cublas(
     }
     else {
         float * src0_ddq_as_f32 = nullptr;
+        float * src1_ddq_as_f32 = nullptr;
         size_t src0_as = 0;
+        size_t src1_as = 0;

         if (src0->type != GGML_TYPE_F32) {
             const to_fp32_cuda_t to_fp32_cuda = ggml_get_to_fp32_cuda(src0->type);
@@ -6960,7 +6962,15 @@ inline void ggml_cuda_op_mul_mat_cublas(
             src0_ddq_as_f32 = (float *) ggml_cuda_pool_malloc(row_diff*ne00 * sizeof(float), &src0_as); // NOLINT
             to_fp32_cuda(src0_dd_i, src0_ddq_as_f32, row_diff*ne00, stream);
         }
+        if (src1->type != GGML_TYPE_F32) {
+            // printf(" src1 is not FP32 \n");
+            const to_fp32_cuda_t to_fp32_cuda = ggml_get_to_fp32_cuda(src1->type);
+            GGML_ASSERT(to_fp32_cuda != nullptr);
+            src1_ddq_as_f32 = (float *) ggml_cuda_pool_malloc(src1_ncols*ne10 * sizeof(float), &src1_as); // NOLINT
+            to_fp32_cuda(src1_ddf_i, src1_ddq_as_f32, src1_ncols*ne10, stream);
+        }
         const float * src0_ddf_i = src0->type == GGML_TYPE_F32 ? (const float *) src0_dd_i : src0_ddq_as_f32;
+        const float * src1_ddf1_i = src1->type == GGML_TYPE_F32 ? (const float *) src1_ddf_i : src1_ddq_as_f32;

         const float alpha = 1.0f;
         const float beta = 0.0f;
@@ -6970,12 +6980,15 @@ inline void ggml_cuda_op_mul_mat_cublas(
             cublasSgemm(g_cublas_handles[id], CUBLAS_OP_T, CUBLAS_OP_N,
                     row_diff, src1_ncols, ne10,
                     &alpha, src0_ddf_i, ne00,
-                            src1_ddf_i, ne10,
+                            src1_ddf1_i, ne10,
                     &beta,  dst_dd_i,   ldc));

         if (src0_as != 0) {
             ggml_cuda_pool_free(src0_ddq_as_f32, src0_as);
         }
+        if (src1_as != 0) {
+            ggml_cuda_pool_free(src1_ddq_as_f32, src1_as);
+        }
     }

Anyone with an older NVIDIA GPU can give it a try. It also fixes two failing test cases: test-conv1d and test-conv2d.
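For readers wondering why the bad path produces near-zero garbage instead of crashing outright: cublasSgemm unconditionally interprets both input pointers as FP32, so a buffer that still holds FP16 data is reinterpreted bit-for-bit, and the resulting magnitudes depend on the stored values rather than on the math. The following standalone sketch (illustrative only, not ggml code; it assumes a CUDA toolkit with cuBLAS) reproduces the symptom:

```cpp
// Standalone illustration (not ggml code): passing FP16 data to cublasSgemm,
// which unconditionally interprets its inputs as FP32.
// Build: nvcc sgemm_fp16_bug.cu -lcublas -o sgemm_fp16_bug
#include <cstdio>
#include <vector>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 4;                                      // 4x4 matrices, all ones
    std::vector<__half> h_a(n * n, __float2half(1.0f));
    std::vector<float>  h_b(n * n, 1.0f), h_c(n * n);

    __half * d_a; float * d_b, * d_c;
    cudaMalloc(&d_a, n * n * sizeof(float));              // oversized so the bad read stays in bounds
    cudaMalloc(&d_b, n * n * sizeof(float));
    cudaMalloc(&d_c, n * n * sizeof(float));
    cudaMemset(d_a, 0, n * n * sizeof(float));
    cudaMemcpy(d_a, h_a.data(), n * n * sizeof(__half), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b.data(), n * n * sizeof(float),  cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // BUG, same shape as in the issue: d_a holds __half values but is passed as float*.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, (const float *) d_a, n, d_b, n, &beta, d_c, n);

    cudaMemcpy(h_c.data(), d_c, n * n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("c[0] = %g (expected 4; the FP16 bits were decoded as FP32 garbage)\n", h_c[0]);

    cublasDestroy(handle);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```

The patch above does the inverse of this sketch: it detects a non-F32 src1 and runs ggml’s to_fp32 conversion kernel on it before handing the pointer to cublasSgemm.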

I’m not very experienced in CUDA; in fact, I’m struggling to add some features that could significantly accelerate image generation, and I’m running into many issues due to my limited understanding of GPU engineering, so I can’t shed much light on the matter. I’m sorry it isn’t working for some people. If I had equivalent hardware for testing, perhaps I could be of assistance.

@wailovet @bssrdf @SmallAndSoft I’ve updated ggml to the latest code. You can try using the latest master branch to see if the issue still persists.

You could also try my pull request #88; I optimized the im2col kernel to make more efficient use of GPU resources.
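For context on what that kernel does and why a broken mul_mat poisons convolutions: im2col unrolls every convolution window into a contiguous row so that conv2d reduces to one large matrix multiplication. Here is a naive, single-channel sketch for illustration (this is not the optimized kernel from PR #88):

```cpp
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Naive im2col: each thread copies one element of one unrolled convolution
// window, so the subsequent convolution becomes a single GEMM.
__global__ void im2col_naive(const float * src, float * dst,
                             int w, int h,       // input width/height
                             int kw, int kh,     // kernel size
                             int ow, int oh,     // output width/height
                             int s0, int s1,     // strides
                             int p0, int p1) {   // padding
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= ow * oh * kw * kh) return;

    int k  = idx % (kw * kh);        // position inside the kernel window
    int o  = idx / (kw * kh);        // which output pixel this window feeds
    int ox = o % ow, oy = o / ow;
    int kx = k % kw, ky = k / kw;

    int ix = ox * s0 - p0 + kx;      // source coordinates, may fall in padding
    int iy = oy * s1 - p1 + ky;

    dst[idx] = (ix >= 0 && ix < w && iy >= 0 && iy < h) ? src[iy * w + ix] : 0.0f;
}

int main() {
    // 4x4 input, 3x3 kernel, stride 1, pad 1 -> 4x4 output, 16 windows of 9 values
    const int w = 4, h = 4, kw = 3, kh = 3, s0 = 1, s1 = 1, p0 = 1, p1 = 1;
    const int ow = (w + 2 * p0 - kw) / s0 + 1;
    const int oh = (h + 2 * p1 - kh) / s1 + 1;
    std::vector<float> src(w * h);
    for (int i = 0; i < w * h; ++i) src[i] = (float) i;

    float *d_src, *d_dst;
    const int total = ow * oh * kw * kh;
    cudaMalloc(&d_src, w * h * sizeof(float));
    cudaMalloc(&d_dst, total * sizeof(float));
    cudaMemcpy(d_src, src.data(), w * h * sizeof(float), cudaMemcpyHostToDevice);

    im2col_naive<<<(total + 255) / 256, 256>>>(d_src, d_dst, w, h, kw, kh, ow, oh, s0, s1, p0, p1);

    std::vector<float> dst(total);
    cudaMemcpy(dst.data(), d_dst, total * sizeof(float), cudaMemcpyDeviceToHost);
    for (int k = 0; k < kw * kh; ++k) printf("%g ", dst[k]); // first unrolled window (zeros are padding)
    printf("\n");
    cudaFree(d_src); cudaFree(d_dst);
    return 0;
}
```

The optimized kernel in the PR presumably improves memory-access patterns; the data layout it produces is the same.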

I tried enabling taesd and got this result: [image: output]

Running it on another laptop of mine generates normal images, and it is significantly faster.

Differences in “result” (the ggml_mul_mat output tensor)

CPU result
-0.304439 0.422194 0.0867985 -0.19692 0.223478 0.438775 -0.0987804 0.0194783 -0.650625 0.692133 -0.734613 -0.017556 1.1144 0.0192951 -0.619648 -0.0158069 -0.333611 0.840091 -1.09174 0.428399 0.341398 0.275071 -0.269062 -0.170968 -0.28541 -0.251124 0.208278 -0.29216 0.314511 -0.10386 0.0744066 0.141419

CUDA result
7.46726e-13 -2.17465e-09 8.88322e-10 -2.58623e-09 -4.62539e-10 4.44896e-09 8.31899e-10 -1.79477e-10 2.92877e-10 -1.62645e-09 1.65822e-09 -3.76766e-09 2.47073e-09 2.52588e-09 -2.88589e-10 2.86937e-10 4.25023e-11 2.44531e-10 3.76198e-10 -1.71427e-09 -1.19091e-09 2.9318e-09 -1.80849e-09 4.23559e-10 -1.52254e-09 3.87822e-09 3.07924e-10 -7.92377e-10 -1.29449e-09 -7.56863e-10 1.57558e-11 -1.30719e-09 
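For anyone who wants to reproduce this kind of side-by-side dump, a small helper along these lines is enough (a sketch against ggml’s public API; dump_tensor_head is a made-up name, and it assumes the tensor data is in host memory, e.g. already copied back from the device):

```cpp
#include <cstdio>
#include "ggml.h"

// Print the first n elements of a tensor so two backends' outputs can be diffed.
// ggml_get_f32_1d reads element i, converting f16 storage to f32.
static void dump_tensor_head(const char * tag, const struct ggml_tensor * t, int n) {
    printf("%s:", tag);
    const int64_t ne = ggml_nelements(t);
    for (int i = 0; i < n && i < ne; ++i) {
        printf(" %g", ggml_get_f32_1d(t, i));
    }
    printf("\n");
}

// usage: dump_tensor_head("CPU result", result, 32);
```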

Could you compare the CLIP outputs (hidden states) of get_learned_condition() to confirm whether the im2col kernel is the only possible source of the issue?

ggml_tensor * positive = sd->get_learned_condition(work_ctx, prompt);

CPU output:  -0.387249 0.0171568 -0.054192 -0.183599 -0.0261911 -0.338466 -0.0235674 -0.187387 0.186605 -0.0903851
CUDA output: -0.387245 0.0171541 -0.0541848 -0.18359 -0.026197 -0.338474 -0.0235705 -0.187385 0.186602 -0.0903773

Maybe there are some subtle differences, but I think the impact should be minimal.
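One way to quantify “subtle differences” is a max-absolute-difference check: for the values quoted above it comes out around 7e-6, which is consistent with ordinary floating-point reordering between backends rather than a broken kernel. A minimal sketch using the first three quoted values (max_abs_diff is an illustrative helper, not part of the project):

```cpp
#include <cmath>
#include <cstdio>

// Returns the largest absolute elementwise difference between two buffers.
static float max_abs_diff(const float * a, const float * b, int n) {
    float m = 0.0f;
    for (int i = 0; i < n; ++i) m = std::fmax(m, std::fabs(a[i] - b[i]));
    return m;
}

int main() {
    const float cpu [] = {-0.387249f, 0.0171568f, -0.054192f};
    const float cuda[] = {-0.387245f, 0.0171541f, -0.0541848f};
    printf("max |cpu - cuda| = %g\n", max_abs_diff(cpu, cuda, 3)); // ~7e-06
    return 0;
}
```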

Here is my check of the output

positive: ✔️   negative: ✔️

struct ggml_tensor * im2col = ggml_im2col(ctx, a, b, s0, s1, p0, p1, d0, d1, true);                                       // im2col: ✔️
struct ggml_tensor * mma    = ggml_reshape_2d(ctx, im2col, im2col->ne[0], im2col->ne[3] * im2col->ne[2] * im2col->ne[1]); // mma:    ✔️
struct ggml_tensor * mmb    = ggml_reshape_2d(ctx, a, (a->ne[0] * a->ne[1] * a->ne[2]), a->ne[3]);                        // mmb:    ✔️
struct ggml_tensor * result = ggml_mul_mat(ctx, mma, mmb);                                                                // result: ❌

> @wailovet @bssrdf @SmallAndSoft I’ve updated ggml to the latest code. You can try using the latest master branch to see if the issue still persists.

I tried it, and the CUDA results are all fine now. Thank you very much!

@leejet That fixed the issue for my GTX 1060. Thank you very much!

I’ve attempted to update this branch https://github.com/leejet/stable-diffusion.cpp/pull/134 to the latest ggml, but encountered some issues when generating images larger than 512x512. I haven’t had time to pinpoint the exact cause yet.

Once the upstream ggml merges your PR, I’ll update ggml to the corresponding commit to fix this issue.

That’ll be great! Glad I can finally try SD on my generations-old 1070. Still, it is much faster than the CPU 😄

> Once the upstream ggml merges your PR, I’ll update ggml to the corresponding commit to fix this issue.

It doesn’t look wrong. How could this happen?

I’m confused too; llama.cpp works fine on this machine. Maybe I should buy a new GPU.

This is usually caused by insufficient GPU memory.
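For anyone who wants to test the low-memory theory rather than guess: per the log above, this f16 model needs roughly 2 GB of weights plus about 2.2 GB of compute buffers (diffusion 552.57 MB + vae 1664.00 MB), which should fit comfortably in the GTX 1070’s 8 GB. The CUDA runtime can report free VRAM directly; here is a minimal check (a sketch, not part of stable-diffusion.cpp):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t free_b = 0, total_b = 0;
    if (cudaMemGetInfo(&free_b, &total_b) != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed\n");
        return 1;
    }
    // Compare the free amount against the model's weight + compute buffer needs.
    printf("VRAM: %.1f GiB free / %.1f GiB total\n",
           free_b  / (1024.0 * 1024.0 * 1024.0),
           total_b / (1024.0 * 1024.0 * 1024.0));
    return 0;
}
```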