stable-diffusion.cpp: CUDA cannot generate images

I encountered a strange problem. After enabling CUDA, the generated image comes out pure green, but the same setup works fine on another computer.

sd_cuda.exe  -m meinamix_meinaV11-f16.gguf -p "1girl" -v
Option:
    n_threads:       6
    mode:            txt2img
    model_path:      meinamix_meinaV11-f16.gguf
    output_path:     output.png
    init_img:
    prompt:          1girl
    negative_prompt:
    cfg_scale:       7.00
    width:           512
    height:          512
    sample_method:   euler_a
    schedule:        default
    sample_steps:    20
    strength:        0.75
    rng:             cuda
    seed:            42
    batch_count:     1
System Info:
    BLAS = 1
    SSE3 = 1
    AVX = 1
    AVX2 = 1
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 1
    NEON = 0
    ARM_FMA = 0
    F16C = 1
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[DEBUG] stable-diffusion.cpp:3701 - Using CUDA backend
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1070, compute capability 6.1
[INFO]  stable-diffusion.cpp:3715 - loading model from 'meinamix_meinaV11-f16.gguf'
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   0:                              sd.model.name str
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   1:                             sd.model.dtype i32
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   2:                           sd.model.version i8
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   3:                            sd.vocab.tokens arr
[INFO]  stable-diffusion.cpp:3743 - Stable Diffusion 1.x | meinamix_meinaV11.safetensors
[INFO]  stable-diffusion.cpp:3751 - model data type: f16
[DEBUG] stable-diffusion.cpp:3755 - loading vocab
[DEBUG] stable-diffusion.cpp:3771 - ggml tensor size = 416 bytes
[DEBUG] stable-diffusion.cpp:887  - clip params backend buffer size =  236.18 MB (449 tensors)
[DEBUG] stable-diffusion.cpp:2028 - unet params backend buffer size =  1641.16 MB (706 tensors)
[DEBUG] stable-diffusion.cpp:3118 - vae params backend buffer size =  95.47 MB (164 tensors)
[DEBUG] stable-diffusion.cpp:3780 - preparing memory for the weights
[DEBUG] stable-diffusion.cpp:3798 - loading weights
[DEBUG] stable-diffusion.cpp:3903 - model size = 1969.67MB
[INFO]  stable-diffusion.cpp:3913 - total memory buffer size = 1972.80MB (clip 236.18MB, unet 1641.16MB, vae 95.47MB)
[INFO]  stable-diffusion.cpp:3915 - loading model from 'meinamix_meinaV11-f16.gguf' completed, taking 0.92s
[INFO]  stable-diffusion.cpp:3939 - running in eps-prediction mode
[DEBUG] stable-diffusion.cpp:3966 - finished loaded file
[DEBUG] stable-diffusion.cpp:4647 - prompt after extract and remove lora: "1girl"
[INFO]  stable-diffusion.cpp:4652 - apply_loras completed, taking 0.00s
[DEBUG] stable-diffusion.cpp:1118 - parse '1girl' to [['1girl', 1], ]
[DEBUG] stable-diffusion.cpp:521  - split prompt "1girl" to tokens ["1</w>", "girl</w>", ]
[DEBUG] stable-diffusion.cpp:1051 - learned condition compute buffer size: 1.58 MB
[DEBUG] stable-diffusion.cpp:4061 - computing condition graph completed, taking 455 ms
[DEBUG] stable-diffusion.cpp:1118 - parse '' to [['', 1], ]
[DEBUG] stable-diffusion.cpp:521  - split prompt "" to tokens []
[DEBUG] stable-diffusion.cpp:1051 - learned condition compute buffer size: 1.58 MB
[DEBUG] stable-diffusion.cpp:4061 - computing condition graph completed, taking 415 ms
[INFO]  stable-diffusion.cpp:4681 - get_learned_condition completed, taking 876 ms
[INFO]  stable-diffusion.cpp:4691 - sampling using Euler A method
[INFO]  stable-diffusion.cpp:4694 - generating image: 1/1
[DEBUG] stable-diffusion.cpp:2384 - diffusion compute buffer size: 552.57 MB
  |==================================================| 20/20 - 7.42s/it
[INFO]  stable-diffusion.cpp:4706 - sampling completed, taking 157.10s
[INFO]  stable-diffusion.cpp:4714 - generating 1 latent images completed, taking 157.12s
[INFO]  stable-diffusion.cpp:4716 - decoding 1 latents
[DEBUG] stable-diffusion.cpp:3252 - vae compute buffer size: 1664.00 MB
[DEBUG] stable-diffusion.cpp:4605 - computing vae [mode: DECODE] graph completed, taking 6.65s
[INFO]  stable-diffusion.cpp:4724 - latent 1 decoded, taking 6.66s
[INFO]  stable-diffusion.cpp:4728 - decode_first_stage completed, taking 6.66s
[INFO]  stable-diffusion.cpp:4735 - txt2img completed in 164.66s
save result image to 'output.png'


[image: output.png (pure green)]

About this issue

  • State: closed
  • Created 7 months ago
  • Comments: 29 (17 by maintainers)

Most upvoted comments

Just to provide another data point and a potential fix.

I have a GTX 1070 and also got images with all green pixels. My CUDA version is 12.1.

As @wailovet showed above, the problem seems to come from the CUDA version of mul_mat. One observation: if you run ggml’s test-conv2d case, it will most likely fail if your GPU has compute capability <= 7.5.

I suspect the culprit is in https://github.com/FSSRepo/ggml/blob/70474c6890c015b53dc10a2300ae35246cc73589/src/ggml-cuda.cu#L6953-L6979, where src0 is converted to FP32 if it is not already, but src1 is never checked or converted. If you add a similar section of code to convert src1 to FP32, test-conv2d passes. Unfortunately, my fix crashed sd even though it made test-conv2d pass. I lack the skill to make a bullet-proof fix and will leave that to someone who can do it robustly.

I have got a fix that works. Here is the patch.

diff --git a/src/ggml-cuda.cu b/src/ggml-cuda.cu
index 0d8b8d1..13e443f 100644
--- a/src/ggml-cuda.cu
+++ b/src/ggml-cuda.cu
@@ -6952,7 +6952,9 @@ inline void ggml_cuda_op_mul_mat_cublas(
     }
     else {
         float * src0_ddq_as_f32 = nullptr;
+        float * src1_ddq_as_f32 = nullptr;
         size_t src0_as = 0;
+        size_t src1_as = 0;

         if (src0->type != GGML_TYPE_F32) {
             const to_fp32_cuda_t to_fp32_cuda = ggml_get_to_fp32_cuda(src0->type);
@@ -6960,7 +6962,15 @@ inline void ggml_cuda_op_mul_mat_cublas(
             src0_ddq_as_f32 = (float *) ggml_cuda_pool_malloc(row_diff*ne00 * sizeof(float), &src0_as); // NOLINT
             to_fp32_cuda(src0_dd_i, src0_ddq_as_f32, row_diff*ne00, stream);
         }
+        if (src1->type != GGML_TYPE_F32) {
+            // printf(" src1 is not FP32 \n");
+            const to_fp32_cuda_t to_fp32_cuda = ggml_get_to_fp32_cuda(src1->type);
+            GGML_ASSERT(to_fp32_cuda != nullptr);
+            src1_ddq_as_f32 = (float *) ggml_cuda_pool_malloc(src1_ncols*ne10 * sizeof(float), &src1_as); // NOLINT
+            to_fp32_cuda(src1_ddf_i, src1_ddq_as_f32, src1_ncols*ne10, stream);
+        }
         const float * src0_ddf_i = src0->type == GGML_TYPE_F32 ? (const float *) src0_dd_i : src0_ddq_as_f32;
+        const float * src1_ddf1_i = src1->type == GGML_TYPE_F32 ? (const float *) src1_ddf_i : src1_ddq_as_f32;

         const float alpha = 1.0f;
         const float beta = 0.0f;
@@ -6970,12 +6980,15 @@ inline void ggml_cuda_op_mul_mat_cublas(
             cublasSgemm(g_cublas_handles[id], CUBLAS_OP_T, CUBLAS_OP_N,
                     row_diff, src1_ncols, ne10,
                     &alpha, src0_ddf_i, ne00,
-                            src1_ddf_i, ne10,
+                            src1_ddf1_i, ne10,
                     &beta,  dst_dd_i,   ldc));

         if (src0_as != 0) {
             ggml_cuda_pool_free(src0_ddq_as_f32, src0_as);
         }
+        if (src1_as != 0) {
+            ggml_cuda_pool_free(src1_ddq_as_f32, src1_as);
+        }
     }

Anyone with an older NVIDIA GPU can give it a try. It also fixes two failing test cases: test-conv1d and test-conv2d.
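For readers wondering why the bad path produces near-zero garbage instead of crashing outright: cublasSgemm unconditionally interprets both input pointers as FP32, so a buffer that still holds FP16 data is reinterpreted bit-for-bit, and the resulting magnitudes depend on the stored values rather than on the math. The following standalone sketch (illustrative only, not ggml code; it assumes a CUDA toolkit with cuBLAS) reproduces the symptom:

```cpp
// Standalone illustration (not ggml code): passing FP16 data to cublasSgemm,
// which unconditionally interprets its inputs as FP32.
// Build: nvcc sgemm_fp16_bug.cu -lcublas -o sgemm_fp16_bug
#include <cstdio>
#include <vector>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 4;                                      // 4x4 matrices, all ones
    std::vector<__half> h_a(n * n, __float2half(1.0f));
    std::vector<float>  h_b(n * n, 1.0f), h_c(n * n);

    __half * d_a; float * d_b, * d_c;
    cudaMalloc(&d_a, n * n * sizeof(float));              // oversized so the bad read stays in bounds
    cudaMalloc(&d_b, n * n * sizeof(float));
    cudaMalloc(&d_c, n * n * sizeof(float));
    cudaMemset(d_a, 0, n * n * sizeof(float));
    cudaMemcpy(d_a, h_a.data(), n * n * sizeof(__half), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b.data(), n * n * sizeof(float),  cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // BUG, same shape as in the issue: d_a holds __half values but is passed as float*.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, (const float *) d_a, n, d_b, n, &beta, d_c, n);

    cudaMemcpy(h_c.data(), d_c, n * n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("c[0] = %g (expected 4; the FP16 bits were decoded as FP32 garbage)\n", h_c[0]);

    cublasDestroy(handle);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```

The patch above does the inverse of this sketch: it detects a non-F32 src1 and runs ggml’s to_fp32 conversion kernel on it before handing the pointer to cublasSgemm.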

I’m not very experienced in CUDA; in fact, I’m struggling to add some features that could significantly accelerate image generation, and I’m running into many issues due to my limited understanding of GPU engineering, so I can’t shed much light on the matter. I’m sorry it isn’t working for some people. If I had equivalent hardware for testing, perhaps I could be of assistance.

@wailovet @bssrdf @SmallAndSoft I’ve updated ggml to the latest code. You can try using the latest master branch to see if the issue still persists.

You could also try my pull request #88; I optimized the im2col kernel to make more efficient use of GPU resources.
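For context on what that kernel does and why a broken mul_mat poisons convolutions: im2col unrolls every convolution window into a contiguous row so that conv2d reduces to one large matrix multiplication. Here is a naive, single-channel sketch for illustration (this is not the optimized kernel from PR #88):

```cpp
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Naive im2col: each thread copies one element of one unrolled convolution
// window, so the subsequent convolution becomes a single GEMM.
__global__ void im2col_naive(const float * src, float * dst,
                             int w, int h,       // input width/height
                             int kw, int kh,     // kernel size
                             int ow, int oh,     // output width/height
                             int s0, int s1,     // strides
                             int p0, int p1) {   // padding
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= ow * oh * kw * kh) return;

    int k  = idx % (kw * kh);        // position inside the kernel window
    int o  = idx / (kw * kh);        // which output pixel this window feeds
    int ox = o % ow, oy = o / ow;
    int kx = k % kw, ky = k / kw;

    int ix = ox * s0 - p0 + kx;      // source coordinates, may fall in padding
    int iy = oy * s1 - p1 + ky;

    dst[idx] = (ix >= 0 && ix < w && iy >= 0 && iy < h) ? src[iy * w + ix] : 0.0f;
}

int main() {
    // 4x4 input, 3x3 kernel, stride 1, pad 1 -> 4x4 output, 16 windows of 9 values
    const int w = 4, h = 4, kw = 3, kh = 3, s0 = 1, s1 = 1, p0 = 1, p1 = 1;
    const int ow = (w + 2 * p0 - kw) / s0 + 1;
    const int oh = (h + 2 * p1 - kh) / s1 + 1;
    std::vector<float> src(w * h);
    for (int i = 0; i < w * h; ++i) src[i] = (float) i;

    float *d_src, *d_dst;
    const int total = ow * oh * kw * kh;
    cudaMalloc(&d_src, w * h * sizeof(float));
    cudaMalloc(&d_dst, total * sizeof(float));
    cudaMemcpy(d_src, src.data(), w * h * sizeof(float), cudaMemcpyHostToDevice);

    im2col_naive<<<(total + 255) / 256, 256>>>(d_src, d_dst, w, h, kw, kh, ow, oh, s0, s1, p0, p1);

    std::vector<float> dst(total);
    cudaMemcpy(dst.data(), d_dst, total * sizeof(float), cudaMemcpyDeviceToHost);
    for (int k = 0; k < kw * kh; ++k) printf("%g ", dst[k]); // first unrolled window (zeros are padding)
    printf("\n");
    cudaFree(d_src); cudaFree(d_dst);
    return 0;
}
```

The optimized kernel in the PR presumably improves memory-access patterns; the data layout it produces is the same.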

I tried enabling taesd and got this result: [image: output]

Running it on another laptop of mine generates normal images, and it is significantly faster.

Differences in “result” (the ggml_mul_mat output tensor)

CPU result
-0.304439 0.422194 0.0867985 -0.19692 0.223478 0.438775 -0.0987804 0.0194783 -0.650625 0.692133 -0.734613 -0.017556 1.1144 0.0192951 -0.619648 -0.0158069 -0.333611 0.840091 -1.09174 0.428399 0.341398 0.275071 -0.269062 -0.170968 -0.28541 -0.251124 0.208278 -0.29216 0.314511 -0.10386 0.0744066 0.141419

CUDA result
7.46726e-13 -2.17465e-09 8.88322e-10 -2.58623e-09 -4.62539e-10 4.44896e-09 8.31899e-10 -1.79477e-10 2.92877e-10 -1.62645e-09 1.65822e-09 -3.76766e-09 2.47073e-09 2.52588e-09 -2.88589e-10 2.86937e-10 4.25023e-11 2.44531e-10 3.76198e-10 -1.71427e-09 -1.19091e-09 2.9318e-09 -1.80849e-09 4.23559e-10 -1.52254e-09 3.87822e-09 3.07924e-10 -7.92377e-10 -1.29449e-09 -7.56863e-10 1.57558e-11 -1.30719e-09 
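For anyone who wants to reproduce this kind of side-by-side dump, a small helper along these lines is enough (a sketch against ggml’s public API; dump_tensor_head is a made-up name, and it assumes the tensor data is in host memory, e.g. already copied back from the device):

```cpp
#include <cstdio>
#include "ggml.h"

// Print the first n elements of a tensor so two backends' outputs can be diffed.
// ggml_get_f32_1d reads element i, converting f16 storage to f32.
static void dump_tensor_head(const char * tag, const struct ggml_tensor * t, int n) {
    printf("%s:", tag);
    const int64_t ne = ggml_nelements(t);
    for (int i = 0; i < n && i < ne; ++i) {
        printf(" %g", ggml_get_f32_1d(t, i));
    }
    printf("\n");
}

// usage: dump_tensor_head("CPU result", result, 32);
```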

Could you compare the CLIP outputs (hidden states) of get_learned_condition() to confirm whether the im2col kernel is the only possible source of the issue?

ggml_tensor * positive = sd->get_learned_condition(work_ctx, prompt);

CPU output:  -0.387249 0.0171568 -0.054192 -0.183599 -0.0261911 -0.338466 -0.0235674 -0.187387 0.186605 -0.0903851
CUDA output: -0.387245 0.0171541 -0.0541848 -0.18359 -0.026197 -0.338474 -0.0235705 -0.187385 0.186602 -0.0903773

Maybe there are some subtle differences, but I think the impact should be minimal.
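One way to quantify “subtle differences” is a max-absolute-difference check: for the values quoted above it comes out around 7e-6, which is consistent with ordinary floating-point reordering between backends rather than a broken kernel. A minimal sketch using the first three quoted values (max_abs_diff is an illustrative helper, not part of the project):

```cpp
#include <cmath>
#include <cstdio>

// Returns the largest absolute elementwise difference between two buffers.
static float max_abs_diff(const float * a, const float * b, int n) {
    float m = 0.0f;
    for (int i = 0; i < n; ++i) m = std::fmax(m, std::fabs(a[i] - b[i]));
    return m;
}

int main() {
    const float cpu [] = {-0.387249f, 0.0171568f, -0.054192f};
    const float cuda[] = {-0.387245f, 0.0171541f, -0.0541848f};
    printf("max |cpu - cuda| = %g\n", max_abs_diff(cpu, cuda, 3)); // ~7e-06
    return 0;
}
```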

Here is my check of the output

positive: ✔️   negative: ✔️

struct ggml_tensor * im2col = ggml_im2col(ctx, a, b, s0, s1, p0, p1, d0, d1, true);                                       // im2col: ✔️
struct ggml_tensor * mma    = ggml_reshape_2d(ctx, im2col, im2col->ne[0], im2col->ne[3] * im2col->ne[2] * im2col->ne[1]); // mma:    ✔️
struct ggml_tensor * mmb    = ggml_reshape_2d(ctx, a, (a->ne[0] * a->ne[1] * a->ne[2]), a->ne[3]);                        // mmb:    ✔️
struct ggml_tensor * result = ggml_mul_mat(ctx, mma, mmb);                                                                // result: ❌

> @wailovet @bssrdf @SmallAndSoft I’ve updated ggml to the latest code. You can try using the latest master branch to see if the issue still persists.

I tried it, and the CUDA results are all fine now. Thank you very much!

@leejet That fixed the issue for my GTX 1060. Thank you very much!

I’ve attempted to update this branch https://github.com/leejet/stable-diffusion.cpp/pull/134 to the latest ggml, but encountered some issues when generating images larger than 512x512. I haven’t had time to pinpoint the exact cause yet.

Once the upstream ggml merges your PR, I’ll update ggml to the corresponding commit to fix this issue.

That’ll be great! Glad I can finally try SD on my generations-old 1070. Still, it is much faster than the CPU 😄

> Once the upstream ggml merges your PR, I’ll update ggml to the corresponding commit to fix this issue.

It doesn’t look wrong. How could this happen?

I’m confused too; llama.cpp works fine on this machine. Maybe I should buy a new GPU.

This is usually caused by insufficient GPU memory.
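For anyone who wants to test the low-memory theory rather than guess: per the log above, this f16 model needs roughly 2 GB of weights plus about 2.2 GB of compute buffers (diffusion 552.57 MB + vae 1664.00 MB), which should fit comfortably in the GTX 1070’s 8 GB. The CUDA runtime can report free VRAM directly; here is a minimal check (a sketch, not part of stable-diffusion.cpp):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t free_b = 0, total_b = 0;
    if (cudaMemGetInfo(&free_b, &total_b) != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed\n");
        return 1;
    }
    // Compare the free amount against the model's weight + compute buffer needs.
    printf("VRAM: %.1f GiB free / %.1f GiB total\n",
           free_b  / (1024.0 * 1024.0 * 1024.0),
           total_b / (1024.0 * 1024.0 * 1024.0));
    return 0;
}
```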