stable-diffusion.cpp: CUDA cannot generate images
I encountered a strange problem. When running with CUDA, I get a pure green picture. However, it works fine on another computer.
sd_cuda.exe -m meinamix_meinaV11-f16.gguf -p "1girl" -v
Option:
n_threads: 6
mode: txt2img
model_path: meinamix_meinaV11-f16.gguf
output_path: output.png
init_img:
prompt: 1girl
negative_prompt:
cfg_scale: 7.00
width: 512
height: 512
sample_method: euler_a
schedule: default
sample_steps: 20
strength: 0.75
rng: cuda
seed: 42
batch_count: 1
System Info:
BLAS = 1
SSE3 = 1
AVX = 1
AVX2 = 1
AVX512 = 0
AVX512_VBMI = 0
AVX512_VNNI = 0
FMA = 1
NEON = 0
ARM_FMA = 0
F16C = 1
FP16_VA = 0
WASM_SIMD = 0
VSX = 0
[DEBUG] stable-diffusion.cpp:3701 - Using CUDA backend
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1070, compute capability 6.1
[INFO] stable-diffusion.cpp:3715 - loading model from 'meinamix_meinaV11-f16.gguf'
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv 0: sd.model.name str
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv 1: sd.model.dtype i32
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv 2: sd.model.version i8
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv 3: sd.vocab.tokens arr
[INFO] stable-diffusion.cpp:3743 - Stable Diffusion 1.x | meinamix_meinaV11.safetensors
[INFO] stable-diffusion.cpp:3751 - model data type: f16
[DEBUG] stable-diffusion.cpp:3755 - loading vocab
[DEBUG] stable-diffusion.cpp:3771 - ggml tensor size = 416 bytes
[DEBUG] stable-diffusion.cpp:887 - clip params backend buffer size = 236.18 MB (449 tensors)
[DEBUG] stable-diffusion.cpp:2028 - unet params backend buffer size = 1641.16 MB (706 tensors)
[DEBUG] stable-diffusion.cpp:3118 - vae params backend buffer size = 95.47 MB (164 tensors)
[DEBUG] stable-diffusion.cpp:3780 - preparing memory for the weights
[DEBUG] stable-diffusion.cpp:3798 - loading weights
[DEBUG] stable-diffusion.cpp:3903 - model size = 1969.67MB
[INFO] stable-diffusion.cpp:3913 - total memory buffer size = 1972.80MB (clip 236.18MB, unet 1641.16MB, vae 95.47MB)
[INFO] stable-diffusion.cpp:3915 - loading model from 'meinamix_meinaV11-f16.gguf' completed, taking 0.92s
[INFO] stable-diffusion.cpp:3939 - running in eps-prediction mode
[DEBUG] stable-diffusion.cpp:3966 - finished loaded file
[DEBUG] stable-diffusion.cpp:4647 - prompt after extract and remove lora: "1girl"
[INFO] stable-diffusion.cpp:4652 - apply_loras completed, taking 0.00s
[DEBUG] stable-diffusion.cpp:1118 - parse '1girl' to [['1girl', 1], ]
[DEBUG] stable-diffusion.cpp:521 - split prompt "1girl" to tokens ["1</w>", "girl</w>", ]
[DEBUG] stable-diffusion.cpp:1051 - learned condition compute buffer size: 1.58 MB
[DEBUG] stable-diffusion.cpp:4061 - computing condition graph completed, taking 455 ms
[DEBUG] stable-diffusion.cpp:1118 - parse '' to [['', 1], ]
[DEBUG] stable-diffusion.cpp:521 - split prompt "" to tokens []
[DEBUG] stable-diffusion.cpp:1051 - learned condition compute buffer size: 1.58 MB
[DEBUG] stable-diffusion.cpp:4061 - computing condition graph completed, taking 415 ms
[INFO] stable-diffusion.cpp:4681 - get_learned_condition completed, taking 876 ms
[INFO] stable-diffusion.cpp:4691 - sampling using Euler A method
[INFO] stable-diffusion.cpp:4694 - generating image: 1/1
[DEBUG] stable-diffusion.cpp:2384 - diffusion compute buffer size: 552.57 MB
|==================================================| 20/20 - 7.42s/it
[INFO] stable-diffusion.cpp:4706 - sampling completed, taking 157.10s
[INFO] stable-diffusion.cpp:4714 - generating 1 latent images completed, taking 157.12s
[INFO] stable-diffusion.cpp:4716 - decoding 1 latents
[DEBUG] stable-diffusion.cpp:3252 - vae compute buffer size: 1664.00 MB
[DEBUG] stable-diffusion.cpp:4605 - computing vae [mode: DECODE] graph completed, taking 6.65s
[INFO] stable-diffusion.cpp:4724 - latent 1 decoded, taking 6.66s
[INFO] stable-diffusion.cpp:4728 - decode_first_stage completed, taking 6.66s
[INFO] stable-diffusion.cpp:4735 - txt2img completed in 164.66s
save result image to 'output.png'
About this issue
- State: closed
- Created 7 months ago
- Comments: 29 (17 by maintainers)
Just to provide another data point and a potential fix.
I have a GTX 1070 and also got images with all green pixels. The CUDA version is 12.1.
As @wailovet showed above, the problem seems to come from the CUDA version of mul_mat. One observation is that if you run ggml's test-conv2d case, it will most likely fail if your GPU has compute capability <= 7.5.
I suspect the culprit is in this code: https://github.com/FSSRepo/ggml/blob/70474c6890c015b53dc10a2300ae35246cc73589/src/ggml-cuda.cu#L6953-L6979
Here, src0 is converted to FP32 if it is not already FP32, but src1 is not checked and converted. If you add a similar section of code to convert src1 to FP32, test-conv2d will pass.
Unfortunately, my fix crashed sd although it made test-conv2d pass. I lack the skill to make a bullet-proof fix and leave that to those who can do it robustly.
I have got a fix that works. Here is the patch.
Anyone with an old NVIDIA GPU can give it a try. It also fixes two test cases: test-conv1d and test-conv2d.
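To illustrate the src0/src1 observation above: the idea is to bring both matmul operands to FP32 before multiplying, not only the first one. The following is a minimal CPU-side sketch of that idea in plain C++. It is not the patch mentioned in this thread and it is not ggml code; Tensor, DType, to_f32 and dot_f32 are hypothetical names made up for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not ggml types.
enum class DType { F16, F32 };

struct Tensor {
    DType type;
    std::vector<uint16_t> f16;  // raw IEEE 754 binary16 bits, used when type == F16
    std::vector<float>    f32;  // used when type == F32
};

// Decode one IEEE 754 binary16 value into a float.
float half_to_float(uint16_t h) {
    const int sign = (h >> 15) & 0x1;
    const int exp  = (h >> 10) & 0x1F;
    const int man  = h & 0x3FF;
    float value;
    if (exp == 0) {
        value = std::ldexp(static_cast<float>(man), -24);              // zero or subnormal
    } else if (exp == 31) {
        value = man ? NAN : INFINITY;                                  // NaN or infinity
    } else {
        value = std::ldexp(static_cast<float>(man + 1024), exp - 25);  // normal number
    }
    return sign ? -value : value;
}

// Return the tensor data as FP32, converting only when the tensor is F16.
std::vector<float> to_f32(const Tensor & t) {
    if (t.type == DType::F32) {
        return t.f32;
    }
    std::vector<float> out(t.f16.size());
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = half_to_float(t.f16[i]);
    }
    return out;
}

// Dot product that normalizes BOTH operands to FP32 first.
// Converting only src0 and reading src1 as if it were FP32 is the
// kind of mismatch the comment above suspects.
float dot_f32(const Tensor & src0, const Tensor & src1) {
    const std::vector<float> a = to_f32(src0);
    const std::vector<float> b = to_f32(src1);
    assert(a.size() == b.size());
    float acc = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        acc += a[i] * b[i];
    }
    return acc;
}
```

With this shape, the caller does not need to know which operand is stored as F16: dot_f32 gives the same answer for F16 x F32, F32 x F16 and F16 x F16 inputs that encode the same values.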
I’m not very experienced in CUDA; in fact, I’m struggling to add some features that could significantly accelerate image generation speed in CUDA. However, I’m facing many issues due to my lack of understanding in GPU engineering, so I can’t shed light on the matter. I’m sorry that it’s not working for some people. If I had equivalent hardware for testing, perhaps I could be of assistance.
@wailovet @bssrdf @SmallAndSoft I’ve updated ggml to the latest code. You can try using the latest master branch to see if the issue still persists.
I tried enabling taesd and got the result:
Running it on another laptop of mine generates normal images, and it is significantly faster.
ggml_tensor* postive = sd->get_learned_condition(work_ctx, prompt);
CPU output: -0.387249 0.0171568 -0.054192 -0.183599 -0.0261911 -0.338466 -0.0235674 -0.187387 0.186605 -0.0903851
CUDA output: -0.387245 0.0171541 -0.0541848 -0.18359 -0.026197 -0.338474 -0.0235705 -0.187385 0.186602 -0.0903773
Maybe there are some subtle differences, but I think the impact should be minimal.
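The CPU/CUDA comparison above can be made more systematic by measuring the largest element-wise difference instead of eyeballing the numbers. The following is a small sketch in plain C++ using the ten values quoted above; max_abs_diff and all_close are hypothetical helper names, and the tolerance is an assumption chosen for illustration, not a threshold used by stable-diffusion.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest absolute element-wise difference between two equally sized vectors.
double max_abs_diff(const std::vector<double> & a, const std::vector<double> & b) {
    assert(a.size() == b.size());
    double worst = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    }
    return worst;
}

// True when every element differs by at most tol (tol is an assumed budget).
bool all_close(const std::vector<double> & a, const std::vector<double> & b, double tol) {
    return max_abs_diff(a, b) <= tol;
}

// Values copied from the comment above.
const std::vector<double> cpu_out = {
    -0.387249, 0.0171568, -0.054192, -0.183599, -0.0261911,
    -0.338466, -0.0235674, -0.187387, 0.186605, -0.0903851,
};
const std::vector<double> cuda_out = {
    -0.387245, 0.0171541, -0.0541848, -0.18359, -0.026197,
    -0.338474, -0.0235705, -0.187385, 0.186602, -0.0903773,
};
```

For these ten values the largest difference is on the order of 1e-5, which is consistent with the reading that the text-conditioning step is not where the CPU and CUDA results diverge.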
Here is my check of the output
postive: ✔️
negative: ✔️
struct ggml_tensor * im2col = ggml_im2col(ctx, a, b, s0, s1, p0, p1, d0, d1, true);
im2col: ✔️
struct ggml_tensor * mma = ggml_reshape_2d(ctx, im2col, im2col->ne[0], im2col->ne[3] * im2col->ne[2] * im2col->ne[1]);
mma: ✔️
struct ggml_tensor * mmb = ggml_reshape_2d(ctx, a, (a->ne[0] * a->ne[1] * a->ne[2]), a->ne[3]);
mmb: ✔️
struct ggml_tensor * result = ggml_mul_mat(ctx, mma, mmb);
result: ❌
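The chain checked above (im2col, two reshapes, then one matrix multiplication) is the usual way to express a 2D convolution as a matmul. The following is a CPU reference sketch in plain C++ that can serve as a ground truth when checking a backend's output. It assumes stride 1, no padding and no dilation, and uses its own row-major layout rather than ggml's; Conv2dShape, conv2d_direct, im2col and conv2d_im2col are hypothetical names for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Input is [C][H][W], kernel is [OC][C][KH][KW], both row-major.
struct Conv2dShape {
    int C, H, W, OC, KH, KW;
    int OH() const { return H - KH + 1; }  // stride 1, no padding
    int OW() const { return W - KW + 1; }
};

// Direct convolution, used as the ground truth.
std::vector<float> conv2d_direct(const std::vector<float> & x,
                                 const std::vector<float> & k,
                                 const Conv2dShape & s) {
    std::vector<float> y(s.OC * s.OH() * s.OW(), 0.0f);
    for (int oc = 0; oc < s.OC; ++oc)
    for (int oh = 0; oh < s.OH(); ++oh)
    for (int ow = 0; ow < s.OW(); ++ow)
    for (int c = 0; c < s.C; ++c)
    for (int kh = 0; kh < s.KH; ++kh)
    for (int kw = 0; kw < s.KW; ++kw) {
        y[(oc * s.OH() + oh) * s.OW() + ow] +=
            x[(c * s.H + oh + kh) * s.W + ow + kw] *
            k[((oc * s.C + c) * s.KH + kh) * s.KW + kw];
    }
    return y;
}

// Unfold every output position into one row of length C*KH*KW.
std::vector<float> im2col(const std::vector<float> & x, const Conv2dShape & s) {
    const int K = s.C * s.KH * s.KW;
    std::vector<float> cols(s.OH() * s.OW() * K);
    for (int oh = 0; oh < s.OH(); ++oh)
    for (int ow = 0; ow < s.OW(); ++ow)
    for (int c = 0; c < s.C; ++c)
    for (int kh = 0; kh < s.KH; ++kh)
    for (int kw = 0; kw < s.KW; ++kw) {
        cols[(oh * s.OW() + ow) * K + (c * s.KH + kh) * s.KW + kw] =
            x[(c * s.H + oh + kh) * s.W + ow + kw];
    }
    return cols;
}

// Convolution as a matmul: [OC x K] kernel matrix times [P x K] patches.
std::vector<float> conv2d_im2col(const std::vector<float> & x,
                                 const std::vector<float> & k,
                                 const Conv2dShape & s) {
    const int K = s.C * s.KH * s.KW;   // flattened kernel length
    const int P = s.OH() * s.OW();     // number of output positions
    const std::vector<float> patches = im2col(x, s);
    std::vector<float> y(s.OC * P, 0.0f);
    for (int oc = 0; oc < s.OC; ++oc)
    for (int p = 0; p < P; ++p) {
        float acc = 0.0f;
        for (int i = 0; i < K; ++i) {
            acc += k[oc * K + i] * patches[p * K + i];
        }
        y[oc * P + p] = acc;
    }
    return y;
}
```

If the two functions agree on the CPU but a GPU backend's result differs only after the multiplication step, that points at the matmul kernel rather than at im2col or the reshapes, which matches the check marks above.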
I checked the CUDA execution result and everything is fine. Thank you very much!
@leejet That fixed the issue for my GTX 1060. Thank you very much!
I’ve attempted to update this branch https://github.com/leejet/stable-diffusion.cpp/pull/134 to the latest ggml, but encountered some issues when generating images larger than 512x512. I haven’t had time to pinpoint the exact cause yet.
That’ll be great! Glad I can finally try SD with my generations-old 1070. Still, it is much faster than the CPU 😄
Once the upstream ggml merges your PR, I’ll update ggml to the corresponding commit to fix this issue.
I’m confused too. I tried llama.cpp and it also worked fine. Maybe I should buy a new GPU.
This is usually caused by insufficient GPU memory.