diffusers: Potential regression in deterministic outputs

Describe the bug

I’ve started noticing different outputs in recent versions of diffusers, beginning with 0.4.0, when compared against 0.3.0. This is my test code (extracted from a notebook):

import diffusers
from diffusers import StableDiffusionPipeline, DDIMScheduler
import torch
from IPython.display import display

def run_tests(pipe):
    # Reset the global RNG before each generation so runs are comparable
    torch.manual_seed(1000)
    display(pipe("A photo of Barack Obama smiling with a big grin").images[0])
    torch.manual_seed(1000)
    display(pipe("Labrador in the style of Vermeer").images[0])

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe = pipe.to("cuda")
run_tests(pipe)
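
For what it’s worth, the pipeline call also accepts an explicit generator argument, which scopes the seed to each generation instead of mutating the global RNG. A minimal sketch of the same test written that way (run_tests_with_generator is an illustrative helper, assuming latents are created on the same CUDA device as the generator):

def run_tests_with_generator(pipe):
    # Illustrative variant: a fresh, identically seeded generator per
    # call instead of reseeding the global RNG
    prompts = [
        "A photo of Barack Obama smiling with a big grin",
        "Labrador in the style of Vermeer",
    ]
    for prompt in prompts:
        generator = torch.Generator(device="cuda").manual_seed(1000)
        display(pipe(prompt, generator=generator).images[0])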

The first prompt produces identical results. The second one, however, results in different outputs:

[image: “Labrador” output with v0.3.0]

[image: “Labrador” output with main @ a3efa433eac5feba842350c38a1db29244963fb5]

With the DDIM scheduler, both prompts generate different images:

# DDIM configured with Stable Diffusion’s training beta schedule
scheduler = DDIMScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000)
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", scheduler=scheduler)
pipe = pipe.to("cuda")
run_tests(pipe)

[image: “Obama” output with DDIM, v0.3.0]

[image: “Obama” output with DDIM, main]

[image: “Labrador” output with DDIM, v0.3.0]

[image: “Labrador” output with DDIM, main]

In addition, there’s a forum post from a user reporting very different results in the img2img pipeline: https://discuss.huggingface.co/t/notable-differences-between-other-implementations-of-stable-diffusion-particularly-in-the-img2img-pipeline/24635/5. They recently opened issue #901; cross-referencing here, as it may or may not be related to this one.

Reproduction

As explained above.

Logs

No response

System Info

diffusers: main @ a3efa433eac5feba842350c38a1db29244963fb5 vs v0.3.0

Most upvoted comments

Once the pipeline tests are fully updated we should also write a doc explaining the general problem of reproducibility in diffusion models. cc @anton-l

Small update here:

  • 1.) We now know that we cannot guarantee exact reproducibility (only loosely “close” reproducibility) because of https://github.com/pytorch/pytorch/issues/87992 => therefore we can never really guarantee that the exact same images are generated across devices (see the tolerance sketch below)
  • 2.) I checked and I cannot reproduce the difference with this code:
import diffusers
from diffusers import StableDiffusionPipeline, DDIMScheduler
import torch
from IPython.display import display

def run_tests(pipe):
    torch.manual_seed(1000)
    display(pipe("A photo of Barack Obama smiling with a big grin").images[0])
    torch.manual_seed(1000)
    display(pipe("Labrador in the style of Vermeer").images[0])

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe = pipe.to("cuda")
run_tests(pipe)

between 0.3.0 and 0.7.0dev using a V100
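
Given 1.), exact cross-device equality can’t be asserted, so tests can only compare outputs within a tolerance. A minimal sketch of such a check (the helper name and the tolerance here are illustrative assumptions, not project conventions):

import torch

def assert_loosely_close(reference, candidate, atol=1e-2):
    # “Loosely close” reproducibility: tolerate small numerical drift
    # (e.g. from non-deterministic CUDA kernels) but catch real regressions
    if not torch.allclose(reference, candidate, atol=atol):
        max_diff = (reference - candidate).abs().max().item()
        raise AssertionError(f"outputs differ: max abs diff {max_diff:.4f}")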

Overall, this issue now seems much less severe to me than it originally did, and a big part of it is probably simply due to “uncontrollable” randomness.

Next:

  • Add aggressive scheduler tests and check differences between 0.3.0 and 0.7.0dev (see the sketch after this list)
  • Add aggressive minimal step pipeline tests and check differences between 0.3.0 and 0.7.0dev
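
As an illustration of what such a scheduler test could look like, here is a rough sketch that runs DDIM over fixed dummy tensors and reduces the result to one number that can be recorded per version and diffed across releases. scheduler_fingerprint is a hypothetical helper, and the exact scheduler API differs slightly between 0.3.0 and later releases:

import torch
from diffusers import DDIMScheduler

def scheduler_fingerprint(num_inference_steps=10):
    # Hypothetical helper: deterministic dummy inputs, so the same
    # fingerprint should come out of every diffusers version under test
    torch.manual_seed(0)
    sample = torch.randn(1, 4, 8, 8)
    model_output = torch.randn(1, 4, 8, 8)

    scheduler = DDIMScheduler(beta_start=0.00085, beta_end=0.012,
                              beta_schedule="scaled_linear",
                              num_train_timesteps=1000)
    scheduler.set_timesteps(num_inference_steps)

    # Feeding the same model_output at every step is meaningless as
    # diffusion, but fine as a numerical regression fingerprint
    for t in scheduler.timesteps:
        sample = scheduler.step(model_output, t, sample).prev_sample
    return sample.abs().sum().item()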

Just a bit curious:

I checked and I cannot reproduce the difference with this code:

What kind of difference are you checking/looking for here, @patrickvonplaten?

Well, if you mean there is no visual difference, there could still be a numerical difference, as I found in my analysis. I think it would still be a good idea to record when such differences occur across commits (or on a daily basis), so we can track them easily. But it’s just a suggestion.
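
For example, such a record could be as simple as a per-image difference metric computed between a stored reference image and the current output. A minimal sketch (image_max_abs_diff is a hypothetical helper, assuming two same-sized PIL images):

import numpy as np

def image_max_abs_diff(img_a, img_b):
    # 0 means bit-identical; a small value means “visually identical
    # but numerically different”; a large value means a real change
    a = np.asarray(img_a).astype(np.int16)
    b = np.asarray(img_b).astype(np.int16)
    return int(np.abs(a - b).max())

Logging this value per commit (or per nightly run) would make regressions like the Labrador example above easy to spot.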

Update: 0.4.0 seems to suffer from the same behavior as 0.6.0.