diffusers: Potential regression in deterministic outputs

Describe the bug

I’ve started noticing different outputs in recent versions of diffusers, beginning with 0.4.0, when compared against 0.3.0. This is my test code (extracted from a notebook):

import diffusers
from diffusers import StableDiffusionPipeline, DDIMScheduler
import torch
from IPython.display import display

def run_tests(pipe):
    # Reset the global RNG before each generation so runs are comparable
    torch.manual_seed(1000)
    display(pipe("A photo of Barack Obama smiling with a big grin").images[0])
    torch.manual_seed(1000)
    display(pipe("Labrador in the style of Vermeer").images[0])

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe = pipe.to("cuda")
run_tests(pipe)
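
For what it’s worth, the pipeline call also accepts an explicit generator argument, which scopes the seed to each generation instead of mutating the global RNG. A minimal sketch of the same test written that way (run_tests_with_generator is an illustrative helper, assuming latents are created on the same CUDA device as the generator):

def run_tests_with_generator(pipe):
    # Illustrative variant: a fresh, identically seeded generator per
    # call instead of reseeding the global RNG
    prompts = [
        "A photo of Barack Obama smiling with a big grin",
        "Labrador in the style of Vermeer",
    ]
    for prompt in prompts:
        generator = torch.Generator(device="cuda").manual_seed(1000)
        display(pipe(prompt, generator=generator).images[0])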

The first prompt produces identical results. The second one, however, results in different outputs:

[image: “Labrador” output with v0.3.0]

[image: “Labrador” output with main @ a3efa433eac5feba842350c38a1db29244963fb5]

With the DDIM scheduler, both prompts generate different images:

# DDIM configured with Stable Diffusion’s training beta schedule
scheduler = DDIMScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000)
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", scheduler=scheduler)
pipe = pipe.to("cuda")
run_tests(pipe)

[image: “Obama” output with DDIM, v0.3.0]

[image: “Obama” output with DDIM, main]

[image: “Labrador” output with DDIM, v0.3.0]

[image: “Labrador” output with DDIM, main]

In addition, there’s a forum post from a user reporting very different results in the img2img pipeline: https://discuss.huggingface.co/t/notable-differences-between-other-implementations-of-stable-diffusion-particularly-in-the-img2img-pipeline/24635/5. They recently opened issue #901; cross-referencing here, as it may or may not be related to this one.

Reproduction

As explained above.

Logs

No response

System Info

diffusers: main @ a3efa433eac5feba842350c38a1db29244963fb5 vs v0.3.0

Most upvoted comments

Once the pipeline tests are fully updated we should also write a doc explaining the general problem of reproducibility in diffusion models. cc @anton-l

Small update here:

  • 1.) We now know that we cannot guarantee exact reproducibility (only loosely “close” reproducibility) because of https://github.com/pytorch/pytorch/issues/87992 => therefore we can never really guarantee that the exact same images are generated across devices (see the tolerance sketch below)
  • 2.) I checked and I cannot reproduce the difference with this code:
import diffusers
from diffusers import StableDiffusionPipeline, DDIMScheduler
import torch
from IPython.display import display

def run_tests(pipe):
    torch.manual_seed(1000)
    display(pipe("A photo of Barack Obama smiling with a big grin").images[0])
    torch.manual_seed(1000)
    display(pipe("Labrador in the style of Vermeer").images[0])

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe = pipe.to("cuda")
run_tests(pipe)

between 0.3.0 and 0.7.0dev using a V100
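
Given 1.), exact cross-device equality can’t be asserted, so tests can only compare outputs within a tolerance. A minimal sketch of such a check (the helper name and the tolerance here are illustrative assumptions, not project conventions):

import torch

def assert_loosely_close(reference, candidate, atol=1e-2):
    # “Loosely close” reproducibility: tolerate small numerical drift
    # (e.g. from non-deterministic CUDA kernels) but catch real regressions
    if not torch.allclose(reference, candidate, atol=atol):
        max_diff = (reference - candidate).abs().max().item()
        raise AssertionError(f"outputs differ: max abs diff {max_diff:.4f}")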

Overall, this issue now seems much less severe to me than it originally did, and a big part of it is probably simply due to “uncontrollable” randomness.

Next:

  • Add aggressive scheduler tests and check differences between 0.3.0 and 0.7.0dev (see the sketch after this list)
  • Add aggressive minimal step pipeline tests and check differences between 0.3.0 and 0.7.0dev
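
As an illustration of what such a scheduler test could look like, here is a rough sketch that runs DDIM over fixed dummy tensors and reduces the result to one number that can be recorded per version and diffed across releases. scheduler_fingerprint is a hypothetical helper, and the exact scheduler API differs slightly between 0.3.0 and later releases:

import torch
from diffusers import DDIMScheduler

def scheduler_fingerprint(num_inference_steps=10):
    # Hypothetical helper: deterministic dummy inputs, so the same
    # fingerprint should come out of every diffusers version under test
    torch.manual_seed(0)
    sample = torch.randn(1, 4, 8, 8)
    model_output = torch.randn(1, 4, 8, 8)

    scheduler = DDIMScheduler(beta_start=0.00085, beta_end=0.012,
                              beta_schedule="scaled_linear",
                              num_train_timesteps=1000)
    scheduler.set_timesteps(num_inference_steps)

    # Feeding the same model_output at every step is meaningless as
    # diffusion, but fine as a numerical regression fingerprint
    for t in scheduler.timesteps:
        sample = scheduler.step(model_output, t, sample).prev_sample
    return sample.abs().sum().item()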

Just a bit curious:

I checked and I cannot reproduce the difference with this code:

What kind of difference are you checking/looking for here, @patrickvonplaten?

Well, if you mean there is no visual difference, there could still be a numerical difference, as I found in my analysis. I think it would still be a good idea to record when such differences occur across commits (or on a daily basis), so we can track them easily. But it’s just a suggestion.
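
For example, such a record could be as simple as a per-image difference metric computed between a stored reference image and the current output. A minimal sketch (image_max_abs_diff is a hypothetical helper, assuming two same-sized PIL images):

import numpy as np

def image_max_abs_diff(img_a, img_b):
    # 0 means bit-identical; a small value means “visually identical
    # but numerically different”; a large value means a real change
    a = np.asarray(img_a).astype(np.int16)
    b = np.asarray(img_b).astype(np.int16)
    return int(np.abs(a - b).max())

Logging this value per commit (or per nightly run) would make regressions like the Labrador example above easy to spot.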

Update: 0.4.0 seems to suffer from the same behavior as 0.6.0.