diffusers: DDIM produces incorrect samples with SDXL (epsilon or v-prediction)

Describe the bug

When generating images with SDXL and DDIM, there is some residual noise in the outputs.

This gives the outputs a “smudgy” look, and when fewer steps are used, DDIM and Euler diverge far more than they should because of the cumulative impact of improperly aligned timesteps.

In some brief tests, it looks like simply adding an extra timestep with a zero sigma to the end of the schedule resolves the problem.
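
A minimal sketch of that workaround, assuming a scheduler that exposes its sigma schedule as a 1-D sigmas tensor ordered from highest to lowest noise (as EulerDiscreteScheduler does); the helper name is hypothetical:

import torch

# Hypothetical helper: force the schedule to terminate at sigma == 0 so the
# final step fully denoises instead of leaving residual noise behind.
def ensure_zero_terminal_sigma(scheduler):
    if scheduler.sigmas[-1] != 0:
        scheduler.sigmas = torch.cat(
            [scheduler.sigmas, scheduler.sigmas.new_zeros(1)]
        )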

Reproduction

This script uses a modified Euler scheduler to create fully-denoised images:

import torch
from diffusers import StableDiffusionXLPipeline, EulerDiscreteScheduler

model_id = "ptx0/terminus-xl-gamma-training"
pipe = StableDiffusionXLPipeline.from_pretrained(model_id, add_watermarker=False, torch_dtype=torch.bfloat16).to("cuda")
generator = torch.Generator("cuda").manual_seed(420420420)

prompt = "the artful dodger, cool dog in sunglasses sitting on a recliner in the dark, with the white noise reflecting on his sunglasses"
num_inference_steps = 30
guidance_scale = 7.5
def rescale_zero_terminal_snr_sigmas(sigmas):
    # Convert the sigma schedule (highest noise first) to alphas_cumprod
    # (lowest noise first), apply the zero-terminal-SNR rescale from
    # "Common Diffusion Noise Schedules and Sample Steps are Flawed"
    # (Lin et al., Algorithm 1), then convert back to sigmas.
    sigmas = sigmas.flip(0)
    alphas_cumprod = 1 / ((sigmas * sigmas) + 1)
    alphas_bar_sqrt = alphas_cumprod.sqrt()

    # Store old values.
    alphas_bar_sqrt_0 = alphas_bar_sqrt[0].clone()
    alphas_bar_sqrt_T = alphas_bar_sqrt[-1].clone()

    # Shift so the last timestep is zero.
    alphas_bar_sqrt -= alphas_bar_sqrt_T

    # Scale so the first timestep is back to the old value.
    alphas_bar_sqrt *= alphas_bar_sqrt_0 / (alphas_bar_sqrt_0 - alphas_bar_sqrt_T)

    # Convert alphas_bar_sqrt back to sigmas.
    alphas_bar = alphas_bar_sqrt**2  # Revert sqrt
    # Clamp the (now zero) terminal alpha_bar to a tiny positive value so the
    # division below does not produce inf at the last timestep.
    alphas_bar[-1] = 4.8973451890853435e-08
    sigmas = ((1 - alphas_bar) / alphas_bar) ** 0.5
    return sigmas.flip(0)


zsnr = getattr(pipe.scheduler.config, 'rescale_betas_zero_snr', False)
pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)
if zsnr:
    # Monkey-patch set_timesteps so the sigmas are rescaled to zero terminal
    # SNR every time the schedule is rebuilt.
    tsbase = pipe.scheduler.set_timesteps
    def tspatch(*args, **kwargs):
        tsbase(*args, **kwargs)
        pipe.scheduler.sigmas = rescale_zero_terminal_snr_sigmas(pipe.scheduler.sigmas)
    pipe.scheduler.set_timesteps = tspatch

edited_image = pipe(
    prompt=prompt,
    num_inference_steps=num_inference_steps,
    guidance_scale=guidance_scale,
    generator=generator,
    guidance_rescale=0.7,
).images[0]
edited_image.save("edited_image.png")

It uses the sigmas code ported by @Beinsezii in #6024.

However, with vanilla DDIM, the results are far worse:

import torch
from diffusers import StableDiffusionXLPipeline

model_id = "ptx0/terminus-xl-gamma-training"
pipe = StableDiffusionXLPipeline.from_pretrained(model_id, add_watermarker=False, torch_dtype=torch.bfloat16).to("cuda")
generator = torch.Generator("cuda").manual_seed(420420420)

prompt = "the artful dodger, cool dog in sunglasses sitting on a recliner in the dark, with the white noise reflecting on his sunglasses"
num_inference_steps = 30
guidance_scale = 7.5
edited_image = pipe(
    prompt=prompt,
    num_inference_steps=num_inference_steps,
    guidance_scale=guidance_scale,
    generator=generator,
    guidance_rescale=0.7,
).images[0]
edited_image.save("edited_image.png")

(image: DDIM output with visible residual noise)

Logs

No response

System Info

  • diffusers version: 0.21.4
  • Platform: Linux-5.19.0-45-generic-x86_64-with-glibc2.31
  • Python version: 3.9.16
  • PyTorch version (GPU?): 2.1.0+cu118 (True)
  • Huggingface_hub version: 0.16.4
  • Transformers version: 4.30.2
  • Accelerate version: 0.18.0
  • xFormers version: 0.0.22.post4+cu118
  • Using GPU in script?: A100-80G PCIe
  • Using distributed or parallel set-up in script?: FALSE

Who can help?

@patrickvonplaten @yiyixuxu

About this issue

  • State: open
  • Created 7 months ago
  • Comments: 37 (28 by maintainers)

Most upvoted comments

So, after doing some more in-depth research, I actually think there are a few issues going on simultaneously.

1.

The official SAI config for SDXL has set_alpha_to_one: False despite using EulerDiscreteScheduler

So if you inherit the config such as

pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

It’ll inherit the set_alpha_to_one=False value, even though the option is True by default in Diffusers.

2.

The actual diffusers documentation incorrectly specifies a set_alpha_to_one value for Euler, DDPM

…and probably more, as the documentation for the steps_offset kwarg is blindly copy-pasted across all the schedulers that have the option, explicitly stating that

You can use a combination of offset=1 and set_alpha_to_one=False to make the last step use step 0 for the previous alpha product like in Stable Diffusion

However, a cursory glance shows that this basically only applies to DDIM and maybe a few others. Euler, DDPM, DPM Multistep, etc. do not have a set_alpha_to_one kwarg, so the value is silently dropped until someone inherits the configuration later.
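
You can see the silent inheritance directly. A sketch, assuming the stock SDXL base repo, whose shipped scheduler config does contain these keys:

from diffusers import DDIMScheduler, StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")

# EulerDiscreteScheduler never reads set_alpha_to_one, but the unused key
# is kept on the loaded config...
print(pipe.scheduler.config.set_alpha_to_one)  # False

# ...so a scheduler that does read it picks the bad value up on inheritance:
ddim = DDIMScheduler.from_config(pipe.scheduler.config)
print(ddim.config.set_alpha_to_one)  # False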

3.

Why recommend steps_offset=1 and set_alpha_to_one=False?

The documentation implies that steps_offset=1 and set_alpha_to_one=True are contradictory solutions to a final-timestep problem; however, on SDXL that doesn't seem to be the case at all. steps_offset=1 changes the image only very slightly, and set_alpha_to_one=False just adds residual noise regardless of the offset, since the final alpha cumprod is clamped to the largest value in the table (something like 0.997) instead of 1.0.
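
For reference, set_alpha_to_one only controls DDIMScheduler's final_alpha_cumprod, the "previous" alpha product used by the very last denoising step. A quick check (default betas, so the numbers differ slightly from SDXL's):

from diffusers import DDIMScheduler

# Two DDIM schedulers that differ only in set_alpha_to_one.
noisy = DDIMScheduler(set_alpha_to_one=False)
clean = DDIMScheduler(set_alpha_to_one=True)

print(noisy.final_alpha_cumprod)  # alphas_cumprod[0], just under 1.0
print(clean.final_alpha_cumprod)  # tensor(1.0)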

Additionally, steps_offset=1 doesn’t even apply to the trailing timestep spacing that most of the major UIs use nowadays, including ComfyUI, which is more or less SAI’s reference implementation judging by their usage on Discord. With leading spacing the difference is still so small that I wonder what the point even is. Maybe it’s only useful for the older SD pipelines? A spacing comparison follows below, then the figure.
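
To make the spacing difference concrete, here is roughly how the two modes build their timesteps, paraphrased from EulerDiscreteScheduler.set_timesteps, with 30 steps over 1000 training timesteps:

import numpy as np

num_train_timesteps, num_inference_steps, steps_offset = 1000, 30, 1

# "leading": anchored at 0, then shifted up by steps_offset
step_ratio = num_train_timesteps // num_inference_steps
leading = (np.arange(num_inference_steps) * step_ratio).round()[::-1] + steps_offset
# -> [958, 925, ..., 34, 1]

# "trailing": anchored at num_train_timesteps; steps_offset is never applied
step_ratio = num_train_timesteps / num_inference_steps
trailing = np.round(np.arange(num_train_timesteps, 0, -step_ratio)).astype(int) - 1
# -> [999, 966, ..., 66, 32]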

The following figure of SDXL-Base images contains two rows, leading and trailing timestep spacing, with the following columns:

  1. DDIMScheduler(steps_offset=0, set_alpha_to_one=False)
  2. DDIMScheduler(steps_offset=1, set_alpha_to_one=False)
  3. DDIMScheduler(steps_offset=0, set_alpha_to_one=True)
  4. DDIMScheduler(steps_offset=1, set_alpha_to_one=True)
  5. EulerDiscreteScheduler(steps_offset=0)
  6. EulerDiscreteScheduler(steps_offset=1)

(figure: the sampler comparison grid described above)

It’s extremely obvious that the issue plaguing @bghira and myself is the set_alpha_to_one=False being inherited from the XL base config. Additionally, I’m not convinced that steps_offset=1 is ever preferable as a replacement for set_alpha_to_one=True (or even useful at all, really), as it performs identically regardless of whether the final cumprod is overridden or not.

How did we get here?

My theory based on the above: SAI initially opted to use DDIMScheduler for SDXL on Huggingface, as that’s the scheduler in their paper. They set steps_offset=1 and set_alpha_to_one=False per the recommendation in the Diffusers documentation, since that’s the setup “like in Stable Diffusion”. After dealing with horribly noisy images, they changed the scheduler to EulerDiscreteScheduler in their config and left the rest as-is because it seemed to work (and the documentation implies set_alpha_to_one=False should still be used on Euler, despite the kwarg not existing there). So now the set_alpha_to_one=False left over in their scheduler config pollutes downstream uses that inherit from the XL base config.

So, how do we fix it? To be honest, I’m not 100% sure.

Recommend timestep_spacing="trailing" instead of steps_offset=1, set_alpha_to_one=False?
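
In user code, that recommendation would look something like the following sketch, overriding the inherited config values at load time (from_config accepts keyword overrides):

from diffusers import DDIMScheduler

pipe.scheduler = DDIMScheduler.from_config(
    pipe.scheduler.config,
    timestep_spacing="trailing",  # replaces the steps_offset=1 recommendation
    set_alpha_to_one=True,        # final alpha cumprod becomes exactly 1.0
)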

Maybe @patrickvonplaten or @yiyixuxu have a good solution?

No matter what happens with set_alpha_to_one and steps_offset, the documentation on all the schedulers needs some refactoring. People shouldn’t be setting kwargs that don’t exist.

@bghira

if you wanted to simply remove DDIM, I would think that’s fine

DDIM is still used a lot, no? Just not a popular choice with SDXL, I think. Maybe we can add a note in our docs?

Ideally it would be mapped to Euler so that the behaviour remains the same for end users. ComfyUI did this a few months back to reduce duplicate code maintenance overhead as well.

The red dots are the “invisible” watermarker. #4014

I am using:

class NoWatermark:
    def apply_watermark(self, img):
        # Pass images through untouched instead of stamping the watermark.
        return img
...
pipe.watermarker = NoWatermark()

edit: Oh, I see.

class NoWatermark:
    def apply_watermark(self, img):
        return img
...
- pipe.watermarker = NoWatermark()
+ pipe.watermark = NoWatermark()

(The pipeline attribute is watermark, and it needs an instance rather than the class, since apply_watermark is called as a bound method.)

Now there is still some noise, but it is reduced.


I thought DDIM had incorrect samples regardless of ZSNR. If the solution is to simply use Euler and leave DDIM broken, then it may as well be deprecated.

The fact that Euler needs an extra 0 sigma to avoid the residual-noise issue, and that DPM has options like euler_at_final, leads me to believe there’s a bigger problem with how the samplers are called. So either DDIM and the rest all need band-aids, or that off-by-one issue (or whatever it is) needs to be found.

@yiyixuxu could you take a look here?