DeepSpeed: [BUG] Training multiple models with DeepSpeed
Describe the bug I am currently attempting to train a txt2img model (both the text encoder and the unet) using DeepSpeed. I have made some modifications to the code, but I am encountering an error. The error message indicates that there may be an issue with the backward function.
To Reproduce Steps to reproduce the behavior:
ds_config = {
    "train_batch_size": 24,
    "gradient_accumulation_steps": 1,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.01,
            "betas": [args.adam_beta1, args.adam_beta2],
            "weight_decay": args.adam_weight_decay,
            "eps": args.adam_epsilon
        }
    },
    "zero_optimization": {
        "stage": 3,
    },
    # "offload_param": {
    #     "device": "cpu",
    #     "pin_memory": True,
    #     "buffer_count": 5,
    #     "buffer_size": 1e8,
    #     "max_in_cpu": 1e9
    # },
    # "offload_optimizer": {
    #     "device": "cpu",
    #     "pin_memory": True,
    #     "buffer_count": 4,
    #     "fast_init": False
    # }
    # "hybrid_engine": {
    #     "enabled": True,
    #     "inference_tp_size": 8,
    #     "release_inference_cache": False,
    #     "pin_parameters": True,
    #     "tp_gather_partition_size": 8,
    # }
}
text_encoder, text_encoder_optimizer, _, _ = deepspeed.initialize(model=text_encoder, config_params=ds_config)
unet, unet_optimizer, _, _ = deepspeed.initialize(model=unet, config_params=ds_config)
for step, batch in enumerate(train_dataloader):
    with accelerator.accumulate(unet):
        text_encoder_optimizer.zero_grad()
        unet_optimizer.zero_grad()
        # Convert images to latent space
        latents = vae.encode(batch["pixel_values"].to(accelerator.device, dtype=weight_dtype)).latent_dist.sample()
        latents = latents * vae.config.scaling_factor
        # Sample noise that we'll add to the latents
        noise = torch.randn_like(latents)
        bsz = latents.shape[0]
        # Sample a random timestep for each image
        timesteps = torch.randint(0, noise_scheduler.num_train_timesteps, (bsz,), device=latents.device)
        timesteps = timesteps.long()
        # Add noise to the latents according to the noise magnitude at each timestep
        # (this is the forward diffusion process)
        noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
        # Get the text embedding for conditioning
        encoder_hidden_states = text_encoder(batch["input_ids"].to(accelerator.device))[0]
        # Get the target for loss depending on the prediction type
        if noise_scheduler.config.prediction_type == "epsilon":
            target = noise
        elif noise_scheduler.config.prediction_type == "v_prediction":
            target = noise_scheduler.get_velocity(latents, noise, timesteps)
        else:
            raise ValueError(f"Unknown prediction type {noise_scheduler.config.prediction_type}")
        # Predict the noise residual and compute loss
        model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
        loss = F.mse_loss(model_pred.float(), target.float(), reduction="mean")
        # text_encoder.backward(loss, retain_graph=True)
        loss.backward()
        # optimizer.step()
        text_encoder_optimizer.step()
        unet_optimizer.step()
        # text_encoder_optimizer.step()
        # Gather the losses across all processes for logging (if we use distributed training).
        avg_loss = accelerator.gather(loss.repeat(args.train_batch_size)).mean()
        train_loss += avg_loss.item() / args.gradient_accumulation_steps
The error is:
  File "/mnt/dolphinfs/hdd_pool/docker/user/abc/src/diffuser/train_.py", line 582, in train
    unet_optimizer.step()
  File "/usr/local/conda/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/conda/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1752, in step
    norm_groups = self._get_norm_groups()
  File "/usr/local/conda/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/conda/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1568, in _get_norm_groups
    norm_groups.append(self.get_grad_norm_direct(self.averaged_gradients[i], self.fp16_groups[i]))
KeyError: 0
Expected behavior Training multiple models with DeepSpeed should work without error.
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 3
- Comments: 15 (7 by maintainers)
@uygnef, you are correct, batch size does not matter. I have repro’d locally. Will update asap.
Is there any reason I would see this when training a single model? It only occurs with fp16; bf16 and fp32 do not result in this error.
@uygnef, apologies for the delay on this.
The fundamental problem is that the code breaks the gradient partitioning assumptions of ZeRO stage 2/3. In these stages, gradients are partitioned on-the-fly as they are created. The assumption is that gradient creation and partitioning of a model is triggered by the backward of the wrapping engine. However, in this case, where we have two models (text_encoder and unet) with separate engines, the loss is computed from both models' forward passes, and therefore the gradients of both models will be created by loss.backward() rather than by their respective engine's backward. Note that engine.backward() of one model will trigger backward and gradient creation of the other model. My investigation so far suggests that supporting this behavior in ZeRO stage 2/3 is a non-trivial effort. My colleagues suggested some alternatives; hopefully one is suitable for your use case:
FYI, @minjiaz and @yaozhewei
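One pattern that avoids this limitation is to wrap text_encoder and unet in a single nn.Module and create only one DeepSpeed engine, so that a single engine.backward(loss) owns gradient creation and partitioning for all parameters. A minimal sketch, assuming the same ds_config, text_encoder, and unet as in the reproduction above (the CombinedModel wrapper and its forward signature are illustrative, not an existing API):

import torch.nn as nn
import deepspeed

class CombinedModel(nn.Module):
    # Illustrative wrapper: one ZeRO engine owns the parameters and
    # gradients of both sub-models, so its backward hooks fire as expected.
    def __init__(self, text_encoder, unet):
        super().__init__()
        self.text_encoder = text_encoder
        self.unet = unet

    def forward(self, input_ids, noisy_latents, timesteps):
        encoder_hidden_states = self.text_encoder(input_ids)[0]
        return self.unet(noisy_latents, timesteps, encoder_hidden_states).sample

model = CombinedModel(text_encoder, unet)
engine, optimizer, _, _ = deepspeed.initialize(model=model, config_params=ds_config)

# In the training loop, let the engine drive backward and step so that
# ZeRO 2/3 gradient partitioning is triggered by the wrapping engine:
#   model_pred = engine(batch["input_ids"], noisy_latents, timesteps)
#   loss = F.mse_loss(model_pred.float(), target.float(), reduction="mean")
#   engine.backward(loss)
#   engine.step()

Note that with a single engine, the config-defined optimizer covers the combined parameter set; per-model hyperparameters would require passing an optimizer with explicit parameter groups instead.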
@uygnef, thanks for sharing this. Unfortunately, I got the following error from here:
When I replaced it with the corresponding HF code, I got the following error:
Any thoughts?
@tjruwase Hi Tjruwase, I replaced my own model with the Hugging Face model, but all other parts of the code are the same. You can run this to reproduce the error. Thank you very much for your help.
ZeRO stage 1 was successful using this method, but unfortunately stages 2 and 3 were not.
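For reference, a minimal sketch of the stage-1 variant, assuming the same ds_config and initialization as in the reproduction above (ZeRO stage 1 partitions only optimizer states, not gradients, so it does not rely on the gradient-partitioning hooks of stages 2/3):

# Only the ZeRO stage changes relative to the reproduction config.
ds_config["zero_optimization"] = {"stage": 1}

text_encoder, text_encoder_optimizer, _, _ = deepspeed.initialize(
    model=text_encoder, config_params=ds_config)
unet, unet_optimizer, _, _ = deepspeed.initialize(
    model=unet, config_params=ds_config)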
The error for stages 2 and 3 is: