DeepSpeed: [BUG] DeBERTa has bad performance when using ZeRO Stage-3, with continuous warnings "A module has unknown inputs or outputs type"
Describe the bug DeBERTa has bad performance when using ZeRO Stage-3. stdout shows a continuous stream of warnings:
[stage3.py:104:_apply_to_tensors_only] A module has unknown inputs or outputs type (<class 'torch.nn.parameter.Parameter'>) and the tensors embedded in it cannot be detected. The ZeRO-3 hooks designed to trigger before or after backward pass of the module relies on knowing the input and output tensors and therefore may not get triggered properly.
To Reproduce Steps to reproduce the behavior:
- Official HF Accelerate run_glue_no_trainer.py script
- Setting up DeepSpeed ZeRO-3 through the accelerate config command. The output config yaml (a rough DeepSpeed-config equivalent is sketched after this list):
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  zero_stage: 3
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
use_cpu: false
- Bash script to run the fine-tuning of microsoft/deberta-v2-xlarge-mnli on the MRPC dataset using ZeRO Stage-3:
#!/bin/bash
time accelerate launch /home/sourab/deepspeed-test/src/text-classification/run_glue_no_trainer.py \
--task_name "mrpc" \
--max_length 128 \
--model_name_or_path "microsoft/deberta-v2-xlarge-mnli" \
--output_dir "/home/sourab/deepspeed-test/glue/mrpc_deepspeed_stage3_accelerate" \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 16 \
--gradient_accumulation_steps 1 \
--learning_rate 3.5e-6 \
--weight_decay 0.0 \
--max_grad_norm 1.0 \
--num_train_epochs 6 \
--num_warmup_steps 50 \
--with_tracking
- Relevant output snippets. The first one shows the weird behaviour with continuous warnings. The second shows the eval metrics being worse compared to the setup without DeepSpeed.
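For orientation, the accelerate settings above map roughly onto a hand-written DeepSpeed configuration like the sketch below. This is an illustration only: accelerate generates the actual DeepSpeed config internally from the yaml, so the exact auto-filled fields and values may differ.

# Hypothetical DeepSpeed config mirroring the accelerate yaml above (ZeRO Stage-3,
# fp16 mixed precision, gradient_accumulation_steps 1, no optimizer offload).
# Illustration only; accelerate builds its own config, so fields/values may differ.
ds_config = {
    "zero_optimization": {
        "stage": 3,                            # partition params, grads and optimizer states
        "offload_optimizer": {"device": "none"},
    },
    "fp16": {"enabled": True},                 # mixed_precision: fp16
    "gradient_accumulation_steps": 1,
    "train_micro_batch_size_per_gpu": 16,      # per-device batch size from the script above
}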
Expected behavior No continuous stream of warnings and no performance degradation when using DeepSpeed ZeRO Stage-3 with DeBERTa.
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/sourab/dev/lib/python3.8/site-packages/torch']
torch version .................... 1.12.0.dev20220505+cu113
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 10.2
deepspeed install path ........... ['/home/sourab/dev/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.6.4, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3
System info (please complete the following information):
- OS: Ubuntu 20.04.3 LTS (Focal Fossa)
- GPU count and types: 1 machine with 2x NVIDIA TITAN RTX
- Python version: Python 3.8.10
Launcher context
Accelerate launcher, which just triggers the deepspeed launcher.
About this issue
- State: open
- Created 2 years ago
- Comments: 19 (14 by maintainers)
Hi again @stas00, perhaps before I make an issue we can quickly check if this is an issue to be discussed in the context of pytorch-lightning (which is what I am using together with deepspeed).
I made a silly example below that shows the issue:
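A minimal sketch of what such an example might look like (a reconstruction, not the original snippet; the class A, the batch_meta_data argument, and the toy ZeRO-3 config are assumptions based on the description further down, and the original used pytorch-lightning rather than raw deepspeed):

# Reconstruction sketch: a toy module whose forward receives and returns a custom,
# non-tensor object alongside a tensor, run under ZeRO Stage-3.
# (Launch with the deepspeed launcher, e.g. `deepspeed repro.py`, so distributed
# init is set up; "repro.py" is just an example filename.)
import torch
import deepspeed


class A:
    """Stand-in for arbitrary batch meta data that is not a tensor."""
    pass


class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x, batch_meta_data):
        # Returning the non-tensor object from forward is what the ZeRO-3 hook
        # machinery cannot inspect, which produces the warning shown below.
        return self.linear(x), batch_meta_data


ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 3},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "fp16": {"enabled": True},
}

model = Toy()
engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
x = torch.randn(1, 4, device=engine.device, dtype=torch.half)
out, _ = engine(x, (A(), [[[10], []]]))   # non-tensor objects ride along with the batch
engine.backward(out.sum())
engine.step()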
this gives me the warning:
[2023-01-04 17:36:20,111] [WARNING] [parameter_offload.py:55:_apply_to_tensors_only] A module has unknown inputs or outputs type (<class '__main__.A'>) and the tensors embedded in it cannot be detected. The ZeRO-3 hooks designed to trigger before or after backward pass of the module relies on knowing the input and output tensors and therefore may not get triggered properly.
because I have this non-built-in/non-tensor thing (batch_meta_data) in the batch representation. My specific issue: I sometimes put such custom objects (here simulated by the custom object A() and [[[10],[]]]) alongside my batch tensors for on-the-fly analysis, integration with other models, and a whole host of other complicated things. I still don't understand what's going on with _apply_to_tensors_only, but my guess is that it should probably skip these things once it establishes that they have nothing to do with tensors, like it seems to do with built-in objects (see the sketch below).
@pacman100, thanks for sharing your update. I am glad that the performance problem is resolved in the latest code. I have created #1974 to suppress the warning noise. The PR probably needs tweaking, such as whether to report this warning some fixed number of times. Right now, it is completely turned off except for debugging mode. Can you please test the PR branch?
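To illustrate the guess above about skipping unknown types, here is a conceptual sketch of a recursive "apply to tensors only" traversal. This is not DeepSpeed's actual implementation; it only shows the idea of applying a hook-registering function to tensors inside nested built-in containers while silently passing through anything else:

import torch

def apply_to_tensors_only(fn, value):
    # Conceptual sketch only (not DeepSpeed's code): recurse into built-in
    # containers, apply fn to tensors, and return any other object unchanged
    # instead of warning about it.
    if torch.is_tensor(value):
        return fn(value)
    if isinstance(value, (list, tuple)):
        return type(value)(apply_to_tensors_only(fn, v) for v in value)
    if isinstance(value, dict):
        return {k: apply_to_tensors_only(fn, v) for k, v in value.items()}
    return value  # unknown/custom objects (e.g. A()) are simply left alone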
, but my guess is that it should probably skip these things once it establishes that it has nothing to do with tensors, like it seems to do with built-in objects.@pacman100, thanks for sharing your update. I am glad that performance problem is resolved in the latest code. I have created this #1974 to suppress the warning noise. The PR probably needs tweaking such as whether to report this warning some fixed number of times. Right now, it is complete turned off except for debugging mode. Can you please test the PR branch?