DeepSpeed: [BUG] DeBERTa has bad performance when using ZeRO Stage-3, with continuous warnings "A module has unknown inputs or outputs type"
Describe the bug DeBERTa has bad performance when using ZeRO Stage-3. stdout shows a continuous stream of warnings:
[stage3.py:104:_apply_to_tensors_only] A module has unknown inputs or outputs type (<class 'torch.nn.parameter.Parameter'>) and the tensors embedded in it cannot be detected. The ZeRO-3 hooks designed to trigger before or after backward pass of the module relies on knowing the input and output tensors and therefore may not get triggered properly.
To Reproduce Steps to reproduce the behavior:
- Official HF Accelerate run_glue_no_trainer.py script
- Setting up DeepSpeed ZeRO-3 through the accelerate config command. The output config yaml (a rough DeepSpeed-config equivalent is sketched after this list):
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  zero_stage: 3
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
use_cpu: false
- Bash script to run the fine-tuning of microsoft/deberta-v2-xlarge-mnli on the MRPC dataset using ZeRO Stage-3:
#!/bin/bash
time accelerate launch /home/sourab/deepspeed-test/src/text-classification/run_glue_no_trainer.py \
--task_name "mrpc" \
--max_length 128 \
--model_name_or_path "microsoft/deberta-v2-xlarge-mnli" \
--output_dir "/home/sourab/deepspeed-test/glue/mrpc_deepspeed_stage3_accelerate" \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 16 \
--gradient_accumulation_steps 1 \
--learning_rate 3.5e-6 \
--weight_decay 0.0 \
--max_grad_norm 1.0 \
--num_train_epochs 6 \
--num_warmup_steps 50 \
--with_tracking
- Relevant output snippets. The first one shows the weird behaviour with continuous warnings. The second shows the eval metrics being worse compared to the setup without DeepSpeed.
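For orientation, the accelerate settings above map roughly onto a hand-written DeepSpeed configuration like the sketch below. This is an illustration only: accelerate generates the actual DeepSpeed config internally from the yaml, so the exact auto-filled fields and values may differ.

# Hypothetical DeepSpeed config mirroring the accelerate yaml above (ZeRO Stage-3,
# fp16 mixed precision, gradient_accumulation_steps 1, no optimizer offload).
# Illustration only; accelerate builds its own config, so fields/values may differ.
ds_config = {
    "zero_optimization": {
        "stage": 3,                            # partition params, grads and optimizer states
        "offload_optimizer": {"device": "none"},
    },
    "fp16": {"enabled": True},                 # mixed_precision: fp16
    "gradient_accumulation_steps": 1,
    "train_micro_batch_size_per_gpu": 16,      # per-device batch size from the script above
}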
Expected behavior No continuous stream of warnings and no performance degradation when using DeepSpeed ZeRO Stage-3 with DeBERTa.
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/sourab/dev/lib/python3.8/site-packages/torch']
torch version .................... 1.12.0.dev20220505+cu113
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 10.2
deepspeed install path ........... ['/home/sourab/dev/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.6.4, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3
System info (please complete the following information):
- OS: Ubuntu 20.04.3 LTS (Focal Fossa)
- GPU count and types: 1 machine with 2x NVIDIA TITAN RTX
- Python version: Python 3.8.10
Launcher context
Accelerate launcher, which just triggers the deepspeed launcher.
About this issue
- State: open
- Created 2 years ago
- Comments: 19 (14 by maintainers)
Hi again @stas00, perhaps before I make an issue we can quickly check if this is an issue to be discussed in the context of pytorch-lightning (which is what I am using together with deepspeed).
I made a silly example below that shows the issue:
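A minimal sketch of what such an example might look like (a reconstruction, not the original snippet; the class A, the batch_meta_data argument, and the toy ZeRO-3 config are assumptions based on the description further down, and the original used pytorch-lightning rather than raw deepspeed):

# Reconstruction sketch: a toy module whose forward receives and returns a custom,
# non-tensor object alongside a tensor, run under ZeRO Stage-3.
# (Launch with the deepspeed launcher, e.g. `deepspeed repro.py`, so distributed
# init is set up; "repro.py" is just an example filename.)
import torch
import deepspeed


class A:
    """Stand-in for arbitrary batch meta data that is not a tensor."""
    pass


class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x, batch_meta_data):
        # Returning the non-tensor object from forward is what the ZeRO-3 hook
        # machinery cannot inspect, which produces the warning shown below.
        return self.linear(x), batch_meta_data


ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 3},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "fp16": {"enabled": True},
}

model = Toy()
engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
x = torch.randn(1, 4, device=engine.device, dtype=torch.half)
out, _ = engine(x, (A(), [[[10], []]]))   # non-tensor objects ride along with the batch
engine.backward(out.sum())
engine.step()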
this gives me the warning:
[2023-01-04 17:36:20,111] [WARNING] [parameter_offload.py:55:_apply_to_tensors_only] A module has unknown inputs or outputs type (<class '__main__.A'>) and the tensors embedded in it cannot be detected. The ZeRO-3 hooks designed to trigger before or after backward pass of the module relies on knowing the input and output tensors and therefore may not get triggered properly.
because I have this non-built-in/non-tensor thing (batch_meta_data) in the batch representation. My specific issue: I sometimes put such custom objects (here simulated by the custom object A() and [[[10],[]]]) alongside my batch tensors for on-the-fly analysis, integration with other models, and a whole host of other complicated things. I still don't understand what's going on with _apply_to_tensors_only, but my guess is that it should probably skip these things once it establishes that they have nothing to do with tensors, like it seems to do with built-in objects (see the sketch below).
@pacman100, thanks for sharing your update. I am glad that the performance problem is resolved in the latest code. I have created #1974 to suppress the warning noise. The PR probably needs tweaking, such as whether to report this warning some fixed number of times. Right now, it is completely turned off except for debugging mode. Can you please test the PR branch?
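To illustrate the guess above about skipping unknown types, here is a conceptual sketch of a recursive "apply to tensors only" traversal. This is not DeepSpeed's actual implementation; it only shows the idea of applying a hook-registering function to tensors inside nested built-in containers while silently passing through anything else:

import torch

def apply_to_tensors_only(fn, value):
    # Conceptual sketch only (not DeepSpeed's code): recurse into built-in
    # containers, apply fn to tensors, and return any other object unchanged
    # instead of warning about it.
    if torch.is_tensor(value):
        return fn(value)
    if isinstance(value, (list, tuple)):
        return type(value)(apply_to_tensors_only(fn, v) for v in value)
    if isinstance(value, dict):
        return {k: apply_to_tensors_only(fn, v) for k, v in value.items()}
    return value  # unknown/custom objects (e.g. A()) are simply left alone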
, but my guess is that it should probably skip these things once it establishes that it has nothing to do with tensors, like it seems to do with built-in objects.@pacman100, thanks for sharing your update. I am glad that performance problem is resolved in the latest code. I have created this #1974 to suppress the warning noise. The PR probably needs tweaking such as whether to report this warning some fixed number of times. Right now, it is complete turned off except for debugging mode. Can you please test the PR branch?