transformers: DeepSpeed gets stuck when training
Environment info
- transformers version: 4.8.1
- Platform: Linux-4.15.0-140-generic-x86_64-with-debian-buster-sid
- Python version: 3.7.10
- PyTorch version (GPU?): 1.9.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: single gpu
Who can help
Information
Trying to replicate this, I am using a 125M GPT Neo model and fine-tuning it with the Trainer. The training arguments include a DeepSpeed option. The Trainer gets stuck with:
[2021-06-29 14:29:44,747] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.4.1, git-hash=unknown, git-branch=unknown
[2021-06-29 14:29:44,757] [INFO] [utils.py:13:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
ds_report gives:
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the libraries: ['libaio-dev'] but are missing. Can be fixed by: `apt install libaio-dev`.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/jovyan/anaconda3/envs/esemala/lib/python3.7/site-packages/torch']
torch version .................... 1.9.0
torch cuda version ............... 11.1
nvcc version ..................... 10.1
deepspeed install path ........... ['/home/jovyan/anaconda3/envs/esemala/lib/python3.7/site-packages/deepspeed']
deepspeed info ................... 0.4.1, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.9, cuda 11.1
Is there a way to debug this?
To Replicate
I modified the original code slightly to remove the errors:
training_args = tr.TrainingArguments(
    output_dir=save_dir, num_train_epochs=5, logging_steps=300, save_steps=300,
    per_device_train_batch_size=1, per_device_eval_batch_size=1, warmup_steps=50,
    learning_rate=0.001, adam_epsilon=1e-06, fp16=True, weight_decay=0.01,
    logging_dir=f'{save_dir}/logs', deepspeed='./ds_config.json')
and ds_config.json is now:
{
    "fp16": {
        "enabled": true,
        "min_loss_scale": 1,
        "opt_level": "O3"
    },
    "zero_optimization": {
        "stage": 3,
        "cpu_offload": true,
        "cpu_offload_params": true,
        "contiguous_gradients": true,
        "overlap_comm": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 0.001,
            "betas": [0.9, 0.999],
            "eps": 1e-6
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 0.001,
            "warmup_num_steps": 50
        }
    },
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "steps_per_print": 1
}
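(Side note on reproduction: this runs on a single GPU from a kubeflow notebook rather than through the deepspeed launcher, so the distributed environment has to be emulated by hand, which is what the os.environ settings discussed in the comments below refer to. A minimal sketch of that setup, following the notebook-deployment example in the Transformers DeepSpeed docs; the exact port value is arbitrary:)

import os

# Emulate the deepspeed launcher for a single-GPU, in-notebook run.
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "9994"   # change if the port is already in use
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"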
About this issue
- State: closed
- Created 3 years ago
- Comments: 30 (13 by maintainers)
I think you could try this solution:
rm -rf ~/.cache/torch_extensions/
ref: https://github.com/huggingface/transformers/issues/12715
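(If you are working from a notebook, the same cache wipe can be done from Python; this just mirrors the shell command above:)

import shutil
from pathlib import Path

# Equivalent of `rm -rf ~/.cache/torch_extensions/`: forces DeepSpeed to
# JIT-rebuild its C++/CUDA ops cleanly on the next run.
shutil.rmtree(Path.home() / ".cache" / "torch_extensions", ignore_errors=True)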
try
export NCCL_P2P_DISABLE=1, it works for me.
Is this already solved? I also have this problem when training inside a pod.
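(If exporting shell variables before the process starts is not an option, e.g. inside a notebook, the NCCL_P2P_DISABLE suggestion above can also be applied from Python, as long as it runs before anything initializes NCCL:)

import os

# Must be set before torch.distributed / DeepSpeed initialize NCCL,
# i.e. before the Trainer is constructed or train() is called.
os.environ["NCCL_P2P_DISABLE"] = "1"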
@stas00
Yes, turning on gradients doesn't make any sense. I was attempting to battle the issue by using the 'gloo' backend that you referred to… not sure how to fix it: https://github.com/microsoft/DeepSpeed/issues/1030
I’m not succeeding at building that Docker image. If I use build_image.sh it hangs; if I try docker build . it fails with some deps missing. Do you have a ready docker image I could pull?
Since kubeflow is run in a docker image, most likely the issue has something to do with its setup/configuration.
It’s very possible. I haven’t run into this myself, so I trust your research.
gloo doesn’t provide the same functionality as nccl, but it looks like the Deepspeed docs say it should work.
OK, what if you do:
deepspeed.init_distributed("gloo") here, instead of deepspeed.init_distributed()?
https://github.com/huggingface/transformers/blob/d7e156bd1ae2467e9ea1dbc44f31da0ed2296aee/src/transformers/training_args.py#L812
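(A hedged sketch of one way to try that without editing the installed transformers sources: wrap deepspeed.init_distributed so the no-argument call made from training_args.py picks up the gloo backend. This assumes the patch runs before TrainingArguments/Trainer trigger the distributed setup:)

import deepspeed

_orig_init_distributed = deepspeed.init_distributed

def _init_distributed_gloo(**kwargs):
    # training_args.py calls deepspeed.init_distributed() with no arguments,
    # so forcing the backend keyword here is enough for this experiment.
    kwargs["dist_backend"] = "gloo"
    return _orig_init_distributed(**kwargs)

deepspeed.init_distributed = _init_distributed_gloo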
I found this issue https://github.com/microsoft/DeepSpeed/issues/1030 where a user was able to use the gloo backend with Deepspeed.
Thanks @stas00, that’s very detailed!
cat /proc/sys/kernel/yama/ptrace_scope yields 1, so I'll do it with faulthandler.
Accidentally found out that when removing the DeepSpeed option from the trainer, it still gets stuck. Removing the os.environ settings starts training as expected again. I also tried letting the settings be discovered via mpi4py, as you wrote in the original post; it says mpi4py needs to be installed (can't install it as I need sudo… again). Could it all be due to the fact that I'm running things not on my own machine directly but using a kubeflow notebook server?
I have dumped the traceback files from all 3 experiments into the same repo. FP16 is on during all of them. "No settings" means that os.environ is commented out. I have also labeled the start of training with \n\nNow training\n\n. Thanks again.
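(For reference, a minimal sketch of the faulthandler approach mentioned above, useful when ptrace_scope=1 prevents tools like py-spy or gdb from attaching to the process; run this at the top of the training script and inspect the dump once it hangs:)

import faulthandler
import signal

# Write all thread stacks to a file so the hang location survives the session.
trace_file = open("hang_traceback.log", "w")

# Option 1: dump automatically if the process is still running after `timeout`
# seconds, repeating until cancelled with faulthandler.cancel_dump_traceback_later().
faulthandler.dump_traceback_later(timeout=600, repeat=True, file=trace_file)

# Option 2: dump on demand from another shell with `kill -USR1 <pid>`.
faulthandler.register(signal.SIGUSR1, file=trace_file, all_threads=True)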