transformers: Deepspeed hang when tuning redpajama-3b
System Info
transformers-cli says:
Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.
- `transformers` version: 4.30.0.dev0
- Platform: Linux-4.18.0-425.10.1.el8_7.x86_64-x86_64-with-glibc2.10
- Python version: 3.8.13
- Huggingface_hub version: 0.15.1
- Safetensors version: 0.3.1
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
ds_report says:
root@etc-gpu-12:/workspace# ds_report
[2023-06-07 16:33:38,748] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.8/site-packages/torch']
torch version .................... 2.0.1+cu117
deepspeed install path ........... ['/opt/conda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.9.4+f2f5f21b, f2f5f21b, master
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.6
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.6
I am running this using a docker image from this dockerfile:
# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel_22-04.html#rel_22-04
FROM nvcr.io/nvidia/pytorch:22.04-py3
# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-23-01.html
# This one OOM's on the tune-broken case
# FROM nvcr.io/nvidia/pytorch:23.01-py3
RUN git clone https://github.com/huggingface/transformers.git
RUN pip install transformers/.
RUN pip install git+https://github.com/huggingface/accelerate.git
# RUN git clone https://github.com/huggingface/accelerate.git
# RUN pip install accelerate/.
RUN pip install git+https://github.com/microsoft/DeepSpeed.git
# RUN git clone https://github.com/microsoft/DeepSpeed.git
# RUN pip install deepspeed/.
RUN pip install git+https://github.com/huggingface/peft.git
RUN pip install datasets evaluate loralib --upgrade --quiet
RUN pip install bitsandbytes rouge-score tensorboard py7zr einops py-spy
RUN pip install jupyter
RUN pip uninstall -y apex
RUN pip uninstall -y apex
# This is so we can run the translation test
RUN pip install -r transformers/examples/pytorch/translation/requirements.txt
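Note (editorial addition, not part of the original report): since several packages above are installed from source, a quick sanity check inside the image can confirm which versions actually ended up installed, for comparison against the ds_report output. Only standard __version__ attributes are used:

# Print the versions installed by the Dockerfile above so they can be
# compared against the transformers-cli env and ds_report output.
import torch
import transformers
import accelerate
import deepspeed

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)
print("deepspeed:", deepspeed.__version__)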
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
Summary:
- I kick off the training script using deepspeed with NO configuration and it fails. I've also tried with ds_config_zero3.json from the test directory and it fails too.
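Note: tune.py builds its TrainingArguments directly rather than parsing --deepspeed from the command line, so for the Trainer to pick up the ZeRO-3 config it has to be passed explicitly. A minimal sketch of that wiring, assuming ds_config_zero3.json sits in the working directory (this is not the reporter's actual code):

# Sketch (assumption, not from the original script): point the HF Trainer at the
# DeepSpeed ZeRO-3 config so the Trainer initializes the DeepSpeed engine itself.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="redpajama-tuning-test",
    per_device_train_batch_size=4,
    report_to="none",
    deepspeed="ds_config_zero3.json",  # assumed path to the config taken from the transformers tests
)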
My script “tune.py”:
#! /usr/bin/env python3
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import DataCollatorForLanguageModeling
from transformers import TrainingArguments, Trainer
from datasets import load_dataset
from accelerate import Accelerator

MIN_TRANSFORMERS_VERSION = '4.25.1'
assert transformers.__version__ >= MIN_TRANSFORMERS_VERSION, f'Please upgrade transformers to version {MIN_TRANSFORMERS_VERSION} or higher.'

accelerator = Accelerator()

# ==============================================================================
# DDP: Usually we use NCCL, so set that.
# Maybe need to use: NCCL_P2P_DISABLE=1
training_args = TrainingArguments(
    output_dir="redpajama-tuning-test",
    # evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    # log_level="debug",
    report_to="none",
    ddp_backend="nccl",
    ddp_timeout=60,
    push_to_hub=False,
)

# ==============================================================================
tokenizer = AutoTokenizer.from_pretrained("togethercomputer/RedPajama-INCITE-Base-3B-v1")
model = AutoModelForCausalLM.from_pretrained("togethercomputer/RedPajama-INCITE-Base-3B-v1")
model.train()
model = model.half()
model = model.cuda()

# ==============================================================================
tokenizer.model_max_length = 512
tokenizer.pad_token = tokenizer.eos_token

eli5 = load_dataset("eli5", split="train_asks[:5000]")
eli5 = eli5.train_test_split(test_size=0.2)
eli5 = eli5.flatten()

def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["answers.text"]])

with training_args.main_process_first(desc="tokenizing"):
    tokenized_eli5 = eli5.map(
        preprocess_function,
        batched=True,
        num_proc=4,
        remove_columns=eli5["train"].column_names,
    )

block_size = 512

def group_texts(examples):
    # Concatenate all texts, then split the result into block_size chunks.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

with training_args.main_process_first(desc="grouping"):
    lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)

# ==============================================================================
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()
I run it with:
deepspeed tune.py
When it deadlocks (pretty reproducibly; sometimes it completes), I use py-spy to get the stack traces.
root@etc-gpu-12:/workspace# py-spy dump --pid 4282
Process 4282: /opt/conda/bin/python3.8 -u tune-broken.py --local_rank=0
Python v3.8.13 (/opt/conda/bin/python3.8)
Thread 4282 (active+gil): "MainThread"
store_flos (transformers/trainer.py:2938)
_inner_training_loop (transformers/trainer.py:2059)
train (transformers/trainer.py:1643)
<module> (tune-broken.py:89)
Thread 4754 (idle): "Thread-4"
wait (threading.py:306)
wait (threading.py:558)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)
Thread 4964 (idle)
Thread 4965 (idle)
root@etc-gpu-12:/workspace# py-spy dump --pid 4283
Process 4283: /opt/conda/bin/python3.8 -u tune-broken.py --local_rank=1
Python v3.8.13 (/opt/conda/bin/python3.8)
Thread 4283 (active): "MainThread"
forward (transformers/models/gpt_neox/modeling_gpt_neox.py:278)
_call_impl (torch/nn/modules/module.py:1501)
forward (transformers/models/gpt_neox/modeling_gpt_neox.py:149)
_call_impl (torch/nn/modules/module.py:1501)
forward (transformers/models/gpt_neox/modeling_gpt_neox.py:331)
_call_impl (torch/nn/modules/module.py:1501)
forward (transformers/models/gpt_neox/modeling_gpt_neox.py:564)
_call_impl (torch/nn/modules/module.py:1501)
forward (transformers/models/gpt_neox/modeling_gpt_neox.py:673)
_call_impl (torch/nn/modules/module.py:1501)
_run_ddp_forward (torch/nn/parallel/distributed.py:1110)
forward (torch/nn/parallel/distributed.py:1156)
_call_impl (torch/nn/modules/module.py:1501)
compute_loss (transformers/trainer.py:2763)
training_step (transformers/trainer.py:2738)
_inner_training_loop (transformers/trainer.py:1928)
train (transformers/trainer.py:1643)
<module> (tune-broken.py:89)
Thread 4824 (idle): "Thread-4"
wait (threading.py:306)
wait (threading.py:558)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)
Thread 4966 (idle)
Thread 4967 (idle)
Expected behavior
Training should complete without deadlocking.
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 26 (8 by maintainers)
Hello,
training for hours isn't feasible, and it won't be a minimal reproducer if it takes hours. For that reason, I changed the following in the above code snippet:
Running the above code:
Output logs:
Therefore, it is working fine and as expected.
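(The exact edits are not shown above. As an illustration only, and not the maintainer's actual diff, the usual way to cut such a run down to a quick reproducer is to cap the number of optimizer steps in TrainingArguments:)

# Illustrative sketch (assumption): cap the run at a handful of steps and log
# every step so a hang or stall shows up within minutes.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="redpajama-tuning-test",
    max_steps=10,        # stop after 10 optimizer steps instead of full epochs
    logging_steps=1,     # log each step so it is obvious where progress stalls
    report_to="none",
)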
I have! I have used all the configs reported in the issue and I still get the problem. However, the problem I get is a deadlock near evaluation time, or near the end of a training epoch, so not when it starts up. As with @pacman100, I can start the training; it just won't complete. So I am trying to get us to the same reproducible case, meaning we are both using half precision (because I only have 40G cards, not the 80G ones you folks have) and the run goes all the way to the end.
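(Side note on the half-precision point, not something stated in the thread: when the Trainer drives DeepSpeed, half precision is normally requested through the training arguments rather than by calling model.half() before handing the model to the Trainer. A minimal sketch under that assumption:)

# Sketch (assumption, not the reporter's script): ask the Trainer/DeepSpeed
# integration for mixed precision instead of pre-casting with model.half().
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="redpajama-tuning-test",
    fp16=True,           # request half/mixed precision from the Trainer
    report_to="none",
)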