transformers: Deepspeed hang when tuning redpajama-3b

System Info

transformers-cli says:

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

- `transformers` version: 4.30.0.dev0
- Platform: Linux-4.18.0-425.10.1.el8_7.x86_64-x86_64-with-glibc2.10
- Python version: 3.8.13
- Huggingface_hub version: 0.15.1
- Safetensors version: 0.3.1
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

ds_report says:

root@etc-gpu-12:/workspace# ds_report
[2023-06-07 16:33:38,748] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.8/site-packages/torch']
torch version .................... 2.0.1+cu117
deepspeed install path ........... ['/opt/conda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.9.4+f2f5f21b, f2f5f21b, master
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.6
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.6

I am running this using a Docker image built from this Dockerfile:

# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel_22-04.html#rel_22-04
FROM nvcr.io/nvidia/pytorch:22.04-py3

# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-23-01.html
# This one OOM's on the tune-broken case
# FROM nvcr.io/nvidia/pytorch:23.01-py3

RUN git clone https://github.com/huggingface/transformers.git
RUN pip install transformers/.

RUN pip install git+https://github.com/huggingface/accelerate.git
# RUN git clone https://github.com/huggingface/accelerate.git
# RUN pip install accelerate/.

RUN pip install git+https://github.com/microsoft/DeepSpeed.git
# RUN git clone https://github.com/microsoft/DeepSpeed.git
# RUN pip install deepspeed/.

RUN pip install git+https://github.com/huggingface/peft.git
RUN pip install datasets evaluate loralib --upgrade --quiet
RUN pip install bitsandbytes rouge-score tensorboard py7zr einops py-spy
RUN pip install jupyter
RUN pip uninstall -y apex
RUN pip uninstall -y apex

# This is so we can run the translation test
RUN pip install -r transformers/examples/pytorch/translation/requirements.txt

Who can help?

@pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

Summary:

  • I kick off the training script using deepspeed with NO DeepSpeed configuration file and it hangs. I’ve also tried with ds_config_zero3.json from the test directory and it hangs too (see the sketch just below for how the config can be wired in).
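
A minimal sketch of how such a config can be passed to the Trainer (assuming ds_config_zero3.json, copied from transformers/tests/deepspeed/, sits next to the script) is to hand it to the deepspeed argument of TrainingArguments:

from transformers import TrainingArguments

# Sketch only: same hyperparameters as in tune.py below, plus an explicit
# DeepSpeed config file (the path is an assumption about where the file lives).
training_args = TrainingArguments(
    output_dir="redpajama-tuning-test",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    report_to="none",
    ddp_backend="nccl",
    ddp_timeout=60,
    push_to_hub=False,
    deepspeed="ds_config_zero3.json",  # a config dict works here as well
)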

My script “tune.py”:

#! /usr/bin/env python3

import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import DataCollatorForLanguageModeling
from transformers import TrainingArguments, Trainer
from datasets import load_dataset

from accelerate import Accelerator

MIN_TRANSFORMERS_VERSION = '4.25.1'

assert transformers.__version__ >= MIN_TRANSFORMERS_VERSION, f'Please upgrade transformers to version {MIN_TRANSFORMERS_VERSION} or higher.'

accelerator = Accelerator()

# ==============================================================================
# DDP: Usually we use NCCL, so set that.
# Maybe need to use: NCCL_P2P_DISABLE=1
training_args = TrainingArguments(
    output_dir="redpajama-tuning-test",
    #evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    #log_level="debug",
    report_to="none",
    ddp_backend="nccl",
    ddp_timeout=60,
    push_to_hub=False
)

# =============================================================================

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/RedPajama-INCITE-Base-3B-v1")
model = AutoModelForCausalLM.from_pretrained("togethercomputer/RedPajama-INCITE-Base-3B-v1")
model.train()
model = model.half()
model = model.cuda()

# =============================================================================

tokenizer.model_max_length=512
tokenizer.pad_token = tokenizer.eos_token

eli5 = load_dataset("eli5", split="train_asks[:5000]")
eli5 = eli5.train_test_split(test_size=0.2)
eli5 = eli5.flatten()

def preprocess_function(examples):
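    # Each ELI5 example stores its answers as a list of strings under
    # "answers.text"; join them into one string per example and tokenize.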
    return tokenizer([" ".join(x) for x in examples["answers.text"]])

with training_args.main_process_first(desc="tokenizing"):
    tokenized_eli5 = eli5.map(
        preprocess_function,
        batched=True,
        num_proc=4,
        remove_columns=eli5["train"].column_names
    )

block_size = 512
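# group_texts concatenates every tokenized field across the batch, drops any
# remainder that does not fill a whole block, re-chunks into fixed blocks of
# block_size tokens, and copies input_ids to labels for causal LM training.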
def group_texts(examples):
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

with training_args.main_process_first(desc="grouping"):
    lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)

# =================================================================================
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator
)
trainer.train()

I run it with:

deepspeed tune.py

When it deadlocks (fairly reproducibly; sometimes it completes), I use py-spy to get the stack traces.
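
To grab the dumps from every rank in one go, a small helper along these lines works (a sketch, not part of the repro; it assumes pgrep and py-spy are on PATH inside the container and that the training script's name appears in the process command line):

#! /usr/bin/env python3
# dump_stacks.py (hypothetical helper): run `py-spy dump` against every python
# process whose command line mentions the training script.
import subprocess

pids = subprocess.run(
    ["pgrep", "-f", "tune.py"], capture_output=True, text=True
).stdout.split()

for pid in pids:
    print(f"===== py-spy dump --pid {pid} =====")
    subprocess.run(["py-spy", "dump", "--pid", pid])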

root@etc-gpu-12:/workspace# py-spy dump --pid 4282
Process 4282: /opt/conda/bin/python3.8 -u tune-broken.py --local_rank=0
Python v3.8.13 (/opt/conda/bin/python3.8)

Thread 4282 (active+gil): "MainThread"
    store_flos (transformers/trainer.py:2938)
    _inner_training_loop (transformers/trainer.py:2059)
    train (transformers/trainer.py:1643)
    <module> (tune-broken.py:89)
Thread 4754 (idle): "Thread-4"
    wait (threading.py:306)
    wait (threading.py:558)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 4964 (idle)
Thread 4965 (idle)
root@etc-gpu-12:/workspace# py-spy dump --pid 4283
Process 4283: /opt/conda/bin/python3.8 -u tune-broken.py --local_rank=1
Python v3.8.13 (/opt/conda/bin/python3.8)

Thread 4283 (active): "MainThread"
    forward (transformers/models/gpt_neox/modeling_gpt_neox.py:278)
    _call_impl (torch/nn/modules/module.py:1501)
    forward (transformers/models/gpt_neox/modeling_gpt_neox.py:149)
    _call_impl (torch/nn/modules/module.py:1501)
    forward (transformers/models/gpt_neox/modeling_gpt_neox.py:331)
    _call_impl (torch/nn/modules/module.py:1501)
    forward (transformers/models/gpt_neox/modeling_gpt_neox.py:564)
    _call_impl (torch/nn/modules/module.py:1501)
    forward (transformers/models/gpt_neox/modeling_gpt_neox.py:673)
    _call_impl (torch/nn/modules/module.py:1501)
    _run_ddp_forward (torch/nn/parallel/distributed.py:1110)
    forward (torch/nn/parallel/distributed.py:1156)
    _call_impl (torch/nn/modules/module.py:1501)
    compute_loss (transformers/trainer.py:2763)
    training_step (transformers/trainer.py:2738)
    _inner_training_loop (transformers/trainer.py:1928)
    train (transformers/trainer.py:1643)
    <module> (tune-broken.py:89)
Thread 4824 (idle): "Thread-4"
    wait (threading.py:306)
    wait (threading.py:558)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 4966 (idle)
Thread 4967 (idle)

Expected behavior

That it shouldn’t deadlock.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 26 (8 by maintainers)

Most upvoted comments

Hello,

Training for hours isn’t feasible, and it wouldn’t be a minimal reproducer if it took hours. For that reason, I changed the following in the above code snippet:

...
- eli5 = load_dataset("eli5", split="train_asks[:5000]")
+ eli5 = load_dataset("eli5", split="train_asks[:100]")

running the above code:

CUDA_VISIBLE_DEVICES=0,1  deepspeed issue_24090.py

Output logs:

[2023-08-29 13:14:58,392] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-29 13:15:00,207] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=0,1: setting --include=localhost:0,1
[2023-08-29 13:15:00,207] [INFO] [runner.py:555:main] cmd = /home/sourab/miniconda3/envs/hf/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None issue_24090.py
[2023-08-29 13:15:02,294] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-29 13:15:04,084] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-08-29 13:15:04,084] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-08-29 13:15:04,084] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-08-29 13:15:04,084] [INFO] [launch.py:163:main] dist_world_size=2
[2023-08-29 13:15:04,084] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2023-08-29 13:15:07,256] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-29 13:15:07,301] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-29 13:15:08,834] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-29 13:15:08,834] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-08-29 13:15:08,860] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-29 13:15:08,860] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-08-29 13:15:08,860] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-08-29 13:15:17,771] [INFO] [partition_parameters.py:326:__exit__] finished initializing model with 2.78B parameters
Found cached dataset eli5 (/raid/sourab/.cache/huggingface/datasets/eli5/LFQA_reddit/1.0.0/17574e5502a10f41bbd17beba83e22475b499fa62caa1384a3d093fc856fe6fa)
Found cached dataset eli5 (/raid/sourab/.cache/huggingface/datasets/eli5/LFQA_reddit/1.0.0/17574e5502a10f41bbd17beba83e22475b499fa62caa1384a3d093fc856fe6fa)
Map (num_proc=4):   0%|                                                                          | 0/80 [00:00<?, ? examples/s]Token indices sequence length is longer than the specified maximum sequence length for this model (790 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (894 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (521 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (602 > 512). Running this sequence through the model will result in indexing errors
Map (num_proc=4):   0%|                                                                          | 0/20 [00:00<?, ? examples/s]Token indices sequence length is longer than the specified maximum sequence length for this model (1481 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (815 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (863 > 512). Running this sequence through the model will result in indexing errors
Map (num_proc=4):   0%|                                                                          | 0/80 [00:00<?, ? examples/s]Token indices sequence length is longer than the specified maximum sequence length for this model (624 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (987 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1481 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (579 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (662 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (698 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1956 > 512). Running this sequence through the model will result in indexing errors
/home/sourab/transformers/src/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/home/sourab/transformers/src/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
[2023-08-29 13:15:23,022] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[2023-08-29 13:15:23,593] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
Using /raid/sourab/.cache/huggingface/torch_extensions/py310_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /raid/sourab/.cache/huggingface/torch_extensions/py310_cu118/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.2999038696289062 seconds
Using /raid/sourab/.cache/huggingface/torch_extensions/py310_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /raid/sourab/.cache/huggingface/torch_extensions/py310_cu118/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.302565336227417 seconds
Parameter Offload: Total persistent parameters: 1070080 in 258 params
  0%|                                                                                                   | 0/18 [00:00<?, ?it/s]You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
 11%|██████████                                                                                 | 2/18 [00:09<01:16,  4.78s/it]/home/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py:1252: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
/home/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py:1252: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
{'train_runtime': 80.1289, 'train_samples_per_second': 1.797, 'train_steps_per_second': 0.225, 'train_loss': 2.0953504774305554, 'epoch': 3.0}
100%|██████████████████████████████████████████████████████████████████████████████████████████| 18/18 [01:20<00:00,  4.45s/it]
[2023-08-29 13:17:03,208] [INFO] [launch.py:347:main] Process 3032445 exits successfully.
[2023-08-29 13:17:04,210] [INFO] [launch.py:347:main] Process 3032446 exits successfully.

Therefore, it is working fine and as expected.

I have! I have used all the configs as reported in the issue and I still get the problem. However, the deadlock I see happens near evaluation time or near the end of a training epoch, not when it starts up. As with @pacman100, I can start the training; it just won’t complete. So I am trying to get us to the same reproducible case, meaning we are both using half precision (because I only have 40G cards, not the 80G ones you folks have) and it runs all the way to the end.
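
As an aside, an equivalent way to get the model into half precision at load time (a sketch, not what tune.py above does; the script casts with .half() after loading instead) would be:

import torch
from transformers import AutoModelForCausalLM

# Load the checkpoint directly in fp16 rather than casting afterwards; with
# ZeRO-3 the fp16 setting can also come from the DeepSpeed config or from
# fp16=True in TrainingArguments instead of any manual cast.
model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/RedPajama-INCITE-Base-3B-v1",
    torch_dtype=torch.float16,
)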