transformers: DDP + gloo + gpt2 crashes

System Info

  • transformers version: 4.27.4
  • Platform: macOS-12.6-arm64-arm-64bit (also tested on Ubuntu)
  • Python version: 3.10.9
  • Huggingface_hub version: 0.13.3
  • PyTorch version (GPU?): 1.13.1 (False) (also tested on older torch versions)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: yes, see script

Who can help?

@ArthurZucker @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
import transformers
import torch.multiprocessing as mp
import os

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

    # initialize the process group
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def demo_basic(rank, world_size):
    print(f"Running basic DDP example on rank {rank}.")
    setup(rank, world_size)

    # create the model on CPU and wrap it with DDP (gloo backend, no GPU)
    gpt2 = transformers.AutoModelForCausalLM.from_pretrained('gpt2')
    module = DistributedDataParallel(gpt2)

    cleanup()

def run_demo(demo_fn, world_size):
    mp.spawn(demo_fn,
             args=(world_size,),
             nprocs=world_size,
             join=True)

if __name__ == '__main__':
    world_size = 2
    run_demo(demo_basic, world_size)

gives

Running basic DDP example on rank 1.
Running basic DDP example on rank 0.
NOTE: Redirects are currently not supported in Windows or MacOs.
NOTE: Redirects are currently not supported in Windows or MacOs.
Traceback (most recent call last):
  File "/Users/danielking/github/composer/scripts/gpt2-dist.py", line 36, in <module>
    run_demo(demo_basic, world_size)
  File "/Users/danielking/github/composer/scripts/gpt2-dist.py", line 29, in run_demo
    mp.spawn(demo_fn,
  File "/Users/danielking/miniconda3/envs/composer-dev-3.10/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/Users/danielking/miniconda3/envs/composer-dev-3.10/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/Users/danielking/miniconda3/envs/composer-dev-3.10/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/Users/danielking/miniconda3/envs/composer-dev-3.10/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/Users/danielking/github/composer/scripts/gpt2-dist.py", line 24, in demo_basic
    module = DistributedDataParallel(gpt2)
  File "/Users/danielking/miniconda3/envs/composer-dev-3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 657, in __init__
    _sync_module_states(
  File "/Users/danielking/miniconda3/envs/composer-dev-3.10/lib/python3.10/site-packages/torch/distributed/utils.py", line 136, in _sync_module_states
    _sync_params_and_buffers(
  File "/Users/danielking/miniconda3/envs/composer-dev-3.10/lib/python3.10/site-packages/torch/distributed/utils.py", line 154, in _sync_params_and_buffers
    dist._broadcast_coalesced(
RuntimeError: Invalid scalar type

It looks like the attention bias buffer changed from torch.uint8 in transformers 4.26.1 to torch.bool in transformers 4.27.x. I'm not sure whether I'm doing something wrong, whether torch has a bug, or whether transformers does. I don't use the gloo backend much and only discovered this error in our unit tests when upgrading the transformers version. Thanks for your help!
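
To confirm which buffers trip gloo up, here is a quick sketch (my own check, not part of the repro above) that lists the model's bool buffers; on 4.27.x I'd expect the attention .bias buffers to show up here, while on 4.26.1 they were torch.uint8:

import torch
import transformers

gpt2 = transformers.AutoModelForCausalLM.from_pretrained("gpt2")
# Print every buffer whose dtype gloo cannot broadcast (bool is not a supported scalar type).
for name, buf in gpt2.named_buffers():
    if buf.dtype == torch.bool:
        print(name, buf.dtype)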

Expected behavior

Wrapping gpt2 in DistributedDataParallel works on CPU with the gloo backend.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 15 (6 by maintainers)

Most upvoted comments

Okay there’s a hack you can do:

import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
import transformers
import torch.multiprocessing as mp
import os
import torch

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

    # initialize the process group
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def demo_basic(rank, world_size):
    print(f"Running basic DDP example on rank {rank}.")
    setup(rank, world_size)

    # create the model on CPU and wrap it with DDP (gloo backend, no GPU)
    gpt2 = transformers.AutoModelForCausalLM.from_pretrained("gpt2")
    # This is the trick: tell DDP to ignore every torch.bool buffer,
    # because the gloo backend cannot broadcast bool tensors.
    gpt2._ddp_params_and_buffers_to_ignore = [
        name for name, buffer in gpt2.named_buffers() if buffer.dtype == torch.bool
    ]
    module = DistributedDataParallel(gpt2)

    cleanup()

def run_demo(demo_fn, world_size):
    mp.spawn(demo_fn,
             args=(world_size,),
             nprocs=world_size,
             join=True)

if __name__ == '__main__':
    world_size = 2
    run_demo(demo_basic, world_size)

Since you don’t need to sync those buffers, this should work for you, though the best fix would be for the gloo backend to support bool tensors.
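
Alternatively, assuming your torch version exposes DDP's private _set_params_and_buffers_to_ignore_for_model helper, the same attribute can be set through it. A minimal self-contained sketch of the same idea (single-process gloo group just to make it runnable; port 12356 is arbitrary):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
import transformers

# Single-process gloo group, only so the DDP wrap below can run stand-alone.
os.environ.setdefault('MASTER_ADDR', 'localhost')
os.environ.setdefault('MASTER_PORT', '12356')
dist.init_process_group("gloo", rank=0, world_size=1)

gpt2 = transformers.AutoModelForCausalLM.from_pretrained("gpt2")
# Same trick as above, but via DDP's private helper instead of setting the
# attribute by hand; it records the buffer names DDP should skip when
# broadcasting module states.
DistributedDataParallel._set_params_and_buffers_to_ignore_for_model(
    gpt2,
    [name for name, buffer in gpt2.named_buffers() if buffer.dtype == torch.bool],
)
module = DistributedDataParallel(gpt2)

dist.destroy_process_group()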

I believe this remains an issue.