transformers: DDP + gloo + gpt2 crashes
System Info
- transformers version: 4.27.4
- Platform: macOS-12.6-arm64-arm-64bit (also have tested on ubuntu)
- Python version: 3.10.9
- Huggingface_hub version: 0.13.3
- PyTorch version (GPU?): 1.13.1 (False) (also have tested on older torch versions)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: yes, see script
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
import os

import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel

import transformers


def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

    # initialize the process group
    dist.init_process_group("gloo", rank=rank, world_size=world_size)


def cleanup():
    dist.destroy_process_group()


def demo_basic(rank, world_size):
    print(f"Running basic DDP example on rank {rank}.")
    setup(rank, world_size)

    # create the model and wrap it with DDP (CPU only, gloo backend)
    gpt2 = transformers.AutoModelForCausalLM.from_pretrained('gpt2')
    module = DistributedDataParallel(gpt2)

    cleanup()


def run_demo(demo_fn, world_size):
    mp.spawn(demo_fn,
             args=(world_size,),
             nprocs=world_size,
             join=True)


if __name__ == '__main__':
    world_size = 2
    run_demo(demo_basic, world_size)
Running this gives:
Running basic DDP example on rank 1.
Running basic DDP example on rank 0.
NOTE: Redirects are currently not supported in Windows or MacOs.
NOTE: Redirects are currently not supported in Windows or MacOs.
Traceback (most recent call last):
File "/Users/danielking/github/composer/scripts/gpt2-dist.py", line 36, in <module>
run_demo(demo_basic, world_size)
File "/Users/danielking/github/composer/scripts/gpt2-dist.py", line 29, in run_demo
mp.spawn(demo_fn,
File "/Users/danielking/miniconda3/envs/composer-dev-3.10/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/Users/danielking/miniconda3/envs/composer-dev-3.10/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/Users/danielking/miniconda3/envs/composer-dev-3.10/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/Users/danielking/miniconda3/envs/composer-dev-3.10/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/Users/danielking/github/composer/scripts/gpt2-dist.py", line 24, in demo_basic
module = DistributedDataParallel(gpt2)
File "/Users/danielking/miniconda3/envs/composer-dev-3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 657, in __init__
_sync_module_states(
File "/Users/danielking/miniconda3/envs/composer-dev-3.10/lib/python3.10/site-packages/torch/distributed/utils.py", line 136, in _sync_module_states
_sync_params_and_buffers(
File "/Users/danielking/miniconda3/envs/composer-dev-3.10/lib/python3.10/site-packages/torch/distributed/utils.py", line 154, in _sync_params_and_buffers
dist._broadcast_coalesced(
RuntimeError: Invalid scalar type
It looks like the attention bias buffer was changed from torch.uint8 in transformers version 4.26.1 to torch.bool in transformers version 4.27.x. I'm not sure whether I'm doing something wrong, whether torch has a bug, or whether transformers has a bug. I don't use the gloo backend much and only discovered this error in our unit tests when upgrading the transformers version. Thanks for your help!
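A quick way to confirm which buffers are involved (a diagnostic sketch; the exact buffer names depend on the model, but with gpt2 the attention bias masks are the ones that changed dtype):

import torch
import transformers

gpt2 = transformers.AutoModelForCausalLM.from_pretrained('gpt2')

# Print every registered buffer with its dtype. With transformers 4.27.x the
# causal attention-mask buffers report torch.bool, which is what DDP's
# _sync_module_states ends up handing to gloo's coalesced broadcast.
for name, buf in gpt2.named_buffers():
    print(name, buf.dtype, tuple(buf.shape))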
Expected behavior
Wrapping gpt2 with DistributedDataParallel on CPU (gloo backend) works.
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 15 (6 by maintainers)
Okay, there’s a hack you can do:
Since you don’t need to sync those buffers, it should work for you, though the best fix would be to support bool in the gloo backend.
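A sketch of what such a hack could look like (an assumption, not the original snippet): tell DDP to skip the torch.bool buffers during its initial sync, using PyTorch's private _set_params_and_buffers_to_ignore_for_model helper, so gloo never has to broadcast a bool tensor. This would go inside demo_basic from the reproduction above, in place of the plain DistributedDataParallel(gpt2) call:

import torch
from torch.nn.parallel import DistributedDataParallel
import transformers

gpt2 = transformers.AutoModelForCausalLM.from_pretrained('gpt2')

# Fully qualified names of all bool buffers (the causal attention masks in 4.27.x).
bool_buffers = [name for name, buf in gpt2.named_buffers() if buf.dtype == torch.bool]

# Private DDP helper: records the names on the module so DDP's __init__
# excludes them from the initial parameter/buffer broadcast.
DistributedDataParallel._set_params_and_buffers_to_ignore_for_model(gpt2, bool_buffers)

module = DistributedDataParallel(gpt2)  # no longer broadcasts bool tensors over gloo

Since every rank builds those mask buffers identically when loading the model, skipping the sync should be harmless, which is what the comment above is getting at.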
I believe this remains an issue