transformers: PyTorch Lightning examples don't work on multiple GPUs with backend=dp

šŸ› Bug

Information

Model I am using (Bert, XLNet …): Bert

Language I am using the model on (English, Chinese …): English

The problem arises when using:

  • the official example scripts: run_pl.sh (run_pl_glue.py)

The task I am working on is:

  • an official GLUE/SQuAD task: GLUE

To reproduce

Steps to reproduce the behavior:

  1. Run the run_pl.sh script with multiple GPUs (e.g. 8 GPUs)

Expected behavior

GLUE training should run to completion

Environment info

  • transformers version: 2.8.0
  • Platform: Linux
  • Python version: 3.7
  • PyTorch version (GPU?): 1.4
  • Tensorflow version (GPU?):
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: DataParallel

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 28 (14 by maintainers)

Most upvoted comments

@leslyarun I am also facing a similar issue with the ddp backend (not exactly the same): github issue. My guess is that there is an issue with the callback and pickling the objects it saves. For the moment I will try to manually save checkpoints without using the callbacks.
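
A minimal sketch of what I mean (untested; it assumes the BaseTransformer module from examples/transformer_base.py keeps the pretrained model on model.model and the tokenizer on model.tokenizer, and that args has output_dir as in run_pl_glue.py):

import os

# Skip the ModelCheckpoint callback entirely and save the underlying
# Hugging Face objects directly with save_pretrained() after training.
trainer = generic_train(model, args)

save_dir = os.path.join(args.output_dir, "manual_checkpoint")  # hypothetical sub-folder
os.makedirs(save_dir, exist_ok=True)
model.model.save_pretrained(save_dir)      # the wrapped transformers model
model.tokenizer.save_pretrained(save_dir)  # the matching tokenizer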

I can confirm that the issue occurs only when using multiple GPUs with dp as the backend. Using ddp solves the issue.
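
In case it helps anyone, a rough sketch of the change (assuming the Trainer is built in generic_train in examples/transformer_base.py; the keyword and argument names follow the pytorch-lightning 0.7.x API and the example's own args, so treat them as assumptions):

import pytorch_lightning as pl

# Explicitly request DistributedDataParallel instead of the default DataParallel
# whenever more than one GPU is requested.
trainer = pl.Trainer(
    gpus=args.n_gpu,
    distributed_backend="ddp" if args.n_gpu > 1 else None,
    max_epochs=args.num_train_epochs,
)
trainer.fit(model)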

I found one more issue. If I use fast tokenizers with ddp as the backend, I get the error below:

INFO:lightning:GPU available: True, used: True
INFO:lightning:CUDA_VISIBLE_DEVICES: [0,1]
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/warnings.py:18: RuntimeWarning: You have defined a `val_dataloader()` and have defined a `validation_step()`, you may also want to define `validation_epoch_end()` for accumulating stats.
  warnings.warn(*args, **kwargs)
Traceback (most recent call last):
  File "run_pl_glue.py", line 187, in <module>
    trainer = generic_train(model, args)
  File "/home/jupyter/transformers/examples/transformer_base.py", line 310, in generic_train
    trainer.fit(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 734, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_processes, args=(model,))
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 162, in spawn
    process.start()
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/opt/conda/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/opt/conda/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/conda/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/opt/conda/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/conda/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle Tokenizer objects
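
My current workaround is to fall back to the slow tokenizer, since the Rust-backed fast Tokenizer object apparently cannot be pickled when the ddp backend spawns worker processes. A rough sketch (whether use_fast is accepted here in transformers 2.8.0 is an assumption on my part):

from transformers import AutoTokenizer

# The pure-Python (slow) tokenizer is picklable, so mp.spawn can serialize the
# LightningModule that holds it; the fast Tokenizer object is not.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)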

Thanks @sshleifer. We're fine using ddp for everything; we only need one version to work, not multiple ways to do the same thing. Also, according to the docs, ddp is the only backend that works with FP16 anyway (have not tested yet, will do soon). https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html
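
For reference, this is roughly what we plan to try for 16-bit once we get to it (untested; flag names follow the pytorch-lightning 0.7.x docs, which at that version still rely on NVIDIA apex for mixed precision):

import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=8,
    distributed_backend="ddp",  # per the linked docs, the backend that supports fp16
    precision=16,               # mixed precision; needs apex installed on PL 0.7.x
    amp_level="O1",
)
trainer.fit(model)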

I'm working off of transformers from GitHub… so it should be a recent version. If that's not what you are saying, could you please be more specific?

We also don't necessarily "need" Lightning… but it would be great if it worked (with a single set of settings) for multi-GPU. As it is, it's great having reasonable out-of-the-box options for the LR schedule, model synchronization, gradient accumulation, and all those other things I've grown tired of implementing for every project.

I am also facing this error, but with a different custom model. My code works properly on a single GPU; however, if I increase the number of GPUs to 2, it gives me the above error. I checked both PL 0.7.3 and 0.7.4rc3.

Update: Interestingly, when I changed distributed_backend to ddp, it worked perfectly without any error. I think there is an issue with the dp distributed_backend.

@williamFalcon Thanks. I'm running the code as per the instructions given in https://github.com/huggingface/transformers/tree/master/examples/glue. I didn't make any changes; I just ran the same official example script on multiple GPUs - https://github.com/huggingface/transformers/blob/master/examples/glue/run_pl.sh
It works on CPU and on a single GPU, but doesn't work on multiple GPUs.

I get the below error:

Validation sanity check:   0%|                                                                                                                | 0/5 [00:00<?, ?it/s]Traceback (most recent call last):
  File "run_pl_glue.py", line 186, in <module>
    trainer = generic_train(model, args)
  File "/home/jupyter/transformers/examples/transformer_base.py", line 307, in generic_train
    trainer.fit(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 701, in fit
    self.dp_train(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 540, in dp_train
    self.run_pretrain_routine(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 843, in run_pretrain_routine
    False)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 262, in _evaluate
    output = self.evaluation_forward(model, batch, batch_idx, dataloader_idx, test_mode)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 430, in evaluation_forward
    output = model(*args)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/overrides/data_parallel.py", line 66, in forward
    return self.gather(outputs, self.output_device)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 165, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    for k in out))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in <genexpr>
    for k in out))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 54, in forward
    assert all(map(lambda i: i.is_cuda, inputs))
AssertionError

@nateraw @williamFalcon
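
Looking at the trace, my guess (not verified) is that with dp Lightning gathers the dict returned by validation_step across GPUs, and the numpy arrays / CPU tensors that the GLUE example puts into that dict trip the all(... i.is_cuda ...) assert in Gather.apply. A hypothetical rework of the example's validation_step that keeps everything on the GPU until validation_epoch_end might avoid it:

def validation_step(self, batch, batch_idx):
    # Sketch only (method of the example's LightningModule): same inputs as
    # run_pl_glue.py, but the returned values stay on the GPU so that
    # DataParallel's gather() can collect them across devices.
    inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]}
    outputs = self(**inputs)
    tmp_eval_loss, logits = outputs[:2]
    return {
        "val_loss": tmp_eval_loss.detach(),    # keep on device, no .cpu()
        "pred": logits.detach(),               # convert to numpy later, in *_epoch_end
        "target": inputs["labels"].detach(),
    }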