transformers: PyTorch Lightning examples don't work on multiple GPUs with backend=dp
🐛 Bug
Information
Model I am using (Bert, XLNet …): Bert
Language I am using the model on (English, Chinese …): English
The problem arises when using:
- the official example scripts: run_pl.sh (run_pl_glue.py)
 
The task I am working on is:
- an official GLUE/SQuAD task: GLUE
 
To reproduce
Steps to reproduce the behavior:
- run the run_pl.sh script with multiple GPUs (e.g. 8 GPUs); see the configuration sketch below
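For context, the setup is roughly equivalent to the following PyTorch Lightning 0.7.x configuration; the toy module here is only a stand-in for the GLUE module that run_pl_glue.py builds, not the actual script:

```python
# Minimal sketch (PyTorch Lightning 0.7.x API). The toy module below only stands
# in for the GLUE LightningModule created by run_pl_glue.py.
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self(x), y)
        return {"loss": loss}

    def train_dataloader(self):
        x = torch.randn(64, 32)
        y = torch.randint(0, 2, (64,))
        return DataLoader(TensorDataset(x, y), batch_size=8)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    trainer = pl.Trainer(
        gpus=8,                    # multiple GPUs, e.g. 8
        distributed_backend="dp",  # DataParallel: the backend that fails here
        max_epochs=1,
    )
    trainer.fit(ToyModule())
```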
 
Expected behavior
GLUE training should run on multiple GPUs just as it does on a single GPU.
Environment info
- `transformers` version: 2.8.0
- Platform: Linux
- Python version: 3.7
- PyTorch version (GPU?): 1.4
- Tensorflow version (GPU?):
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: DataParallel
 
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 28 (14 by maintainers)
 
@leslyarun I am also facing a similar issue with the ddp backend (not exactly the same): github issue. My guess is that there may be an issue with the callback and with pickling the saved objects. For now I will try to save checkpoints manually, without using the callbacks.
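Roughly what I have in mind; the import and checkpoint path are placeholders, and I'm assuming `trainer.save_checkpoint` is available in this Lightning version:

```python
import pytorch_lightning as pl

# Placeholder import: MyLightningModule stands in for the GLUE module from the script.
from my_project.model import MyLightningModule

model = MyLightningModule()

# Disable the default ModelCheckpoint callback and save manually after training.
# Assumes Trainer.save_checkpoint exists in this Lightning release.
trainer = pl.Trainer(gpus=8, distributed_backend="ddp", checkpoint_callback=False)
trainer.fit(model)
trainer.save_checkpoint("manual_checkpoint.ckpt")  # placeholder path
```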
I can confirm that the issue occurs only when using multiple GPUs with `dp` as the backend. Using `ddp` solves the issue.
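For reference, the only change needed on my side was the backend argument of the Trainer (argument name as in the 0.7.x API):

```python
import pytorch_lightning as pl

# Same script otherwise; only the backend argument changes.
trainer = pl.Trainer(gpus=8, distributed_backend="ddp")  # instead of "dp"
```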
I found one more issue: if I use fast tokenizers with `ddp` as the backend, I get the error below:
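As a stopgap I'm falling back to the slow (Python) tokenizer rather than the fast one; roughly the following, where the model name is just a placeholder for whatever the script passes:

```python
from transformers import AutoTokenizer

# Stopgap on my side, not a fix: load the slow Python tokenizer instead of the
# fast one. "bert-base-cased" is a placeholder model name.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)
```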
Thanks @sshleifer. We're fine using `ddp` for everything; we only need one version to work, not multiple ways to do the same thing. Also, according to the docs, `ddp` is the only backend that works with FP16 anyway (have not tested yet, will do soon): https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html
I'm working off of `transformers` from GitHub, so it should be a recent version. If that's not what you are saying, could you please be more specific? We also don't necessarily "need" Lightning, but it would be great if it worked (with a single set of settings) for multi-GPU, as it is great having reasonable out-of-the-box options for LR schedules, model synchronization, gradient accumulation, and all the other things I've grown tired of implementing for every project.
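Roughly the combination we'd want to try, untested on our side and assuming the 0.7.x argument names (I believe `precision=16` goes through NVIDIA Apex in this release, so Apex would need to be installed):

```python
import pytorch_lightning as pl

# Untested sketch of the FP16 + ddp combination mentioned above.
# Assumes precision=16 routes through NVIDIA Apex in this Lightning release.
trainer = pl.Trainer(
    gpus=8,
    distributed_backend="ddp",
    precision=16,
    amp_level="O1",
)
```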
I am also facing this error, but with a different custom model. My code works properly on a single GPU; however, if I increase the number of GPUs to 2, it gives me the above error. I checked both PL 0.7.3 and 0.7.4rc3.
Update: Interestingly, when I changed `distributed_backend` to `ddp`, it worked perfectly without any error. I think there is an issue with the `dp` distributed_backend.
@williamFalcon Thanks. I'm running the code as per the instructions given in https://github.com/huggingface/transformers/tree/master/examples/glue. I didn't make any changes; I just ran the same official example script on multiple GPUs: https://github.com/huggingface/transformers/blob/master/examples/glue/run_pl.sh
It works on CPU and on a single GPU, but doesn't work with multiple GPUs.
I get the below error:
@nateraw @williamFalcon