transformers: Error using DataParallel with reformer model: There were no tensor arguments to this function
🐛 Bug
Information
I'm having some issues using DataParallel with the Reformer model on 4 GPUs. I am trying to feed ReformerModel input embeddings and get back the last hidden state. I am using apex amp, but I get the same error without amp. I also get the same error when I pass input IDs rather than embeddings, and the same script runs fine with other HuggingFace models (BERT and RoBERTa).
To reproduce
Simple code:

```python
import torch
from apex import amp
import transformers
from transformers import ReformerModel
from torch.utils.data import TensorDataset, DataLoader
import torch.nn as nn

print(transformers.__version__)
print(torch.__version__)

device = torch.device("cuda:0")
batch_size = 4

model_rf = ReformerModel.from_pretrained('google/reformer-crime-and-punishment')
model_rf.to(device)

opt_rf = torch.optim.AdamW(model_rf.parameters(), lr=0.0002)
model_rf, opt_rf = amp.initialize(model_rf, opt_rf)
model_rf = nn.DataParallel(model_rf)

embeds = torch.randn(80, 64, 256)
training_set = TensorDataset(embeds, embeds)
training_generator = DataLoader(training_set, batch_size=batch_size, shuffle=True)

for i, batch in enumerate(training_generator):
    embeds, _ = batch
    h_final = model_rf(inputs_embeds=embeds.to(device))
```
And the error:

```
Traceback (most recent call last):
  File "rf_4.py", line 35, in <module>
    h_final = model_rf(inputs_embeds=embeds)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/usr/local/lib/python3.6/dist-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_reformer.py", line 1621, in forward
    embedding_output = self.embeddings(input_ids=input_ids, position_ids=position_ids, inputs_embeds=inputs_embeds)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_reformer.py", line 234, in forward
    position_embeddings = self.position_embeddings(position_ids)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_reformer.py", line 170, in forward
    [weight[:, :required_pos_encodings_columns] for weight in broadcasted_weights], dim=-1
  File "/usr/local/lib/python3.6/dist-packages/apex/amp/wrap.py", line 81, in wrapper
    return orig_fn(seq, *args, **kwargs)
RuntimeError: There were no tensor arguments to this function (e.g., you passed an empty list of Tensors), but no fallback function is registered for schema aten::_cat. This usually means that this function requires a non-empty list of Tensors. Available functions are [CPUTensorId, CUDATensorId, QuantizedCPUTensorId, VariableTensorId]
```
Expected behavior
The forward pass should return the last hidden state without error; instead the script raises the RuntimeError above at the `h_final = model_rf(...)` line.
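For comparison, here is a minimal single-device sketch (no DataParallel, no amp) that I would expect to run without the `aten::_cat` error; this control is an illustration added for clarity, not part of the original report:

```python
import torch
from transformers import ReformerModel

# Single-device control: same model and embedding shape, but no nn.DataParallel
# and no amp, so only one replica handles the forward pass.
device = torch.device("cuda:0")
model_rf = ReformerModel.from_pretrained('google/reformer-crime-and-punishment')
model_rf.to(device)

embeds = torch.randn(4, 64, 256, device=device)  # (batch, seq_len, hidden_size)
h_final = model_rf(inputs_embeds=embeds)
print(h_final[0].shape)  # last hidden state
```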
Environment info
- `transformers` version: 3.0.2
- Platform: Ubuntu 18.04
- Python version: 3.6.9
- PyTorch version (GPU?): 1.5.1
- Tensorflow version (GPU?): no
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: Yes, 4 GPUs
Comments
I face the same error when using multiple GPUs with the Reformer model:
@jstremme Yes, I’m sure that torch 1.4.0 with multiple GPUs worked for Reformer training. AFAICT it’s not impacting anything else.
In my case there is no error with torch 1.4.0, but I got a warning:
Found a relevant issue: https://github.com/huggingface/transformers/issues/852 and a related discussion: https://discuss.pytorch.org/t/how-to-fix-gathering-dim-0-warning-in-multi-gpu-dataparallel-setting/41733/2
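If the warning is the usual DataParallel "gather along dimension 0" scalar warning that the linked discussion covers, the common pattern is to reduce the gathered per-replica losses with `.mean()` before calling `backward()`. A minimal toy sketch (not the Reformer, and purely illustrative):

```python
import torch
import torch.nn as nn

# Toy setup to illustrate the pattern: when each replica returns a 0-dim loss,
# DataParallel gathers them into a 1-D tensor of shape (num_gpus,); reduce it
# with .mean() before calling backward().
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(256, 1)

    def forward(self, x, labels):
        pred = self.linear(x).squeeze(-1)
        return nn.functional.mse_loss(pred, labels)  # 0-dim per replica

device = torch.device("cuda:0")
model = nn.DataParallel(ToyModel().to(device))

x = torch.randn(8, 256, device=device)
labels = torch.randn(8, device=device)

loss = model(x, labels)  # 1-D after DataParallel's gather across replicas
loss = loss.mean()       # reduce across replicas
loss.backward()
```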
Was this issue ever solved? I have managed to use multiple GPUs for Reformer training by downgrading to PyTorch 1.4.0 and transformers 3.0.2. However, I would like not to be constrained to this version setup, both because it leads to some inefficiencies (functions whose arguments have changed in newer versions, etc.) and because I'd like to stay up to date.
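For reference, a minimal sketch of a runtime guard for the pinned combination (torch 1.4.0 + transformers 3.0.2); the assert-based check below is illustrative only, not something from this thread:

```python
# Hypothetical guard: verify the version combination reported to work for
# multi-GPU Reformer training before wrapping the model in nn.DataParallel.
import torch
import transformers

assert torch.__version__.startswith("1.4"), (
    "Reformer + nn.DataParallel reportedly fails on torch 1.5.1; "
    "pin torch==1.4.0 as a workaround"
)
assert transformers.__version__ == "3.0.2", "pin transformers==3.0.2"
```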
@jstremme I have these installed:

```
Package          Version
argh             0.26.2
certifi          2020.6.20
chardet          3.0.4
click            7.1.2
configparser     5.0.0
docker-pycreds   0.4.0
filelock         3.0.12
gitdb            4.0.5
GitPython        3.1.7
gql              0.2.0
graphql-core     1.1
idna             2.10
joblib           0.16.0
numpy            1.18.4
nvidia-ml-py3    7.352.0
packaging        20.4
pathtools        0.1.2
pip              19.1.1
promise          2.3
psutil           5.7.0
pyparsing        2.4.7
python-dateutil  2.8.1
PyYAML           5.3.1
regex            2019.11.1
requests         2.24.0
sacremoses       0.0.43
sentencepiece    0.1.90
sentry-sdk       0.16.1
setuptools       41.0.1
shortuuid        1.0.1
six              1.15.0
smmap            3.0.4
subprocess32     3.5.3
tokenizers       0.8.1rc1
torch            1.4.0
tqdm             4.47.0
transformers     3.0.2
urllib3          1.25.9
wandb            0.9.3
watchdog         0.9.0
wheel            0.33.4
```